Dealing with uncertainty is hard. In most cases, people try to dismiss it, either by claiming they know the answer when they don’t, or the complete opposite, claiming nobody knows anything and no decision can be taken, ignoring available evidence. Clearly there is a wide range of options in between, and we need efficient ways to navigate it.

I want to focus here on a particular technique for doing so: thinking in probabilities. That is to say, assigning a probability to all events, even those that do not involve any randomness, to quantify the amount of uncertainty around a possible event. Getting used to this frame of mind takes a bit of work, but it is well worth the effort. To train yourself to compute these probabilities naturally, consider the following scenario. You have just taught an introductory class on astronomy and wish to measure the understanding of your audience. You have certainly already encountered multiple-choice questions, which force students to give an answer regardless of whether or not they really know it, and, more importantly, regardless of how well they understood the underlying idea. These are exactly the drawbacks we are trying to avoid.

In this series of posts, we will instead work through a probabilistic question-answering framework, which incentivizes users to think in probability distributions rather than deterministically, and measures their true level of understanding much more accurately. This scenario generalizes well to more realistic settings where you are in a way both teacher and student, or where the answer is unknown. The focus is on how to accurately estimate your levels of uncertainty, why you should do so, and how to tell good estimated distributions from bad ones. With these key points worked out, you will be able to understand and accept your uncertainty, and communicate it effectively to others.

Deterministic answering

Before building a new framework for our surveys, it is important to understand why such a change is necessary to analyze the students’ uncertainty. To do so, we will dig into the drawbacks of the original framework through a simple example with five students. For concreteness, consider the following multiple-choice question on astronomy:

Which of these planets is closest to our solar system’s sun?

(1) Earth, (2) Mars, (3) Mercury, (4) Venus

You get the answers (Mars, Mercury, Mercury, Venus, Mercury). How much information does this carry about your audience? There is a majority of correct answers and nobody answered “Earth”, so your message apparently got through to some extent, and it is clear to everybody that Earth is not the closest planet to the sun. Or is it really? How do you know that the “Mars” answer is not somebody totally clueless who answered at random, and might as well have answered “Earth”? How many of the “Mercury” answers are people confident that this is the correct answer, and how many had a vague idea and just got lucky?

It might seem that these problems come only from the low sample size, and would vanish as the number of answers increases. To show why this is not true, assume that we get the proportion of answers depicted in the figure below. Because there are many more answers, it is tempting to aggregate them into proportions as I have done in the figure, and stop thinking of each as an individual answer. This aggregation might appear to make the picture much simpler, with an abstract “whole class” entity showing 60% confidence that Mercury is the closest planet to the sun. This “average-student” picture is incorrect.

Fig 1. Distribution of deterministic answers of students

This idea of working with a probability distribution over answers is very interesting, but this proportion of answers in the entire audience is not the distribution you are looking for. Let us dig into the reasons for that.

First, the global picture is wrong, because you only get the “maximally believed” answer of each person, but no information on how strong this belief is. For instance, the 5% of “Earth” answers could come from the following two vastly different yet indistinguishable scenarios. On one hand, it might be that 20% of your audience has no idea what you are talking about, did not understand the question, or is feeling stressed out at the moment, and answers uniformly at random. Approximately a quarter of those people will answer “Earth”. On the other hand, there may be an odd Earth-centric cult in your area, whose adherents are deeply convinced that Earth is the closest planet to the sun, and confidently state so in your poll. You just happen to have 5% of your audience belonging to this cult. There is no way to distinguish these two cases, and this is clearly a problem if you want to adapt to your audience.
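To make this indistinguishability concrete, here is a quick simulation sketch; the audience size of 10,000 and the simplifying assumption that everyone else answers “Mercury” are invented purely for illustration.

```python
import random

random.seed(0)
answers = ["Earth", "Mars", "Mercury", "Venus"]
n = 10_000  # hypothetical audience size, chosen only for illustration

# Scenario 1: 20% of the audience answers uniformly at random;
# everybody else (assumed here to answer "Mercury") never picks Earth.
scenario_1 = [random.choice(answers) if random.random() < 0.20 else "Mercury"
              for _ in range(n)]

# Scenario 2: a 5% Earth-centric minority confidently answers "Earth";
# everybody else (again assumed to answer "Mercury") never does.
scenario_2 = ["Earth" if random.random() < 0.05 else "Mercury"
              for _ in range(n)]

for name, sample in [("random guessers", scenario_1), ("Earth cult", scenario_2)]:
    share = sum(a == "Earth" for a in sample) / n
    print(f"{name}: {share:.1%} of answers are 'Earth'")
# Both scenarios yield roughly 5% "Earth" answers: the aggregated
# proportion alone cannot tell them apart.
```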

Secondly, the individual picture is wrong. Merging deterministic answers into this aggregate distribution prevents you from appreciating the diversity of answers, and from drawing conclusions about individuals. Think of it as trying to assign a grade to each person: how will you grade differently the various people who picked the correct answer? Does the hard-working student who confidently identifies the correct answer get the same grade as the vaguely-convinced student selecting the same answer? How many of the correct answers come from each of these two groups, and what does that 60% of correct answers really mean? Quantizing the answers to correct/incorrect leaves you unable to say anything useful about a single student, precisely because each answer carries so little information about that student’s understanding. This is what motivates the aggregation of answers in the first place: each individual answer is essentially useless, and the only reasonable grading strategy is to ask many more questions and hope that luck eventually averages out. These two drawbacks are very similar: too much information has been discarded by collecting a single answer per student, and we need a way to ask questions that does not discard this valuable information.

If this example feels too contrived to be taken seriously, think of your favorite voting poll and read this section again with those questions in mind instead. What does this 60% approval rate really mean after all? I am highly confident that you have already encountered this issue numerous times during the past few years, even without identifying it as such at the time. Once you start thinking about it, you will soon notice this problem almost everywhere.

Communicating credences

Let’s go back to the idea of probability distributions over answers. We have seen that we cannot get an “average-student” picture by aggregating the answers of the whole class. There is no imaginary student, representative of the whole class, who is 60% confident that Mercury is the correct answer. Now, what if we could access the true credence of every student, their individual “level of belief” in each answer? Let’s assume that we have access to these credences, and see what we can do with them.

Fig 2. Distribution of true credences among students

This data gives a much clearer picture of the beliefs of your audience, even with only a few students. First, the three correct answers were fairly confident ones: those students largely understood and only need to build up a little more confidence. The two wrong answers, however, are of very different kinds. Diego has most likely properly isolated Mercury and Venus as the two closest planets, but is confused about their relative ordering, as shown by the very small gap between the two credences. On the contrary, Alice has misidentified Mars as the answer with a worrying margin.

If the ability to treat students individually were not already enough to convince you, imagine the benefits of this information collected at a larger scale, and the “misunderstanding patterns” it could help uncover, like the Mercury-Venus confusion which you might have otherwise missed. These are insights of a whole different scale than the ones given by the initial deterministic quiz.

It remains that we do not, in general, have access to these credences (they might not even be clear in the student’s mind!). Can we incentivize students to communicate their true credences in a survey? In other words, how do we grade probabilistic answers so that students get a higher grade both for understanding well and for telling the truth?

Proper scoring

Simply asking students to report their credences in a survey is not sufficient to get accurate estimates. A student with true credence \(p\), asked to communicate it in a survey, might choose to instead report a credence \(q \neq p\). There is no need for ill intent here; laziness is a sufficient explanation. Estimating one’s true credence often requires a significant effort, and if there is no incentive to make that effort, it is safe to assume most people won’t.

We would thus like to design a scoring rule such that the maximum score is awarded when reporting one’s true credence. This would ensure that students willing to maximize their score report an accurate estimation of their true credence. But how exactly do we achieve that? Let \(X\) be the answer to a given question, and \(S(q, X)\) the score assigned to distribution \(q\) over answers for that question. At the time of answering, a student with credence \(p\) does not know the correct answer, so from their point of view the answer behaves like a random variable \(X \sim p\). This can be understood as follows: among the cases where the student is 80% confident that the answer is \(X = x\), answer \(x\) will indeed be correct 80% of the time, and this guess will be wrong 20% of the time. If the answer is drawn at random among these cases, then \(\mathbb{P}(X = x)\) is 80%, which corresponds to a random variable \(X \sim p\). Now the property that we want is clearer. For a fixed true credence \(p\) and an answer \(X \sim p\), we want a score \(S(q, X)\) that is maximized on average when \(q = p\), incentivizing honest communication of credences.

A scoring rule \(S\) is said to be proper if this is satisfied, i.e. \(p = \text{argmax}_q\ \mathbb{E}_{X \sim p} \left[ S(q, X) \right]\).
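As a concrete reading of this definition, here is a minimal sketch of the expected-score computation, assuming answers are indexed from 0 and the scoring rule is passed in as a function:

```python
from typing import Callable, Sequence

def expected_score(p: Sequence[float], q: Sequence[float],
                   score: Callable[[Sequence[float], int], float]) -> float:
    """E_{X ~ p}[S(q, X)] = sum_i p_i * S(q, i)."""
    return sum(p_i * score(q, i) for i, p_i in enumerate(p))

# A rule `score` is proper if, for every credence p, the honest report q = p
# maximizes expected_score(p, q, score) over all probability vectors q.
```

The next two sections apply exactly this check to two candidate rules.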

Linear scoring

Let’s start with a simple system for concreteness. What if, instead of awarding one point for a correct answer and zero for an incorrect one, we award points proportional to the reported credence, so 0.8 points when the correct answer was reported with 80% confidence?

Formalizing a bit, call \(p\) the probability distribution over answers corresponding to our true credence, and \(q\) the credence distribution that we report in the survey. For instance, Diego in our previous example has \(p = (.10, .15, .35, .40)\), with answers ordered (Earth, Mars, Mercury, Venus) and indexed from 0. The linear scoring rule is \(S(q, k) = q_k\). Now the answer turns out to be \(X = 2\) (Mercury), so if Diego was honest, he reported \(q = p\) and gets \(S(q, X) = q_2 = p_2 = 0.35\) points. But was that the best he could do? Remember that at the time of answering, Diego does not know the answer, so based on his current knowledge, he expects the answer to follow \(X \sim p\). His expected score is thus

\[ \mathbb{E}_{X \sim p} \left[ S(q, X) \right] = \sum_i p_i S(q, i) = \sum_i p_i q_i \]

This expected score is maximized by exaggerating your beliefs until all the weight is concentrated on your maximally-believed answer(s), which for Diego means reporting \(q = (0,0,0,1)\).

If you are comfortable with constrained optimization, this is proved by setting the gradient of the Lagrangian below to zero, for the constraints \(q_i \geq 0\) and \(\sum_i q_i = 1\), followed by applying complementary slackness to get \((\forall i,\ \mu_i q_i = 0)\) at the optimum. If not, you can still understand this result intuitively by noticing that moving an amount of weight \(\Delta q\) from answer \(i\) to answer \(j\) changes the expected score by \(\Delta q \times (p_j - p_i)\). You can therefore move all the weight to the answers with maximal value \(p_j = \max_k p_k\), and your expected score can only increase in doing so (you can distribute the weight however you want among answers tied for the maximum).

\[ \mathcal{L}(q, \lambda, \mu) = \sum_i p_i q_i + \lambda (1 - \sum_i q_i) + \sum_i \mu_i q_i \]

\[ \frac{\partial \mathcal{L}}{\partial q_i}(q, \lambda, \mu) = p_i - \lambda + \mu_i \]

\[ \frac{\partial \mathcal{L}}{\partial q_i} = 0 \Rightarrow \begin{cases} p_i = \lambda & \text{if} \ \mu_i = 0 \\ p_i \lt \lambda & \text{if}\ \mu_i \gt 0 \end{cases} \Rightarrow \lambda = \max_k p_k \]
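As a sanity check on this result, here is a small sketch comparing Diego’s honest report with the fully concentrated one under the linear rule, plus a brute-force search over random reports (the 10,000 samples are an arbitrary choice):

```python
import random

p = [0.10, 0.15, 0.35, 0.40]   # Diego's true credence (Earth, Mars, Mercury, Venus)

def expected_linear_score(p, q):
    # E_{X ~ p}[S(q, X)] with the linear rule S(q, k) = q_k
    return sum(p_i * q_i for p_i, q_i in zip(p, q))

print(expected_linear_score(p, p))             # honest report:       0.315
print(expected_linear_score(p, [0, 0, 0, 1]))  # concentrated report: 0.400 = max_k p_k

# No report on the simplex does better than concentrating on the maximum belief.
random.seed(0)
best = 0.0
for _ in range(10_000):
    w = [random.expovariate(1.0) for _ in p]   # random point on the simplex
    q = [x / sum(w) for x in w]
    best = max(best, expected_linear_score(p, q))
print(best)                                    # approaches but never exceeds 0.400
```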

For a fixed credence, a student’s best move when asked about their credence is to lie. This is a terrible property: it incentivizes students to exaggerate their credence when answering your survey, even if they have accurately approximated their true credence (assuming they even make that effort). On top of favoring deterministic thinking and polarization (which we’ll tackle later), it renders efforts to estimate the precious credences counterproductive. Clearly we cannot use this scoring rule and expect students to communicate their true credence.

Logarithmic scoring

Fortunately, there is a wide variety of proper scoring rules (i.e. rules that incentivize truth-telling). But since I’m not really interested in building a catalogue at this point, I will focus on one in particular for now: the logarithmic scoring rule.

\[ S(q, k) = \log q_k \]

\[ \mathbb{E}_{X \sim p} \left[ S(q, X) \right] = \sum_i p_i S(q, i) = \sum_i p_i \log q_i \]

Proof of properness

Let’s start by proving that it is indeed proper. Same technique as above: write down the Lagrangian with constraints \(q_i \geq 0\) and \(\sum_i q_i = 1\). (Watch out: we will minimize the negative expected score \(-\mathbb{E}_{X \sim p}[S(q,X)]\), which is convex, instead of maximizing the expected score.) We seek a Lagrangian optimal triplet \((q^*, \lambda^*, \mu^*)\), a solution to the saddle-point problem \[\ \inf_q \sup_{(\mu \geq 0, \lambda)} \mathcal{L}(q, \lambda, \mu) \]

The expression of the Lagrangian under simplex constraints is \[ \mathcal{L}(q, \lambda, \mu) = - \sum_i p_i \log q_i - \lambda \left(1 - \sum_i q_i\right) - \sum_i \mu_i q_i \]

Let us start with a small trick to avoid dealing with the boundary of the domain, where the logarithm diverges. Putting zero weight \(q_i = 0\) on an answer \(i\) with credence \(p_i \gt 0\) is clearly suboptimal: even the slightest risk of losing infinitely many points yields an expected score of minus infinity. Formally,

\[ \exists i,\ (p_i > 0,\ q_i = 0) \Rightarrow \mathcal{L}(q, \lambda, \mu) = +\infty \]

We can thus restrict our analysis to \(( \forall i,\ q_i^* \gt 0 )\), after removing the answers \(i\) where \(p_i = 0\) (if they carried non-zero weight, the expected score could only increase by moving that weight elsewhere). Complementary slackness at the optimum gives \(( \forall i,\ q_i^* \mu_i^* = 0 )\), thus \(( \forall i,\ \mu_i^* = 0 )\). It then suffices to set the gradient to zero to get

\[ \frac{\partial \mathcal{L}}{\partial q_i}(q, \lambda, \mu) = - \frac{p_i}{q_i} + \lambda - \mu_i \]

\[ \left( \forall i,\ \frac{\partial \mathcal{L}}{\partial q_i}\left(q^*, \lambda^*, \mu^* \right) = 0 \right) \Rightarrow \left( \forall i,\ p_i = \lambda^* q_i^* \right) \Rightarrow \left( p = q^* \right) \]

The last step, \(\lambda^* = 1\), is obtained by summing \(p_i = \lambda^* q_i^*\) over \(i\) and using \(\sum_i p_i = 1 = \sum_i q_i^*\). This proves that the optimum is attained at \(q^* = p\). The logarithmic scoring rule is thus proper: for a fixed credence \(p\), you get a maximum score when you tell the truth about your credence in the survey.
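If you prefer a numerical sanity check to the Lagrangian computation, the sketch below samples reported distributions uniformly at random on the simplex and verifies that none of them beats the honest report (Diego’s credence is reused as \(p\), and the 100,000 samples are an arbitrary choice):

```python
import math
import random

p = [0.10, 0.15, 0.35, 0.40]   # true credence, reusing Diego's numbers

def expected_log_score(p, q):
    # E_{X ~ p}[log q_X] = sum_i p_i * log(q_i)
    return sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))

honest = expected_log_score(p, p)

random.seed(0)
best = -math.inf
for _ in range(100_000):
    w = [random.expovariate(1.0) for _ in p]   # random point on the simplex
    q = [x / sum(w) for x in w]
    best = max(best, expected_log_score(p, q))

print(f"honest report : {honest:.4f}")
print(f"best random q : {best:.4f}")   # slightly below the honest report
```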

Differences with linear scoring

The truth-telling incentive is perhaps best understood visually. If we assume the credence \(p\) on a binary (yes/no) question is fixed, then the scores for the linear and logarithmic rules as a function of the reported credence \(q\) look like the following.

Fig 3. Distortion induced by linear scoring: lying on \(q_0\) gives a higher score

Here the left-most values represent the claim “answer A is certain”, and the right-most values correspond to the claim “answer B is certain”. The true credence \(p_0 = 0.8\) (and thus \(p_1 = 0.2\), since the two sum to one) means 80% confidence that B is correct, and the horizontal axis is the reported value \(q_0\). The squares mark the reported credences achieving the maximum expected score given this 80% confidence. The red square, for linear scoring, does not correspond to reporting \(q_0 = 0.8\): when scored linearly, there is an incentive to lie and overstate our true credence.

Another nice property of the logarithmic score is also visible in this plot: the score is relatively flat around its maximum, so it does not matter much if students are unable to estimate their credence very precisely. They will get very similar scores with a rough approximation; the important part is staying away from highly-polarized answers like \(q_0 = 0\) or \(q_0 = 1\).
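To reproduce the gist of this figure numerically, the sketch below scans reported values \(q_0\) on a grid for a fixed true credence \(p_0 = 0.8\) and reports where each rule peaks (the grid resolution is arbitrary):

```python
import math

p0 = 0.8   # true credence that answer B is correct (so p1 = 0.2)

def expected_linear(q0):
    return p0 * q0 + (1 - p0) * (1 - q0)

def expected_log(q0):
    return p0 * math.log(q0) + (1 - p0) * math.log(1 - q0)

grid = [i / 1000 for i in range(1, 1000)]   # exclude the endpoints, where the log diverges
print(max(grid, key=expected_linear))   # 0.999: the linear rule pushes the report to the extreme
print(max(grid, key=expected_log))      # 0.8:   the log rule peaks at the true credence
```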

Punishment of overconfidence

I think it’s worth taking some time to digest the properties of this logarithmic scoring rule. It has two important properties. First, it is local: when the answer is \(i\), your score depends only on \(q_i\), never on \(q_j\) for \(j \neq i\) (more on this property later). Secondly, it decreases logarithmically. When the answer is \(i\) and your confidence that this is the correct answer is \(q_i\), it awards a score of \(\log q_i\). Let’s assume the logarithm is in base 10 to get some easy numbers. Your score goes down one point every time you divide the probability \(q_i\) by ten, so you get \(-1\) point for \(q_i = 0.1\), then \(-2\) for \(q_i = 0.01\), \(-3\) for \(q_i = 0.001\), and so on.
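A quick check of these numbers, using the same base-10 convention:

```python
import math

for q in (0.5, 0.1, 0.01, 0.001):
    print(q, "->", round(math.log10(q), 1))
# 0.5 -> -0.3
# 0.1 -> -1.0
# 0.01 -> -2.0
# 0.001 -> -3.0
```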

That might be a bit counterintuitive: the score even goes to minus infinity when \(q_i = 0\), which is impossible to recover from! No matter how many questions you answer perfectly, a single such answer will put your score at minus infinity for good. And that is very much intended. To understand why it is desirable, it is useful to see it the other way around. When assigning probability \(q_i\) to answer \(i\), you are equivalently saying that you believe answer \(i\) is incorrect with probability \((1 - q_i)\). So really \(q_i = 0.1\) indicates that you are 90% confident that \(i\) is not the answer, even though it is. You get \(-1\) point for that. If you persist, and insist that you are 99% confident that \(i\) is not correct, even though it is the correct answer, then you get \(-2\) points for that overconfidence. And this punishment of overconfidence increases steadily: if you state 99.9% confidence and you are wrong, this extra 0.9% will cost you another point on top of the two you already lost for saying 99%. It is then natural that the extension to “total overconfidence” (\(q_i = 0\)) should be immediate exclusion from the scoreboard (a score of minus infinity). If you stated that answer \(i\) was impossible, and it happened, then it was clearly not impossible, and you lied when saying you knew it was impossible; this scoring, by construction, punishes lies and rewards truth-telling.

The “paradox” of excluding a student based on only one answer is actually not a paradox at all. It is the natural extension of the truth-telling incentive. If you don’t know, then you should say so; you will lose relatively few points for that. If on the contrary you repeatedly state 99% confidence and are proved wrong, then you were clearly not 99% sure to begin with. You get a higher score by stating your true credence, what you really know, not what you want to believe you know. And you cannot be sure of something that is not true. This also goes both ways: there is no such thing as 100% confidence. Even when asked very simple questions in mathematics for which you could produce a proof, you can never guarantee that your proof does not contain mistakes. You can dramatically increase your confidence that the proof is correct, but you can never reach 100%; you need to account for all uncertainties. And that is a very interesting property.

Intellectual honesty and uncertainty

Overconfidence and survivorship bias

Many cognitive biases stem from our inability to judge a decision as a bet. Instead, we tend to focus on the outcome, and overrate those who predicted it by ignoring the uncertainty, at the expense of intellectual honesty. Take for instance the prediction of the outcome of flipping an unbiased coin. Out of a hundred people claiming they are 100% sure of the outcome, roughly half will turn out to be correct. Will you praise them for this correct prediction, or question their overconfidence? The latter is hard to do based on a single outcome; maybe they really knew, for reasons you don’t understand, and you would need several repetitions to tell the difference. What if another person had predicted 50% heads / 50% tails: who would you focus on, the more confident prediction of the correct outcome? Again, with a single outcome there is not much to say, but things become interesting with a second outcome. How will you treat the winners of the second round? Will you ignore their failure in the first round, or reward the less impressive but consistent uncertain prediction? Would your answer change if it had been six-sided dice instead of coins, or if there had been one chance in a million to win by pure luck? Remember that given enough participants (or enough rounds), you are always assured of getting a winner, no matter how low the probability of winning by chance. If you ignore overconfident failures and praise cherry-picked correct predictions, you will end up with survivorship bias, where a lottery winner tells you to never stop buying lottery tickets no matter what anyone says, because that is how you get rich and they are living proof of that claim. Yet, if you look at the bet instead of the outcome, it becomes clear why that is a terrible strategy: out of the massive number of players who followed this ticket-buying strategy, only one won. At the moment the bet was made, it was a terrible choice. But if the reward is high enough and you never hear of the punishment for being wrong (or underestimate it), you might be tempted to think that it’s a good strategy in retrospect. It is not. The same happens in multiple-choice questions.
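To see how the logarithmic score treats these two betting styles, here is a toy simulation; the 20 rounds and the 99% overconfident report are arbitrary choices (a report of exactly 100% would score minus infinity on the first miss):

```python
import math
import random

random.seed(1)
rounds = 20
overconfident = 0.0   # always claims 99% confidence in "heads"
calibrated = 0.0      # always reports the honest 50/50

for _ in range(rounds):
    heads = random.random() < 0.5
    overconfident += math.log(0.99 if heads else 0.01)   # weight put on the actual outcome
    calibrated += math.log(0.5)

print(f"overconfident cumulative log score: {overconfident:.1f}")
print(f"calibrated cumulative log score:    {calibrated:.1f}")
# Per round, the expected scores are 0.5*log(0.99) + 0.5*log(0.01) ≈ -2.31
# versus log(0.5) ≈ -0.69: over enough rounds, honesty wins.
```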

Using inherently random examples like coin flips or lotteries makes the problem clear, but the same happens in more intricate cases where you do not have all the information. Full-blown randomness is just the extreme case, where you not only do not have access to all the information, but cannot possibly have it. Students with a partial understanding of a subject are not guessing inherently random answers, but their incomplete understanding makes the situation similar. If you reward correct answers more than you punish incorrect ones, instead of favoring more consistent and less extreme predictions, you will end up with the same problems discussed above, with students learning to bet all or nothing on a single answer and hoping they win the lottery. On top of ruining your estimations of your audience’s understanding, this might teach them to extend this behavior to other parts of their lives, where overconfidence is seldom punished and survivorship bias is common.

Acknowledging uncertainty

With the linear scoring rule, there is no penalty for overconfidence, so when you don’t know, you can safely pick any answer and hope for the best. On the contrary, with a scoring rule that favors acknowledging uncertainty, a more reasonable strategy might be to start from a uniform probability distribution over answers, and slowly move your credence up or down on each answer as you examine more and more evidence, eventually converging to your best estimate. This starting point of suspending your judgment until you have examined enough evidence is one of the core building blocks of science, and it tends to limit confirmation bias (where you only accept evidence that confirms your prior belief and ignore the rest), because there is little to no prior belief by default: the safest default position is to acknowledge your ignorance.
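One way to formalize this “start uniform and move with the evidence” strategy is a Bayesian update; the likelihood numbers below are invented purely for illustration:

```python
answers = ["Earth", "Mars", "Mercury", "Venus"]
credence = [0.25, 0.25, 0.25, 0.25]   # start from acknowledged ignorance

# Each piece of evidence is summarized by how plausible it would be under each
# answer (made-up numbers). Multiplying and renormalizing nudges the credence,
# without ever pushing it all the way to 0 or 1.
evidence = [
    [0.1, 0.4, 0.8, 0.7],   # e.g. "Mercury is never seen far from the Sun in the sky"
    [0.1, 0.2, 0.9, 0.6],   # e.g. "Mercury has the shortest orbital period"
]

for likelihood in evidence:
    credence = [c * l for c, l in zip(credence, likelihood)]
    total = sum(credence)
    credence = [c / total for c in credence]
    print({a: round(c, 2) for a, c in zip(answers, credence)})
```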

Even as you examine more evidence and update your credence estimate, you will find yourself in the healthier position of never pushing a credence to an extreme, because no single piece of evidence will ever allow you to guarantee that something is impossible; it will only move your cursor slightly further in that direction. This transforms the question from an answer-picking process into an estimation of levels of belief, favoring a continuous evolution of credences for participants, which is also a great feature.

Changing one’s mind

To illustrate the benefits of continuous credence evolution, the figure below shows a possible evolution of the credences of the student initially convinced that Mars is the closest planet to the Sun. The bars have been replaced with lines for readability, but the slopes are meaningless; only the points matter. The most important part of this figure is the fact that the credences do move over time: the student not only wants to learn, but also reviews the evidence and updates their credences accordingly, in the direction we expect.

Fig 4. Example of continuous credence evolution

The change is not radical though; it happens slowly over time, with first a reversion to a state where credence is evenly distributed between Mars and Mercury, and only then a switch to a maximum on Mercury. When thinking too deterministically, changing one’s mind like this is much harder, because a single argument would need to have a huge impact to cause a change from answering Mars to answering Mercury, and any argument with a smaller impact could be dismissed entirely, because there is no way to represent a small change in confidence in a deterministic setting. Here the first change, for instance, is not sufficient to move the position of the maximum, but it does affect the credence a little. There need not be a massive argument radically changing one’s opinion; an accumulation of small arguments can have the same effect, because intermediate states reflecting different levels of confidence are easy to represent. This mindset makes changing one’s mind easy, by giving a meaning to small variations.

This ability to change one’s mind incrementally has far more important consequences than just allowing a student to learn slowly over time. In politics, when thinking deterministically, a change of mind is always radical, and thus requires brutal arguments, which are most of the time invalid and rely on emotional biases to have the desired impact. Since such powerful arguments are rarely available, debates are essentially useless: each party stays polarized on a given position, unable to affect the other parties with weaker arguments and unaffected by theirs. On the contrary, when credences evolve incrementally, small valid arguments are sufficient, and it becomes possible to focus on actual evidence. A single piece of evidence will likely not radically change a participant’s mind, but that does not mean it has no effect. It is the accumulation of evidence that will have a lasting effect, as it should, and debates can stay focused on evaluating the validity of arguments, accepting or rejecting each independently, instead of trying to craft massive convincing-but-invalid arguments in the hope of producing radical changes in other people’s minds. Quizzes with little to no consequences can be a good starting point to get used to this mindset.

Conclusion & references

This post has hopefully convinced you that answering questions deterministically is insufficient in settings where uncertainty matters, and that the technique of assigning a probability to each possible answer can be useful to tackle these cases. We have also seen that communicating these probabilities can pose challenges, because ill-calibrated rewards can incentivize people to lie about their uncertainty, and such rewards are common in our daily lives. However, rewards that incentivize truth-telling exist, which means we can build games to train ourselves to think in probabilities in a way that is consistent with our true credences. This training has many wonderful consequences, such as forcing us to acknowledge uncertainty by default, diminishing our overconfidence, training us to recognize and avoid survivorship bias, and even making it easier to change our minds.

If you’re in for the ride, the next posts in this series will guide you through the key steps needed to efficiently leverage probabilistic question-answering tools, including training yourself to use these tools properly and analyzing the results gathered with them. The next post, Properties of logarithmic scoring, digs into all the awesome properties of the log-score introduced in the previous section. If you did not follow through all the proofs above, that post might be too technical to be useful to you. In that case, feel free to skip ahead to Analyzing scores and calibration, where we will learn how to assign precise numbers to our uncertainty, which is in general a very hard problem, but gets a lot easier with just a little bit of training.

Further reading

Probabilities and proper scoring rules are not new, and dealing with uncertainty has been a concern in various fields of research for quite a while already. The purpose of this series is to provide a self-contained introduction to these topics in a form that is easier to digest than a bunch of research papers all referencing each other in a web that may seem obscure to newcomers. If that’s no obstacle to you, then here are a few papers that you will likely enjoy reading.

More recent and accessible, there is also a book by Annie Duke, “Thinking in Bets”, which carries the same probabilistic ideas through the example of a poker player placing bets on each possible outcome. You might also appreciate more interactive efforts from people at EPFL: see Bayesian examination on Less Wrong by Lê Nguyên Hoang, accompanied by a JavaScript app (Bayes Up by Louis Faucon), and a couple of videos on YouTube (channel Science4All, in French).