Training yourself to think in probabilities is hard to do without feedback. An effective way to train painlessly is to turn the estimation of credences into a game, where you get a numerical score that increases when you’re doing well. Then you just need to play often, and your score will naturally go up. However, not all scores are equally suited to the task. We have even seen that some, like linear scoring, can be actively harmful, for instance by incentivizing overconfidence. The property we need to build a good scoring system is properness: a score is proper if it incentivizes truth-telling, that is, if its expected value is maximized by reporting our credences accurately. The good news is that any proper score will do the job, and there are plenty of them!
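To see both behaviors concretely, here is a minimal numeric sketch (the binary question and the 70% true credence are assumptions chosen for illustration): the expected linear score is maximized by exaggerating towards certainty, while the expected logarithmic score is maximized by reporting the truth.

```python
import numpy as np

p = 0.7                           # assumed true credence that the answer is correct
qs = np.linspace(0.01, 0.99, 99)  # candidate reported probabilities

# Expected score of each report q when the event happens with probability p:
linear = p * qs + (1 - p) * (1 - qs)                   # linear score
logsc = p * np.log10(qs) + (1 - p) * np.log10(1 - qs)  # logarithmic score

print(round(qs[np.argmax(linear)], 2))  # 0.99 -> linear rewards overconfidence
print(round(qs[np.argmax(logsc)], 2))   # 0.7  -> log is maximized by truth-telling
```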
Nonetheless, I am particularly fond of one of them: the logarithmic score, for it has many interesting properties beyond mere properness. In this post, we will go through some of the most impressive ones.
WARNING: This is a more technical post than the others in the series. If you did not follow the proofs in the introduction, you may want to skim over some parts or skip this post entirely. If you’re in but would like a quick reminder on Lagrangian duality, there is a very good set of lecture notes by Laurent El Ghaoui for Berkeley’s EE227A course on convex optimization.
Uniqueness under locality constraint
The logarithmic scoring rule has a property that we have overlooked until now: it is local. If the correct answer is \(X = k\), then the score \(S(q, k)\) depends only on \(q_k\). This means that the distribution of weight \(q_j\) on answers \(j \neq k\) does not matter: \(q_A = (.20, .20, .60, 0)\) and \(q_B = (0, .40, .60, 0)\) will get the same score if \(X = k = 2\). This might feel like a questionable choice, but it’s also arguably the only consistent one. Indeed, without any further information, there is no way to tell the difference between a student leveraging real information to cross out the first answer \((j=0)\) and a student overconfidently increasing the weight on the second answer \((j=1)\) without a valid reason, yet both would give weights \(q_B\). Should that get more or fewer points than \(q_A\), then? This indistinguishability can be treated consistently by awarding the same score to \(q_\alpha = (.40 \times \alpha, .40 \times (1 - \alpha), .60, 0)\) for all \(\alpha \in [0, 1]\). This property is called locality: a scoring rule is local if the score depends only on the weight assigned to the correct answer, not on how the remaining weight is distributed among the other answers.
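As a quick check of this example, here is a small sketch contrasting the log-score with the (negated, so that higher is better) quadratic score, which is not local:

```python
import numpy as np

qA = np.array([0.20, 0.20, 0.60, 0.00])
qB = np.array([0.00, 0.40, 0.60, 0.00])
k = 2  # index of the realized answer

def log_score(q):  # local: only the weight on the realized answer matters
    return np.log10(q[k])

def neg_brier(q):  # negated quadratic score: every coordinate contributes
    return -np.sum((q - np.eye(4)[k]) ** 2)

print(log_score(qA) == log_score(qB))  # True: both put 0.60 on the realized answer
print(neg_brier(qA), neg_brier(qB))    # ~ -0.24 vs -0.32: quadratic tells them apart
```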
This is about to get a bit technical, so feel free to skip to the conclusion if you don’t feel like following along. In that case, long story short, the logarithmic scoring rule is the only scoring rule that is both local and proper (up to a multiplicative and additive constant).
Otherwise, note that a generic score \(S : \Delta_n \times [n] \to \mathbb{R} \cup \{ \pm \infty \}\) is local exactly when it can be written as \(S(q, k) = f(q_k)\) for some function \(f : [0,1] \to \mathbb{R} \cup \{ \pm \infty \}\).
Claim: If \(n \geq 3\) and \(S\) is local and proper, then there are constants \((\lambda, c) \in \mathbb{R}_+^* \times \mathbb{R}\) such that \[S(q,k) = \lambda \cdot \log(q_k) + c\]
Observe that \(n=2\) is excluded because all scores are local for binary questions: since \(q_0 = 1 - q_1\), the weight on the realized answer determines the other one, and locality is vacuous. The idea is then to use properness to get a constraint on \(f\), and conclude from there: \[ \Bigl[ {\arg\!\max}_q \> \mathbb{E}_{X \sim p} \left[ f(q_X) \right] = p \Bigr] \, \Rightarrow \, \Bigl[ \forall u \in ]0,1[, \> f'(u) = \lambda / u \Bigr] \]
Proof. We will assume for simplicity that \(f\) is continuously differentiable on \(]0,1[\). The Lagrangian for the maximization of \( \sum_i p_i f(q_i) \) with the usual simplex constraints \( q_i \geq 0 \) and \( \sum_i q_i = 1 \) is
\[ \mathcal{L}(q, \lambda, \mu) = - \sum_i p_i f(q_i) + \lambda \left(\sum_i q_i - 1\right) - \sum_i \mu_i q_i \]
Its derivative with respect to \(q\) is easily computed:
\[ \frac{\partial \mathcal{L}}{\partial q_i} (q, \lambda, \mu) = - p_i f'(q_i) + \lambda - \mu_i \]
The properness condition states that \(q = p\) achieves the maximal score, so the derivative of the Lagrangian with respect to \(q\) must vanish at \(q = p\), which gives the equation
\[ 0 = - p_i f'(p_i) + \lambda - \mu_i \]
We can restrict to all \(p_i\) strictly between 0 and 1, for which the complementary slackness conditions \(\,\mu_i p_i = 0\,\) imply that \(\,\mu_i = 0\,\), and we are left with the defining equation for \(f\)
\[ \forall i, \quad p_i \neq 0 \Rightarrow p_i f'(p_i) = \lambda \]
This is almost the result we need to conclude the proof!
Problem: the Lagrange multiplier \(\lambda\) is a priori \(p\)-dependent; we still need to show that it is the same constant for every \(p\).
Now we will use the assumption \(n \geq 3\). Since we can pad \(p\) with zeroes without loss of generality, we may simply work with \(n=3\) and write \(p = (a,b,c) \in \mathbb{R}_+^3\), where \(a+b+c=1\).
For brevity, let \(\lambda_0 = \frac{1}{2} f'\left(\frac{1}{2}\right)\). Let us show that \(u f'(u) = \lambda_0\) for all \(u \in ]0,1[\).
Let’s start by shrinking that interval. Let \(u \in ]1/2, 1[\), and apply the equation obtained above from Lagrange optimality to the probability distribution \(p = (u, 1-u, 0)\). We get \( u f'(u) = (1-u) f'(1-u) \). Therefore, it only remains to show that \(u f'(u) = \lambda_0\) for all \(u \in ]0, 1/2[\), since the other half then follows by this symmetry.
For the final part, let \(u \in ]0, 1/2[\), and apply the same equation to the distribution \(p = (u, 1/2 - u, 1/2)\). This time we get \(u f'(u) = \frac{1}{2} f'\left(\frac{1}{2}\right) = \lambda_0\).
Thus as claimed, there is a unique constant \(\lambda \in \mathbb{R}\) such that \( f'(u) = \lambda / u \) for all \(u \in ]0,1[\). It only remains to check that \(p\) being a strict maximum of \(q \mapsto \sum_i p_i f(q_i)\) implies \(\lambda > 0\): if \(\lambda < 0\), the point \(q = p\) would instead be a strict minimum, and if \(\lambda = 0\), the score would be constant.
Conclusion. Integrating, there is exactly one function (up to an additive constant) satisfying this equation, therefore \(f : x \mapsto \lambda \cdot \log(x) + c\) for some \(\lambda \in \mathbb{R}_+^* \) and \(c \in \mathbb{R}\).
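As a sanity check of the claim, here is a small brute-force sketch: among local scores \(f(q_k)\), the logarithm is maximized in expectation by reporting the truth, while an innocent-looking alternative such as \(f = \sqrt{\cdot}\) is not (the example distribution and the grid resolution are arbitrary choices of mine):

```python
import numpy as np
from itertools import product

p = np.array([0.5, 0.3, 0.2])  # an arbitrary true distribution over three answers

def best_report(f):
    """Grid-search the 3-simplex for the report q maximizing E[f(q_X)]."""
    best_val, best_q = -np.inf, None
    for a, b in product(np.linspace(0.01, 0.98, 98), repeat=2):
        if a + b > 0.99:
            continue
        q = np.array([a, b, 1 - a - b])
        val = p @ f(q)
        if val > best_val:
            best_val, best_q = val, q
    return np.round(best_q, 2)

print(best_report(np.log))   # [0.5 0.3 0.2]: truth-telling, the log is proper
print(best_report(np.sqrt))  # ~[0.66 0.24 0.1]: overweights the favorite, not proper
```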
Compatibility with conditional probability
The logarithmic score turns changes that are multiplicative on the probabilities into changes that are additive on the score, which plays well with conditional probabilities. This has a very useful consequence: we can split questions and score them separately without altering the overall score.
Additive separation of log-score: \( \> \log\bigl(p(A \mid B)\bigr) + \log\bigl(p(B)\bigr) = \log\bigl(p(A,B)\bigr) \)
Example: Who will win the US presidential election of 2020? (Asked in January 2020.)
Since Donald Trump is president at the time of answering, he will presumably be the Republican candidate. As for the Democratic Party, there are five major candidates in the primary: Joe Biden, Bernie Sanders, Elizabeth Warren, Michael Bloomberg and Pete Buttigieg.
Rather than estimating the six marginal probabilities in one go, it may be fruitful to first consider the question “who will win the Democratic primary?” (five answers), and later answer separately the questions “if X wins the Democratic primary, who will be elected president, X or Trump?”.
Naturally, any user giving probabilities to each event could recombine the marginal probabilities for their estimate, and log the result into the scoring software. But with a logarithmic rule, each question can simply be scored separately, with a zero-score for the unrealized duels. The number of simultaneous answers to handle goes from six to five in this example, but could be cut into much smaller pieces in other cases, for instance if more candidates are considered for each party.
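In code, the splitting looks as follows (the credences are hypothetical, and only the realized branch is shown, since the unrealized duels score zero):

```python
import numpy as np

q_biden_primary = 0.40   # hypothetical: P(Biden wins the Democratic primary)
q_trump_v_biden = 0.55   # hypothetical: P(Trump beats Biden | Biden is the nominee)

# Realized outcome: Biden wins the primary, then loses to Trump.
joint = np.log10(q_biden_primary * q_trump_v_biden)            # one combined question
split = np.log10(q_biden_primary) + np.log10(q_trump_v_biden)  # primary + realized duel

print(np.isclose(joint, split))  # True: log(p(B)) + log(p(A|B)) = log(p(A,B))
```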
Counter-example with Brier score:
Consider a simplification of the previous example with two Democratic candidates: Joe Biden and Bernie Sanders. Let us assume that our estimate of Trump winning against either of them is \(q_T \in [0,1]\). What is the difference in Brier score (sum of squares) between scoring jointly and scoring separately?
If our estimate of Biden’s victory in the primary is \(q_B \in [0,1]\), the probability of each outcome is:
- (T > B) and (B > S), probability \(\, q_T \cdot q_B\)
- (T > S) and (S > B), probability \(\, q_T \cdot (1 - q_B)\)
- (B > T) and (B > S), probability \(\, (1 - q_T) \cdot q_B\)
- (S > T) and (S > B), probability \(\, (1 - q_T) \cdot (1 - q_B)\)
If it turns out that Biden wins against Sanders but loses to Trump, the joint score is \[\begin{aligned} (1 - q_T q_B)^2 + (q_T (1-q_B))^2 + ((1-q_T) q_B)^2 + ((1-q_T)(1-q_B))^2 \end{aligned}\]
By scoring separately, on the contrary, we get \( (1 - q_T)^2 + (0 - (1-q_T))^2 \) for the presidential duel and \((1 - q_B )^2 + (0 - (1-q_B))^2\) for the primary, for a separate-scoring total of \[ 2 (1 - q_T)^2 + 2 (1 - q_B)^2 \]
Plotted below are the two options, for \(q_T = 0.5\). The two scores (upside-down so higher is better) are maximized at \(q_B \to 1\) as expected, but the scores are different everywhere else, so a choice must be made between the two options.
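Here is a minimal matplotlib sketch to reproduce the plot, with the penalties negated so that higher is better:

```python
import numpy as np
import matplotlib.pyplot as plt

q_T = 0.5
q_B = np.linspace(0, 1, 201)

# Brier penalties when Biden wins the primary but loses to Trump,
# negated so that higher is better:
joint = -((1 - q_T * q_B) ** 2 + (q_T * (1 - q_B)) ** 2
          + ((1 - q_T) * q_B) ** 2 + ((1 - q_T) * (1 - q_B)) ** 2)
separate = -(2 * (1 - q_T) ** 2 + 2 * (1 - q_B) ** 2)

plt.plot(q_B, joint, label="joint scoring")
plt.plot(q_B, separate, label="separate scoring")
plt.xlabel("reported probability $q_B$ of Biden winning the primary")
plt.ylabel("Brier score (negated)")
plt.legend()
plt.show()
```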
Note that this difference does not mean that either score fails to be proper. If we take \(0.8\) as the true probability of Biden winning the primary and compute the expected score as a function of the reported probability, we see that the expected score is still maximized by truth-telling in both cases.
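A quick numeric check of that statement, assuming true probabilities of \(0.5\) for the duels and \(0.8\) for Biden winning the primary, with the duel probability reported truthfully:

```python
import numpy as np

p_T, p_B = 0.5, 0.8                 # assumed true probabilities (duel, primary)
grid = np.linspace(0.01, 0.99, 99)  # candidate reported values for q_B

def outcome_probs(q_B, q_T=p_T):
    # order: (T>B, B>S), (T>S, S>B), (B>T, B>S), (S>T, S>B), as listed above
    return np.array([q_T * q_B, q_T * (1 - q_B),
                     (1 - q_T) * q_B, (1 - q_T) * (1 - q_B)])

truth = outcome_probs(p_B)  # true outcome distribution

# E[joint penalty] = sum_k p_k * ||e_k - q||^2 = 1 - 2 p.q + ||q||^2
exp_joint = [1 - 2 * truth @ outcome_probs(q) + outcome_probs(q) @ outcome_probs(q)
             for q in grid]
# E[separate penalty], dropping the duel term, which is constant in q_B:
exp_sep = [2 * (p_B * (1 - q) ** 2 + (1 - p_B) * q ** 2) for q in grid]

print(round(grid[np.argmin(exp_joint)], 2))  # 0.8: joint scoring is still proper
print(round(grid[np.argmin(exp_sep)], 2))    # 0.8: and so is separate scoring
```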
It’s quite unclear which of these scorings should be selected, or if there is a principled option at all. The nice invariance property of the logarithmic score means that no such choice is necessary.
Local flatness near certainty
When giving a confidently correct answer, one must still estimate the level of confidence precisely, at the risk of losing points by stating it too low. With a log-score, however, confidently-correct answers are not very sensitive to the precise decimals of the probability assigned.
For instance, with a score \(S(q,k) = \log_{10}(q_k)\), a correct answer with assigned probability 90% gives \((-0.046)\) points, while probability 99% gives \((-0.004)\). Thus, assigning 90% probability instead of 99% only loses four hundredths of a point. On the other hand, if the answer had been incorrect, going from 10% to 1% would have taken the score from \((-1)\) to \((-2)\), losing a whole point.
This is because the derivative of the log-score is much smaller near certainty than in the uncertain region. When answering questions under such scores, we can thus spend more time correctly estimating the uncertainty of the answers we are unsure of, rather than wasting time on the decimals of answers that are nearly certain. This also has a nice overconfidence-avoiding effect: pushing probabilities too close to one when we are not quite sure does not gain many points, but risks losing far more if we are wrong.
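A small sketch makes the asymmetry explicit, using the base-10 score from above:

```python
import numpy as np

def S(q):  # base-10 log-score of the weight on the realized answer
    return np.log10(q)

# Near certainty, a nine-point jump in probability barely moves the score...
print(round(S(0.99) - S(0.90), 3))  # 0.041: four hundredths of a point
# ...while the same jump in the unlikely region costs a full point.
print(round(S(0.10) - S(0.01), 3))  # 1.0

# The derivative 1/(q ln 10) quantifies the flatness near q = 1:
print(round(1 / (0.95 * np.log(10)), 2))  # 0.46 around 95%
print(round(1 / (0.05 * np.log(10)), 2))  # 8.69 around 5%
```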
Bayes factors and difference of scores
The main property of a score is that its expectation is maximized by truth-telling (i.e. the score is proper), to ensure that the maximization of the score results in accurate reporting of credences. Other than that, the precise value of a score is not necessarily meaningful, and as such a difference of scores can be hard to interpret beyond just which of two values is larger.
With a logarithmic scoring rule, on the contrary, a difference of scores is directly interpretable as the logarithm of a ratio of predicted probabilities. With a score \(S(q,k) = \log_{10}(q_k)\), a difference of \((+1)\) point corresponds to a probability assigned to the correct answer that is ten times larger, for instance 20% compared to 2%. Similarly, a difference of \((+0.3)\) points indicates an answer with twice the probability, for instance 4% compared to 2%, and these gaps add up: a difference of \((+0.6)\) points corresponds to a ratio of four. This gives an immediately interpretable meaning to the numerical value of differences, as the logarithm of a Bayes factor, and helps to understand which scores are meaningfully different.
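The same correspondence, checked quickly:

```python
import numpy as np

def S(q):  # base-10 log-score
    return np.log10(q)

print(round(S(0.20) - S(0.02), 2))  # 1.0 -> Bayes factor of 10
print(round(S(0.04) - S(0.02), 2))  # 0.3 -> Bayes factor of 2
print(round(S(0.08) - S(0.02), 2))  # 0.6 -> the gaps add up: Bayes factor of 4
```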
Connection to KL divergences
As a bonus, observe that the expected log-score is, up to an additive term that does not depend on the prediction, the negative Kullback-Leibler divergence \(\operatorname{KL}(p \mid\mid q)\) of the predicted distribution \(q\) from the outcome distribution \(p\).
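Concretely, taking the expectation of the log-score under the outcome distribution \(p\):
\[ \mathbb{E}_{X \sim p} \bigl[ \log(q_X) \bigr] = \sum_i p_i \log(q_i) = \underbrace{\sum_i p_i \log(p_i)}_{-H(p)} - \underbrace{\sum_i p_i \log\left(\frac{p_i}{q_i}\right)}_{\operatorname{KL}(p \mid\mid q)} \]
The entropy term \(H(p)\) does not depend on the prediction \(q\), so maximizing the expected log-score amounts exactly to minimizing \(\operatorname{KL}(p \mid\mid q)\).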
This will be a neat point of view to use when extending to uncertain corrections or continuous predictions, but it’s too early for that at the moment.
Conclusion
The logarithmic scoring rule is the only proper score that is also local (up to an affine transformation). It is sometimes set aside as being “too brutal”, since the worst score one can get is \((- \infty)\), which is essentially an unrecoverable score. However, this can instead be viewed as a reflection of the true nature of probabilistic thinking: one cannot be sure of something that is not true. Accepting this means taking all possible outcomes into account and never predicting a probability of zero for an outcome that is not impossible, no matter how unlikely. In exchange, it opens the door to many more interesting properties, such as locality of the score and compatibility with conditional probabilities, but also a more forgiving flatness around near-certain outcomes.