This is a linkpost for https://aizi.substack.com/p/log-odds-are-better-than-probabilities

Log-odds are better than Probabilities


I feel in all these contexts odds are better than log-odds.

Log-odds simplifies Bayesian calculations: so does odds. (The addition becomes multiplication)

Every number is meaningful: every *positive* number is meaningful and the numbers are clearer. I can tell you intuitively what 4:1 or 1:4 means. I can't tell you what -2.4 means quickly, especially if I have to keep specifying a base.

Certainty is infinite: same is true for odds

Negation is the complement and 0 is neutral: Inverse is the complement and 1 is neutral. 1:1 means "I don't know" and 1:x is the inverse of x:1. Both of these are intuitive to me.

Check out what Jaynes has to say on the topic (section 4.2 here, page 120 or 90, depending on which you're looking at). It's pretty much the same thing, but he goes a bit deeper (as always...).

[This is a cross-post from my blog at aizi.substack.com. I'm sure someone has made a point like this before, but I don't know any specific instances and I wanted to give my take on it.]

At my previous job I worked on ML classifiers, and I learned a useful alternative way to think about probabilities which I want to share. I’m referring to log-odds aka logits, where a probability p is represented by logit(p):=log(p/(1-p))^{[1]}.

I claim that, at least for Bayesian updates and binary prediction, it can be better to think in terms of log-odds than probabilities, and this post is laying out that case.

Log-odds simplifies Bayesian calculations

Do you do Bayesian updates in your head? I didn’t, in part because the classic Bayes formula is kinda bad to work with:

\[P(H|E)= \frac{P(H)P(E|H)}{P(E)}\]

The first problem is that you need to know P(E), the chance that E is true at all. But the value of P(E) should be irrelevant since we know we live in a timeline where E is true! Of course you can use a formula like this to hide P(E) but at a complexity cost:

\[P(H|E)=\frac{P(H)P(E|H)}{P(E|H)P(H)+P(E | \neg H)P(\neg H)}\]

For me, this calculation requires too many operations and cached numbers to do easily in my head.

But more importantly, these formulas don’t emphasize how P(H) was updated. Sure, you can say P(H) is being multiplied by P(E|H)/P(E), but that number isn’t really comparable across priors. For instance, if P(E|H)/P(E)=2, that’s a small update if your prior is P(H)=.1 (taking you from 10% to 20%), a huge update if P(H)=.5 (taking you from a coinflip to certainty), and impossible for P(H)>.5. So “P(E|H)/P(E)=2” isn’t a meaningful intermediary calculation.

Now let’s compare the log-odds version. I’ll write L(H) for logit(P(H)):

\[\begin{eqnarray*} L(H|E) &=& \log\left( \frac{P(H|E)}{P(\neg H | E)}\right)\\ &=& \log\left( \frac{\left(\frac{P(H)P(E|H)}{\cancel{P(E)}} \right)}{\left( \frac{P(\neg H)P(E| \neg H)}{\cancel{P(E)}}\right)}\right)\\ &=& \log\left(\frac{P(H) P(E|H)}{P(\neg H)P(E|\neg H)} \right)\\ &=& \log \left( \frac{P(H)}{P(\neg H)} \right)+\log \left( \frac{P(E|H)}{P(E|\neg H)} \right) \\ &=& L(H)+ \log \left( \frac{P(E|H)}{P(E|\neg H)} \right) \end{eqnarray*}\]

Omitting intermediate steps:

\[\begin{eqnarray*} L(H|E) &=& L(H)+ \log \left( \frac{P(E|H)}{P(E|\neg H)} \right) \end{eqnarray*}\]

Now that’s clear!
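The update rule above can be sketched in a few lines of Python. This is only a rough illustration: the function names are made up for this example, the numbers (a 10% prior, evidence that is four times likelier under H than under not-H) are hypothetical, and base 10 is an arbitrary choice of log base. The classic formula is included purely to check that both routes give the same posterior.

```python
import math

def logit(p, base=10):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1 - p), base)

def inv_logit(log_odds, base=10):
    """Convert log-odds back into a probability."""
    return 1 / (1 + base ** (-log_odds))

def update_log_odds(prior_log_odds, p_e_given_h, p_e_given_not_h, base=10):
    """L(H|E) = L(H) + log(P(E|H) / P(E|not H)): one addition per update."""
    return prior_log_odds + math.log(p_e_given_h / p_e_given_not_h, base)

def update_probability(prior, p_e_given_h, p_e_given_not_h):
    """Classic Bayes rule with P(E) expanded, used here only as a cross-check."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1 - prior) * p_e_given_not_h)

# Hypothetical inputs, chosen only for illustration.
prior = 0.1
p_e_given_h, p_e_given_not_h = 0.8, 0.2

posterior_log_odds = update_log_odds(logit(prior), p_e_given_h, p_e_given_not_h)
posterior_via_log_odds = inv_logit(posterior_log_odds)
posterior_via_bayes = update_probability(prior, p_e_given_h, p_e_given_not_h)

print(posterior_via_log_odds, posterior_via_bayes)  # both are 4/13, about 0.308
```

Note that the evidence term, log(0.8/0.2), does not depend on the prior at all, so the same number would be added whatever L(H) was.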

A Bayesian update is just adding a new term, the log-ratio of seeing this evidence when the hypothesis is true vs when it’s false. For me, this is a very easy calculation to do in my head (only two binary operations and a log, and you don’t need to cache numbers between steps), and when I had to do Bayesian updates in my head I would convert to log-odds space and calculate them there.

But I want to claim something stronger: the sheer simplicity of the log-odds Bayes rule suggests we’re thinking in the right terms. Our intermediate calculation log(P(E|H)/P(E|¬H)) is comparable across priors and it connects in an intuitive way to what’s happening in the world. If we call that term “the strength of the evidence” (a name I think is justified), Bayesian updating is literally “adding the strength of the evidence to your prior”. That’s great! As a mathematician, I’d say this is so great (in terms of its elegance, simplicity, correspondence to our natural language, etc) that it’s a sign we’ve found the “right definition”.

That’s my main argument, but there are other minor perks too.

Probability changes lack meaning without base rates

“We’ve improved classification accuracy by 10 percentage points”. Is that good or bad? Taking a classifier from 50/50 correct/incorrect to 60/40 is a small improvement, but taking it from 89/11 to 99/1 is a massive improvement! The problem is you really want to measure the change in both correct and incorrect classes, simultaneously. Log-odds do that because they’re a function of p/(1-p).

Every number is meaningful

Log-odds space is the real line, which corresponds to probabilities in the open interval (0,1). Therefore you can be confident that any rescaling or shifting you do to finite log-odds will result in a new meaningful number, whereas for probabilities you have to be careful never to leave the interval [0,1]. The fact that you can’t uniformly increase a probability by 10% (or 10 percentage points) is an indication they’re not the “right” way to think of things.
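Both of these points can be checked numerically. The sketch below is a rough illustration under stated assumptions: base-10 logs, helper names (`logit`, `inv_logit`) that are illustrative rather than taken from any library, and an arbitrary starting probability of 0.9 for the shifting example.

```python
import math

def logit(p):
    """Base-10 log-odds of a probability p with 0 < p < 1."""
    return math.log10(p / (1 - p))

def inv_logit(log_odds):
    """Convert base-10 log-odds back into a probability."""
    return 1 / (1 + 10 ** (-log_odds))

# The same "10 percentage point" accuracy gain, measured in log-odds.
gain_low = logit(0.60) - logit(0.50)    # about 0.18
gain_high = logit(0.99) - logit(0.89)   # about 1.09
print(gain_low, gain_high)

# Shifting finite log-odds by any finite amount still gives a valid probability.
shifted = [inv_logit(logit(0.9) + shift) for shift in (-5, -1, 0, 1, 5)]
print(shifted)

# Shifting the probability itself can leave [0, 1].
print(0.9 + 0.2)  # greater than 1, so no longer a probability
```

In log-odds terms the 89-to-99 improvement comes out roughly six times larger than the 50-to-60 one, even though both are 10 percentage points.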

Certainty is infinite, and there’s a lot of space near infinity

Probabilities of 0 and 1 correspond to log-odds negative infinity and positive infinity, respectively. This is good because it reminds us that complete certainty is qualitatively different from any amount of uncertainty. For instance, it’s easy to see how any updating from certainty is like adding a finite number to infinity: it still results in infinity.

Also, very-high-confidence predictions are spread out in a sensible way in log-odds space. Predictions of 99% and 99.9% sound very similar in terms of probabilities, but in log-odds space they are ~2 and ~3 hartleys respectively, showing that the second one is much more confident.
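Spelling out the arithmetic behind those two figures (base-10 logs, so the units are hartleys):

\[\begin{eqnarray*} \log_{10}\left(\frac{0.99}{0.01}\right) &=& \log_{10}(99) \approx 1.996 \\ \log_{10}\left(\frac{0.999}{0.001}\right) &=& \log_{10}(999) \approx 2.9996 \end{eqnarray*}\]

Put in odds terms, that is 99:1 versus 999:1, so the second prediction is about ten times more confident, which is exactly one extra hartley.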

Negation is the complement and 0 is neutral

The complement operation (the odds of “not X”) on probabilities is P’=1-P, resulting in a neutral point at .5 (i.e. 50/50 odds). This is okay, but log-odds space wins because the complement operation is L’=-L, so the neutral point is 0. This is more aesthetically pleasing (and maybe has other benefits idk).

Probabilities are still good for other things

I hope I’ve convinced you that log-odds are a useful substitute for probabilities in some situations. However, I don’t want to pretend you should think of everything in terms of log-odds. Probabilities have some real perks, especially in cases where there are three or more options to track, so I wanted to shout out some of those:

[…]^{[2]}.

[…]^{[2]}. Similarly, there’s no rule like “the total area under a PDF is 1” for log-odds.

[…] convolutions and other fundamental calculations, and I don't know of anything like this for log-odds^{[2]}.

^{[1]} The choice of log base doesn’t matter as long as you’re consistent, and the resulting units are called shannons/nats/hartleys for bases 2/e/10 respectively.

^{[2]} Without cheating by converting your log-odds into probabilities.