The Principle of Predicted Improvement

Ronny Fernandez

I made a conjecture I think is cool. Mark Sellke proved it. I don't know what else to do with it, so I will explain why I think it's cool and give the proof here. Hopefully, you will think it's cool, too.

______________________________________________________________________

Suppose we are trying to assign as much probability as possible to whichever of several hypotheses is true. The law of conservation of expected evidence tells us that for any hypothesis, we should expect to assign the same probability to that hypothesis after observing a test result as we assign to it now. Suppose that $H$ takes values $h_{i}$. We can express the law of conservation of expected evidence as, for any fixed $h_{i}$:

E [P (H = h_{i} | D)] = P (H = h_{i})

In English this says that the probability we should expect to assign to $h_{i}$ after observing the value of $D$ equals the probability we assign to $h_{i}$ before we observe the value of $D .$
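As a quick sanity check, here is a minimal sketch that verifies the law on a small joint distribution. The numbers in `joint` and the helper names are hypothetical, chosen only for illustration; exact fractions are used so the equality is not blurred by rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(H = h_i, D = d_j).
# Rows are hypotheses, columns are test results; the entries sum to 1.
joint = [
    [F(3, 10), F(1, 10)],  # h_0
    [F(1, 10), F(5, 10)],  # h_1
]

def prior(joint, i):
    """P(H = h_i): marginalize the joint over D."""
    return sum(joint[i])

def p_d(joint, j):
    """P(D = d_j): marginalize the joint over H."""
    return sum(row[j] for row in joint)

def expected_posterior(joint, i):
    """E_D[P(H = h_i | D)] = sum_j P(d_j) * P(h_i | d_j)."""
    n_d = len(joint[0])
    return sum(p_d(joint, j) * (joint[i][j] / p_d(joint, j)) for j in range(n_d))

for i in range(len(joint)):
    # The two printed values agree for every fixed hypothesis.
    print(i, prior(joint, i), expected_posterior(joint, i))
```

The expectation here runs over $D$ only, with $h_{i}$ held fixed.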

This law raises a question. If all I want is to assign as much probability to the true hypothesis as possible, and I should expect to assign the same probability I currently assign to each hypothesis after getting a new piece of data, why would I ever collect more data? A. J. Ayer pointed out this puzzle in The Conception of Probability as a Logical Relation (I unfortunately cannot find a link). I. J. Good solved Ayer's puzzle in On the Principle of Total Evidence. Good shows that if I need to act on a hypothesis, the expected value of gaining an extra piece of data is always greater than or equal to the expected value of not gaining that new piece of data. Although there is nothing wrong with Good's solution, I found it somewhat unsatisfying. Ayer's puzzle is purely epistemic, and while there is nothing wrong with a pragmatic solution to an epistemic puzzle, I still felt that there should be a solution that makes no reference to acts or utility at all.

Herein I present a theorem that I think constitutes such a solution. I have decided to call it the principle of predicted improvement (PPI):

E [P (H | D)] \geq E [P (H)]

In English the theorem says that the probability we should expect to assign to the true value of $H$ after observing the true value of $D$ is greater than or equal to the expected probability we assign to the true value of $H$ before observing the value of $D$. This inequality is strict when $H$ and $D$ are not independent. In other words, you should predict that your epistemic state will improve (i.e. you will assign more probability to the truth) after making any relevant observation.

This is a solution to Ayer's puzzle because it says that I should always expect to assign more probability to the true hypothesis after making a relevant observation. It is a purely epistemic solution because it makes no reference to acts or utility. So long as I want to assign more probability to the true hypothesis than I currently do, I should want to make relevant observations.

Importantly, this is completely consistent with the law of conservation of expected evidence. Although for any particular hypothesis I should expect to assign it the same probability after performing a test that I do now, I should also expect to assign more probability to whichever hypothesis is actually true.

Aside from being a solution to Ayer's puzzle, the PPI is cool just because it tells you that you should expect to assign more probability to the truth as you observe stuff.
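The PPI can also be spot-checked numerically. The sketch below is only an illustration, not part of the proof: it draws random joint distributions (the helper names, the 3-by-4 shape, and the seed are arbitrary choices) and compares $E [P (H | D)]$ with $E [P (H)]$, where both expectations are taken over which hypothesis is true.

```python
import random

def random_joint(n_h, n_d, rng):
    """Random joint distribution P(h_i, d_j) with strictly positive entries."""
    weights = [[rng.random() + 1e-9 for _ in range(n_d)] for _ in range(n_h)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def expected_prior_on_truth(joint):
    """E[P(H)] = sum_i P(h_i)^2."""
    return sum(sum(row) ** 2 for row in joint)

def expected_posterior_on_truth(joint):
    """E[P(H | D)] = sum_{i,j} P(h_i, d_j) * P(h_i | d_j)."""
    n_d = len(joint[0])
    p_d = [sum(row[j] for row in joint) for j in range(n_d)]
    return sum(row[j] ** 2 / p_d[j] for row in joint for j in range(n_d))

rng = random.Random(0)  # fixed seed so the run is reproducible
gaps = []
for _ in range(1000):
    joint = random_joint(3, 4, rng)
    gaps.append(expected_posterior_on_truth(joint) - expected_prior_on_truth(joint))

# The smallest gap should be non-negative, up to floating-point error.
print(min(gaps))
```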

______________________________________________________________________

There is a similar, better-known theorem from information theory that my friend Alex Davis showed me:

E [E [- log (P (H | D)) | D]] \leq E [- log (P (H))]

In English this says that you should expect the entropy of your distribution to go down after you make an observation. If we use the law of iterated expectation, multiply both sides by minus one, and reverse the inequality, we get something that looks a lot like the PPI:

E [log (P (H | D))] \geq E [log (P (H))]

It does not imply PPI in any obvious way because we can have two distributions such that one is higher in expected negentropy but lower in expected probability assigned to the truth, and vice versa. They are similar theorems in that one says you should predict that the probability assigned to the true outcome will be higher after an observation, while the other says you should predict that the log probability will be higher. They are different in that they use different measures of confidence.
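To make the difference between the two measures concrete, here is a small sketch with two hypothetical distributions, `p` and `q`, picked by hand for illustration. One scores higher on expected probability assigned to the truth, the other on expected log probability (negentropy).

```python
import math

def expected_probability_on_truth(p):
    """E[P(H)] = sum_i p_i^2."""
    return sum(x * x for x in p)

def expected_log_probability(p):
    """E[log P(H)] = sum_i p_i * log(p_i), i.e. minus the entropy."""
    return sum(x * math.log(x) for x in p if x > 0)

p = [0.5, 0.5]
q = [0.7, 0.1, 0.1, 0.1]

# q assigns more expected probability to the truth than p does...
print(expected_probability_on_truth(p), expected_probability_on_truth(q))
# ...but p has the higher expected log probability (lower entropy).
print(expected_log_probability(p), expected_log_probability(q))
```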

The advantage of the PPI is that it is phrased in the same terms as Ayer's puzzle: probabilities rather than log probabilities. I also claim that the PPI is easier to read and interpret, so it might be pedagogically useful to teach it before teaching that expected entropy after an observation is less than or equal to current entropy.

Anyway, here's Sellke's proof.

______________________________________________________________________

Proof:

We want to show

E [P (H | D)] \geq E [P (H)] .

Let's say that $H$ takes values $(h_{i})$ and $D$ takes values $(d_{j})$ . The left hand side is

\sum_{i, j} P [h_{i} | d_{j}] P [h_{i} \land d_{j}] .

The right hand side is

\sum_{i} P [h_{i}]^{2} .

The left hand side is equivalent to

\sum_{i, j} \frac{P [h_{i} \land d_{j}]^{2}}{P [d_{j}]} = \sum_{i} \sum_{j} \frac{P [h_{i} \land d_{j}]^{2}}{P [d_{j}]} .

Titu's lemma: for any real numbers $a_{j}$ and positive real numbers $b_{j}$,

\sum_{j} \frac{a_{j}^{2}}{b_{j}} \geq \frac{(\sum_{j} a_{j})^{2}}{\sum_{j} b_{j}} .

If we apply this to each inner sum $\sum_{j}$, the lower bound is just $P [h_{i}]^{2}$. This is because:

\sum_{j} P [h_{i} \land d_{j}] = P [h_{i}]

and

\sum_{j} P [d_{j}] = 1.

For each fixed i, set:

a_{j} = a_{i, j} = P [h_{i} \land d_{j}]

and

b_{j} = P [d_{j}] .

To conclude, for each fixed i we have

\sum_{j} \frac{P [h_{i} \land d_{j}]^{2}}{P [d_{j}]} \geq \frac{(\sum_{j} P [h_{i} \land d_{j}])^{2}}{\sum_{j} P [d_{j}]} = P [h_{i}]^{2}

and hence

\sum_{i} \sum_{j} \frac{P [h_{i} \land d_{j}]^{2}}{P [d_{j}]} \geq \sum_{i} P [h_{i}]^{2} .
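Titu's lemma is the only inequality the proof uses, so it is worth spot-checking on its own. The sketch below is an illustration rather than a proof: it tries random real $a_{j}$ and positive $b_{j}$ (the ranges, sequence length, and seed are arbitrary), and then a hand-picked pair of parallel vectors, where the two sides should agree.

```python
import random

def titu_lhs(a, b):
    """sum_j a_j^2 / b_j, with every b_j assumed positive."""
    return sum(x * x / y for x, y in zip(a, b))

def titu_rhs(a, b):
    """(sum_j a_j)^2 / (sum_j b_j)."""
    return sum(a) ** 2 / sum(b)

rng = random.Random(0)  # fixed seed so the run is reproducible
worst_gap = float("inf")
for _ in range(1000):
    a = [rng.uniform(-1.0, 1.0) for _ in range(5)]
    b = [rng.uniform(0.1, 1.0) for _ in range(5)]
    worst_gap = min(worst_gap, titu_lhs(a, b) - titu_rhs(a, b))

# Smallest observed value of lhs - rhs; non-negative up to rounding.
print(worst_gap)

# Parallel vectors (a_j = 2 * b_j): the two sides coincide.
parallel_b = [1.0, 2.0, 3.0]
parallel_a = [2.0 * y for y in parallel_b]
print(titu_lhs(parallel_a, parallel_b), titu_rhs(parallel_a, parallel_b))
```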

Equality:

Here we explain why the only equality case is when $H$ and $D$ are independent.

Titu's Lemma is an equality iff the two vectors $(a_{1}, \dots, a_{n})$ and $(b_{1}, \dots, b_{n})$ are parallel, that is, if there exists a constant $λ$ such that $a_{j} = λ b_{j}$ for all $j$. If we translate this equality condition over to our application of Titu's Lemma above, we see that our proof preserves equality if and only if there exist constants $λ_{i}$, one for each $i$, such that $P [h_{i} \land d_{j}] = λ_{i} \cdot P [d_{j}]$ for all $i$ and $j$. (We applied Titu once for each value of $i$, so we need a $λ_{i}$ value for each inequality to be an equality. But these $λ_{i}$'s can be different.)

Now if we sum over $j$ there we get

\sum_{j} P [h_{i} \land d_{j}] = λ_{i} \cdot \sum_{j} P [d_{j}]

and so

P [h_{i}] = λ_{i} .

Plugging this back in, we see that equality is true iff

P [h_{i} \land d_{j}] = P [h_{i}] \cdot P [d_{j}]

which is equivalent to independence of $H$ and $D$, or $0$ mutual information.
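The equality case can be illustrated with exact arithmetic. In this sketch the marginals `p_h` and `p_d` and the dependent joint are hypothetical numbers chosen for illustration. The independent joint is built as a product of marginals and gives exact equality, while the dependent joint gives a strict inequality.

```python
from fractions import Fraction as F

def expected_prior_on_truth(joint):
    """E[P(H)] = sum_i P(h_i)^2."""
    return sum(sum(row) ** 2 for row in joint)

def expected_posterior_on_truth(joint):
    """E[P(H | D)] = sum_{i,j} P(h_i, d_j)^2 / P(d_j)."""
    n_d = len(joint[0])
    p_d = [sum(row[j] for row in joint) for j in range(n_d)]
    return sum(row[j] ** 2 / p_d[j] for row in joint for j in range(n_d))

# Independent case: P(h_i, d_j) = P(h_i) * P(d_j).
p_h = [F(1, 2), F(1, 3), F(1, 6)]
p_d = [F(1, 4), F(3, 4)]
independent = [[ph * pd for pd in p_d] for ph in p_h]

# Dependent case: this joint is not a product of its marginals.
dependent = [
    [F(3, 10), F(1, 10)],
    [F(1, 10), F(5, 10)],
]

# Independent: both values are 7/18.
print(expected_prior_on_truth(independent), expected_posterior_on_truth(independent))
# Dependent: 13/25 before, 41/60 after.
print(expected_prior_on_truth(dependent), expected_posterior_on_truth(dependent))
```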

E[P(H|D)]≥E[P(H)]

In English the theorem says that the probability we should expect to assign to the true value of H after observing the true value of D is greater than or equal to the expected probability we assign to the true value of H before observing the value of D.

I have a very basic question about notation -- what tells me that H in the equation refers to the true hypothesis?

Put another way, I don't really understand why that equation has a different interpretation than the conservation-of-expected-evidence equation: $E [P (H = h_{i} | D)] = P (H = h_{i})$.

In both cases I would interpret it as talking about the expected probability of some hypothesis, given some evidence, compared to the prior probability of that hypothesis.

I also had trouble with the notation. Here's how I've come to understand it:

Suppose I want to know whether the first person to drive a car was wearing shoes, just socks, or no footwear at all when they did so. I don't know what the truth is, so I represent it with a random variable $H$ , which could be any of "the driver wore shoes," "the driver wore socks" or "the driver was barefoot."

This means that $P (H)$ is a random variable equal to the probability I assign to the true hypothesis (it's random because I don't know which hypothesis is true). It's distinct from $P (H = h_{i})$ and $P (h_{i})$ which are both the same constant, non-random value, namely the credence I have in the specific hypothesis $h_{i}$ (i.e. "the driver wore shoes").

( $P (H = h_{i})$ is roughly "the credence I have that 'the driver wore shoes' is true," while $P (h_{i})$ is "the credence I have that the driver wore shoes," so they're equal, and semantically equivalent if you're a deflationist about truth)

Now suppose I find the driver's great-great-granddaughter on Discord, and I ask her what she thinks her great-great-grandfather wore on his feet when he drove the car for the first time. I don't know what her response will be, so I denote it with the random variable $D$ . Then $P (H | D)$ is the credence I assign to the correct hypothesis after I hear whatever she has to say.

So $E (P (H = h_{i} | D)) = P (H = h_{i})$ is equivalent to $E (P (h_{i} | D)) = P (h_{i})$ and means "I shouldn't expect my credence in 'the driver wore shoes' to change after I hear the great-great-granddaughter's response," while $E (P (H | D)) \geq E (P (H))$ means "I should expect my credence in whatever is the correct hypothesis about the driver's footwear to increase when I get the great-great-granddaughter's response."

I think there are two sources of confusion here. First, $H$ was not explicitly defined as "the true hypothesis" in the article. I had to infer that from the English translation of the inequality,

In English the theorem says that the probability we should expect to assign to the true value of H after observing the true value of D is greater than or equal to the expected probability we assign to the true value of H before observing the value of D,

and confirm with the author in private. Second, I remember seeing my probability theory professor use sloppy shorthand, and I initially interpreted $P (H)$ as a sloppy shorthand for $P (H = h_{i})$ . Neither of these would have been a problem if I were more familiar with this area of study, but many people are less familiar than I am.
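The footwear example can be worked through with numbers. Everything numeric in this sketch is a made-up assumption for illustration: the prior credences, and a response model in which the great-great-granddaughter names the true footwear with probability 8/10 and each wrong option with probability 1/10. It computes both readings side by side: the per-hypothesis expectation over $D$ alone, and the expectation over $H$ and $D$ jointly.

```python
from fractions import Fraction as F

hypotheses = ["shoes", "socks", "barefoot"]

# Hypothetical prior credences P(H = h).
prior = {"shoes": F(6, 10), "socks": F(1, 10), "barefoot": F(3, 10)}

def likelihood(d, h):
    """Hypothetical P(D = d | H = h): the response names the truth 8 times in 10."""
    return F(8, 10) if d == h else F(1, 10)

# Joint P(H = h, D = d) and marginal P(D = d); responses range over the same labels.
joint = {(h, d): prior[h] * likelihood(d, h) for h in hypotheses for d in hypotheses}
p_d = {d: sum(joint[(h, d)] for h in hypotheses) for d in hypotheses}

def posterior(h, d):
    """P(H = h | D = d)."""
    return joint[(h, d)] / p_d[d]

# Conservation of expected evidence: fix h, average over D only.
cee = {h: sum(p_d[d] * posterior(h, d) for d in hypotheses) for h in hypotheses}

# PPI: P(H) is a random variable taking the value prior[h] with probability prior[h].
expected_before = sum(prior[h] * prior[h] for h in hypotheses)
# P(H | D) takes the value posterior(h, d) with probability joint[(h, d)].
expected_after = sum(joint[(h, d)] * posterior(h, d) for h in hypotheses for d in hypotheses)

print(cee == prior)                      # per-hypothesis credence is unchanged in expectation
print(expected_before, expected_after)   # credence in the truth is expected to rise
```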

I have a very basic question about notation -- what tells me that H in the equation refers to the true hypothesis?

H stands for hypothesis. We're taking expectations over our distribution over hypotheses: that is, expectations over which hypothesis is true.

Put another way, I don't really understand why that equation has a different interpretation than the conservation-of-expected-evidence equation: $E [P (H = h_{i} | D)] = P (H = h_{i})$.

In the PPI inequality, the expectations are being taken over H and D jointly; in the CEE equation, the expectation is just being taken over D.

I should note that when I first saw the PPI inequality, I also didn't get what it was saying, just because I had very low prior probability mass on it saying the thing it actually says. (I can't quite pin down what generalisation or principle led to this situation, but there you go.)

Yeah, I have intuitively the same interpretation.

My model is also that there is indeed lots of competing notational syntax in probability theory, and that some people would tell you that the current notation being used is invalid, or stands for something weird and meaningless. So I do think explaining the notation and the choice of notation in detail here is a good idea.

I honestly could not think of a better way to write it. I had the same problem when my friend first showed me this notation. I thought about using "$E [P (H = h_{\text{true}})]$" but that seemed more confusing and less standard? I believe this is how they write things in information theory, but those equations usually have logs in them.

Just to add an additional voice here, I would view that as incorrect in this context, instead referring to the thing that the CEE is saying. The way I'd try to clarify this would be to put the variables varying in the expectation in subscripts after the $E$ , so the CEE equation would look like $E_{D} [P (H = h_{i} | D)] = P (H = h_{i})$ , and the PPI inequality would be $E_{(H, D)} [P (H | D)] \geq E_{H} [P (H)]$ .

Yeah, this is the one that I would have used.

This was fun!

A related fact: suppose you have a simple random walk (let's say integer valued for simplicity, this all works with Brownian motion too) conditioned to reach (say) 100 before reaching 0. Then (at least before it has reached 100), from state n it has a (n+1)/(2n) chance to move up to n+1, instead of a 1/2 chance for the unconditioned walk. The proof is another helping of Bayes' Rule.

This model applies pretty directly if you think of a probability as a martingale in [0,1], and the conditioning as being secretly told the truth. So in this example you can explicitly quantify the drift toward the truth.
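The Bayes' Rule computation for the conditioned walk can be sketched as follows. It assumes the standard gambler's-ruin fact that an unconditioned simple random walk started at n reaches the top before 0 with probability n/top; the function names are hypothetical.

```python
from fractions import Fraction as F

N = 100  # the walk is conditioned to reach N before 0

def hit_top_first(n, top=N):
    """P(unconditioned walk from n reaches top before 0) = n / top (gambler's ruin)."""
    return F(n, top)

def conditioned_up_probability(n, top=N):
    """P(step n -> n+1 | reach top before 0), by Bayes' rule:
    P(up) * P(reach top first from n+1) / P(reach top first from n)."""
    return F(1, 2) * hit_top_first(n + 1, top) / hit_top_first(n, top)

def conditioned_down_probability(n, top=N):
    """P(step n -> n-1 | reach top before 0), by the same argument."""
    return F(1, 2) * hit_top_first(n - 1, top) / hit_top_first(n, top)

# The upward probability equals (n + 1) / (2n) at every interior state.
matches_formula = all(
    conditioned_up_probability(n) == F(n + 1, 2 * n) for n in range(1, N)
)
print(matches_formula)
print(conditioned_up_probability(1), conditioned_up_probability(50))
```

The drift toward the top is strongest near 0 (from state 1 the conditioned walk must step up) and fades toward 1/2 as n grows.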

Does your principle follow from Good's? It would seem that it does. Perhaps a good way to generalise the idea would be that the EV linearly aggregates the distribution and isn't expected to change, but other aggregations like log get on average closer to their value at hypothetical certainty. For example, the variance of a real parameter is expected to go down.

I didn't take the time to check whether it did or didn't. If you would walk me through how it does, I would appreciate it.

Good shows that for every utility function for every situation, the EV of utility increases or stays the same when you gain information.

If we can construct a utility function whose EV always equals the EV of the probability assigned to the correct hypothesis, we can transfer the conclusion. That was my idea when I made the comment.

Here is that utility function: first, the agent mentally assigns a positive real number $r (h_{i})$ to every hypothesis $h_{i}$, such that $\sum_{i} r (h_{i}) = 1$. It prefers any world where it does this to any where it doesn't. Its utility function is:

2 r (H) - \sum_{j} r (h_{j})^{2}

This is the quadratic scoring rule, which is maximized in expectation by reporting one's actual credences, so $r (h_{i}) = P (h_{i})$. Then its expected utility is:

\sum_{i} P (h_{i}) [2 P (h_{i}) - \sum_{j} P (h_{j})^{2}]

Simplifying:

2 \sum_{i} P (h_{i})^{2} - \sum_{j} P (h_{j})^{2} \sum_{i} P (h_{i})

And since

\sum_{i} P (h_{i}) = 1

this is:

\sum_{i} P (h_{i})^{2}

Which is just

E [P (H)]
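The quadratic-score calculation can be checked with exact fractions. The credences `p` and the two alternative reports in this sketch are hypothetical. It confirms that reporting $r = P$ yields an expected utility of $\sum_{i} P (h_{i})^{2}$, and that the two other reports tried here do worse.

```python
from fractions import Fraction as F

def expected_utility(p, r):
    """Expected value of 2 r(H) - sum_j r(h_j)^2 when H is distributed according to p."""
    penalty = sum(x * x for x in r)
    return sum(p_i * (2 * r_i - penalty) for p_i, r_i in zip(p, r))

# Hypothetical credences over three hypotheses.
p = [F(1, 2), F(3, 10), F(1, 5)]

honest = expected_utility(p, p)
uniform = expected_utility(p, [F(1, 3), F(1, 3), F(1, 3)])
overconfident = expected_utility(p, [F(8, 10), F(1, 10), F(1, 10)])

print(honest, sum(x * x for x in p))  # both are 19/50
print(uniform, overconfident)         # 1/3 and 6/25, both smaller
```

In general the shortfall from honest reporting is $\sum_{i} (P (h_{i}) - r (h_{i}))^{2}$, which is why the honest report is the maximizer.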

I see. I think you could also use PPI to prove Good's theorem though. Presumably the reason it pays to get new evidence is that you should expect to assign more probability to the truth after observing new evidence?

I think this is very well done. The explanation is sufficiently clear that even I, the non-formal-math person, can follow the logic.

There is actually a much easier and more intuitive proof.

For simplicity, let's assume H takes only two values T(true) and F(false).

Now, let's assume that God knows that H = T, but the observer (me) doesn't know it. If I now make a measurement of some dependent variable D with value d_i, I'll either:

1. Update my probability of T upwards if d_i is more probable under T than in general.

2. Update my probability of T downwards if d_i is less probable under T than in general.

3. Not change my probability of T at all if d_i is exactly as probable under T as in general.

("In general" here means without knowledge of whether T or F happened, i.e. assuming the observer's prior probabilities.)

The law of conservation of expected evidence tells us that in general (assuming prior probabilities), the expected change in assigned probability for T is 0. However, if H=T, then those events that update the probability of T upwards are more likely under T than in general, and those which update the probability of T downwards are less likely. Thus the expected change in assigned probability for T is > 0 if T is true.

QED
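The claim in this argument, that the expected change in the probability assigned to T is positive once we condition on T being true, can be checked on a small example. The joint distribution below is hypothetical, and the sketch only illustrates the claim for one case; it is not a substitute for the proof.

```python
from fractions import Fraction as F

# Hypothetical joint distribution; row 0 is H = T, row 1 is H = F,
# and columns are the possible measurement results d_0, d_1.
joint = [
    [F(3, 10), F(1, 10)],  # T
    [F(1, 10), F(5, 10)],  # F
]

p_t = sum(joint[0])                                   # prior P(T)
p_d = [joint[0][j] + joint[1][j] for j in range(2)]   # P(d_j)
post_t = [joint[0][j] / p_d[j] for j in range(2)]     # P(T | d_j)

# Averaging over D with the observer's prior predictive: no expected change.
unconditional = sum(p_d[j] * post_t[j] for j in range(2))

# Averaging over D as it is distributed when T is actually true: P(d_j | T).
given_t = sum((joint[0][j] / p_t) * post_t[j] for j in range(2))

print(p_t, unconditional, given_t)  # 2/5, 2/5, 29/48
```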

I had already proved it for two values of H before I contacted Sellke. How easily does this proof generalize to multiple values of H?

Very simple. To prove it for an arbitrary number of values, you just need to prove that h_i being true increases its expected “probability to be assigned” after measurement for each i.

If you define T as h_i and F as NOT h_i, you have just reduced the problem to the two-value version.
