Ω 6

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Probability as Minimal Map argued that the probability is a minimal representation of the information in data which is relevant to the query . In other words, is a perfectly efficient map of some territory based on data , suitable for the query . More generally, a full probability distribution is a perfectly efficient map suitable for any queries about the random variable . Bayesian probability tells how to update our map as we gain new information.

What if, for some reason, we wanted to instead throw away some particular information? We still want to represent our map using a probability distribution, but rather than adding information to it (via Bayes’ Rule), we want to remove some information.

Coin-Sum Problem

I flip two coins, and observe both outcomes - either 0 or 1 for each. I want to keep as much information as possible, while throwing away all information from the observation relevant to their sum. How should I “update” my distribution over the outcomes?

We’ll write the outcome of our coinflip as (“B” for “bit” or “binary” or “Bernoulli”), and our final information-removed distribution as - so the notation indicates that we should throw out all info about the sum (the “/” is meant to evoke e.g. a group quotient operation). Note that this kind of “remove information” operation does not follow the usual rules of Bayesian probability.

At first, I reason as follows:

• My final distribution is a function of the outcomes I saw, so my overall strategy should specify four different distributions - one each for (0, 0), (1, 0), (0, 1), and (1, 1). E.g. if I see (0, 0), then I will update to the (0, 0) distribution
• The final distribution for (0, 0) and (1, 1) must be the same, otherwise we could distinguish a sum of 0 from a sum of 2 by looking at which final distribution was chosen.
• … but then the final distributions for (0, 1) and (1, 0) must also be the same as (0, 0), else we could tell that the sum is 1 whenever we see some other distribution.

… so in order to throw out all information about the sum, we have to throw out all the information from the observations. We effectively don’t update at all, and just keep our prior.

This seems kind of weird from an information-theoretic standpoint. We have 2 bits of information from the observation, and the sum only contains bits. It seems like we ought to be able to keep around bit of information, somehow.

The trick turns out to be right at the start of the above reasoning: “my final distribution is a function of the outcomes I saw”. This quietly assumed that our “update” had to be deterministic. We can do better if we allow randomized updates - e.g. if I see (0, 0), then I randomly choose one of several distributions to update to.

Here’s an example of a randomized strategy:

• (0, 1) and (1, 0) have the same sum, so I’d like to be able to distinguish them with certainty. I’ll make and each deterministic, and assume they’re not equal.
• In order to be unable to distinguish between sum 0 and sum 1, when the outcome is (0, 0) I’m forced to randomly choose between and , 50% chance of each. So, = 50% chance , 50% chance .
• Same reasoning applies to (1, 1) as (0, 0).
• In order for the final distributions to accurately represent our final information, they must be and .

Half the time (when the sum is 1) this strategy conveys 1 bit of information (whether the first or second coin is the 1), so overall we keep bit - exactly the information-theoretic limit.

Generalizing

The general form of the problem:

• We have some random variables and . We observe , and want to throw out all information about .
• To do that, we construct some new variable (which may depend on , , and some randomness), which contains as much information as possible about but no information about .
• We then throw away and keep , effectively “updating” our probabilities from to .

In the coin example above, we can encode as: 0 when is (0,1), 1 when is (1,0), otherwise a random 50/50 choice between 0 and 1. Calculating then yields the distributions from the previous section. In this case, we can interpret S as the value of coin 1 assuming the sum was 1 - otherwise it's random.

Two obvious questions are existence and uniqueness:

• Is it always possible to find some which achieves the information-theoretic bound, i.e. (where is mutual information and is entropy)?
• Is unique?

Uniqueness we can answer easily: no, the “optimal” reduced information is not unique. Counterexample: flip two coins, and throw out all info about whether the sum is odd or even. We have 2 bits of info total, we have to throw out 1 bit of info, and we can keep around 1 bit in (at least) two different ways: just keep the outcome of the first coin, or just keep the outcome of the second coin. Either of these, by itself, contains 1 bit of info and tells us nothing about whether the sum is odd or even.

Existence is trickier.

If we observe both and , then we can definitely construct , using cousin_it’s method from this comment: for each possible value of , contains an -value randomly sampled from - except for the observed -value, which maps to the observed -value. For instance, if we observed (1, 0) in our coin-sum problem, then one possible value of would be - so S tells us that if the sum Y=1, then X = (1, 0). tells us nothing at all about , but if we later re-learn the value of , then we can just look up the value of in . This implies that achieves the information-theoretic bound: and together are sufficient to reconstruct .

However, we may want to throw out whatever information we have about some , even when we don’t have complete information about . In other words, what we really want is to construct without knowing - i.e. construct from alone. I still don’t know whether the information-theoretic bound is always achievable in this case, or what the optimal information content is if the bound isn’t achievable.

Why Is This Interesting?

A lot of tricky problems in decision theory feel like the agent should choose to throw away information. Chris Leong argues that this is a useful way to think about logical counterfactuals in general, and I’ve heard other people express similar intuitions. The approach in this post offers a way to formalize and quantify “forgetting”, in a form potentially useful for these kinds of applications.

Along similar lines: the sort of randomization used in game theory feels different from the sort of “randomness” involved in uncertainty. The former is utilitarian, and used to confuse external opponents. The latter is epistemic, and only relevant to internal beliefs. The randomization involved in throwing away information feels similar to game-theoretic randomness, yet it’s in the sort of informational framework usually used for reasoning about uncertainty - suggesting that these two types of randomness could be handled in a more unified conceptual framework.

For example: suppose agent 1 and agent 2 play a zero-sum game. Agent 2 chooses its action by running a copy of agent 1’s source code, then outputting whatever action beats the copy’s action. What should agent 1 do? One possible approach is for agent 1 to throw away any information about its copy’s action which is contained in its own action - so that the copy’s behavior yields no information about agent 1’s behavior. Randomization then naturally enters picture, due to throwing away information. (This doesn’t seem like quite the right decision-making rule in general, but it feels like something similar could work - the main idea is to make “throw away information” an action in the game itself, and push game-theoretic randomization of choices into the info-throw-away action.) One could imagine re-deriving Nash equilibria via agents throwing away information as an action, without any other source of randomness. Whether that can actually work is another open problem.

Ω 6

Pingbacks
New Comment

The bound isn't always achievable if you don't know Y. Suppose with uniform over the 6 possible outcomes. You find out that (without loss of generality). You must construct a function st . Because as far as you know, could be 2 or 3, and you have to be able to construct from and . But then we just take the most common output of , and that tells us so .

(If you had some other procedure for calculating X from S and Y, then you can make a new function that does that, and call it S, so an arbitrary function from X to Y+{err} is the most general form of S.

Nice argument - I'm convinced that the bound isn't always achievable. See my response to cousin_it's comment for some related questions.

Yeah, it seems that if Y isn't uniquely determined by X, we can't reversibly erase Y from X. Let's say we flip two coins, X = the first, Y = the sum. Since S depends only on X and some randomness, knowing S is equivalent to knowing some distribution over X. If that distribution is non-informative about Y, it must be (1/2,1/2). But then we can't reconstruct X from S and Y.

Nice example - I'm convinced that the bound isn't always achievable. So the next questions are:

• Is there a nice way to figure out how much information we can keep, in any particular problem, when Y is not observed?
• Is there some standard way of constructing S which achieves optimal performance in general when Y is not observed?

My guess is that optimal performance would be achieved by throwing away all information contained in X about the distribution P[Y|X]. We always know that distribution just from observing X, and throwing away all info about that distribution should throw out all info about Y, and the minimal map interpretation suggests that that's the least information we can throw out.

I might misunderstand something or made a mistake and I'm not gonna try to figure it out since the post is old and maybe not alive anymore, but isn't the following a counter-example to the claim that the method of constructing S described above does what it's supposed to do?

Let X and Y be independent coin flips. Then S will be computed as follows:

X=0, Y=0 maps to uniform distribution on {{0:0, 1:0}, {0:0, 1:1}}
X=0, Y=1 maps to uniform distribution on {{0:0, 1:0}, {0:1, 1:0}}
X=1, Y=0 maps to uniform distribution on {{0:1, 1:0}, {0:1, 1:1}}
X=1, Y=1 maps to uniform distribution on {{0:0, 1:1}, {0:1, 1:1}}

But since X is independent from Y, we want S to contain full information about X (plus some noise perhaps), so the support of S given X=1 must not overlap with the support of S given X=0. But it does. For example, {0:0, 1:1} has positive probability of occurring both for X=1, Y=1 and for X=0, Y=0. i.e. conditional on S={0:0, 1:1}, X is still bernoulli.