Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Attention conservation notice: Most of this post is a recap of the standard arguments for UDT, but the way in which the standard proof for value of information being nonnegative breaks down in some decision theory scenarios probably isn't common knowledge yet.

The standard proof that the value of information (VOI) is always nonnegative is very simple and goes something like this. Let $s$ be a true underlying state from the set $S$ of all underlying states, let $a$ be an action selected from a space $A$ of possible actions, let $o$ be an observation from the set $O$ of all observations, and let $U$ be a utility function that maps an action and an underlying state to $\mathbb{R}$. $SI$ is an abbreviation for sample information.

$$E[U] = \max_{a \in A} \sum_{s \in S} P(s)\,U(a,s) = \sum_{o \in O} P(o) \sum_{s \in S} P(s \mid o)\,U(a^*,s)$$

$$E[U \mid SI] = \sum_{o \in O} P(o) \max_{a \in A} \sum_{s \in S} P(s \mid o)\,U(a,s)$$

Let $a^*$ be the action selected by maximizing utility without looking at the sample information. Comparing the right-hand sides of both lines, we can see that for any given piece of information $o$, you can either copy $a^*$ (in which case the utility acquired will be the same), or select some other action (in which case you'll get at least as much utility).

This is just the simple argument that, no matter what information you see, you can always just act as if you hadn't seen it to do as well as the ignorant version of you, and maybe you can do better.
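As a quick numerical check of this argument, here is a minimal sketch on a toy problem. The states, observations, actions, joint probabilities, and utilities are all hypothetical values chosen for illustration. The distribution over observations is fixed and does not depend on the agent's choice.

```python
# Hypothetical toy problem: joint distribution P(s, o) and utility U(a, s).
P_JOINT = {
    ("rain", "clouds"): 0.35,
    ("rain", "clear"): 0.05,
    ("dry", "clouds"): 0.15,
    ("dry", "clear"): 0.45,
}
UTILITY = {
    ("umbrella", "rain"): 1.0,
    ("umbrella", "dry"): 0.6,
    ("no_umbrella", "rain"): 0.0,
    ("no_umbrella", "dry"): 1.0,
}
STATES = ["rain", "dry"]
OBSERVATIONS = ["clouds", "clear"]
ACTIONS = ["umbrella", "no_umbrella"]


def eu_without_info():
    """Pick one action for all observations: max_a sum_s P(s) U(a, s)."""
    return max(
        sum(P_JOINT[(s, o)] * UTILITY[(a, s)] for s in STATES for o in OBSERVATIONS)
        for a in ACTIONS
    )


def eu_with_info():
    """Pick the best action per observation: sum_o max_a sum_s P(s, o) U(a, s)."""
    return sum(
        max(sum(P_JOINT[(s, o)] * UTILITY[(a, s)] for s in STATES) for a in ACTIONS)
        for o in OBSERVATIONS
    )


# Looking at the observation can only help when P(o) is fixed.
value_of_information = eu_with_info() - eu_without_info()
```

Because the per-observation maximum can always fall back to the action the ignorant agent would have picked, `value_of_information` cannot be negative in this setup.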

Swapping out $P(s)$ and $P(s \mid o)$ for $P(s \mid a)$ and $P(s \mid a, o)$ respectively generalizes this proof to cover situations where the underlying state is correlated with your choice of action. This alteration gets Newcomb's problem right.
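To illustrate why conditioning the state on the action matters for Newcomb's problem, here is a minimal sketch. The payoffs, the predictor accuracy, and the 50/50 action-independent prior are hypothetical numbers, not taken from the post.

```python
# Hypothetical payoffs and predictor accuracy for Newcomb's problem.
BIG, SMALL = 1_000_000.0, 1_000.0
ACCURACY = 0.99
STATES = ["full", "empty"]          # contents of the opaque box
ACTIONS = ["one_box", "two_box"]


def utility(action, state):
    payoff = BIG if state == "full" else 0.0
    if action == "two_box":
        payoff += SMALL             # two-boxing always adds the small box
    return payoff


def p_state(state, p_full=0.5):
    """State probability that ignores the action (assumed 50/50 prior)."""
    return p_full if state == "full" else 1 - p_full


def p_state_given_action(state, action):
    """State probability correlated with the action via the predictor."""
    p_full = ACCURACY if action == "one_box" else 1 - ACCURACY
    return p_full if state == "full" else 1 - p_full


def eu_unconditional(action):
    return sum(p_state(s) * utility(action, s) for s in STATES)


def eu_conditional(action):
    return sum(p_state_given_action(s, action) * utility(action, s) for s in STATES)


best_unconditional = max(ACTIONS, key=eu_unconditional)
best_conditional = max(ACTIONS, key=eu_conditional)
```

With the action-independent distribution, two-boxing dominates. Once the state is conditioned on the action, one-boxing has the higher expected utility.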

However, the proof breaks down when the probability distribution over which information you see is correlated with your choice of future action. Example problems with this property are XOR Blackmail, Parfit's Hitchhiker, Transparent Newcomb, and any fictional setting with stable time loops. All of these problems have the property that selecting the best action conditional on the information you see does make you better off, but this decreases the probability of you getting into a favorable situation in the first place (according to past-you).

To formalize this, we need a way to let the probability distribution over the information vary depending on the probability distribution over future actions. Specifically, we will assume a continuous function $f : \Delta A \to \Delta O$ which maps a probability distribution over future actions to a probability distribution over seeing the information in the first place. The policy of the agent (how it reacts to observations) is a Markov kernel of type $O \to \Delta A$, so all policies define a continuous function $\Delta O \to \Delta A$. Composing these two continuous functions gives a continuous function from $\Delta O$ to itself, so by the Tychonoff fixed-point theorem (the infinite-dimensional version of Brouwer's fixed-point theorem), all policies induce at least one probability distribution on $O$ that is a fixed point.
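For intuition, the fixed point can be found numerically in the simplest case. The sketch below assumes binary observations and binary actions, so that a distribution is a single number in [0, 1]. The names `induced_action_dist`, `fixed_point`, `example_env`, and the example numbers are hypothetical. Bisection stands in for the existence argument here. It works in one dimension because the composed map sends [0, 1] to itself and is continuous.

```python
def induced_action_dist(policy, p_obs1):
    """Push P(o=1) through a policy (a Markov kernel from O to distributions
    over A). policy[o] is P(a=1 | o); the result is the marginal P(a=1)."""
    return (1 - p_obs1) * policy[0] + p_obs1 * policy[1]


def fixed_point(policy, env, tol=1e-12):
    """Find p with g(p) = p, where g(p) = env(induced_action_dist(policy, p)).

    Invariant: g(lo) >= lo and g(hi) <= hi, which holds at lo=0, hi=1
    because g maps [0, 1] into [0, 1].
    """
    def g(p):
        return env(induced_action_dist(policy, p))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) >= mid:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical environment: observation 1 gets likelier as action 1 does.
def example_env(p_action1):
    return 0.1 + 0.8 * p_action1


example_policy = [0.2, 0.9]   # P(a=1 | o=0), P(a=1 | o=1)
p_star = fixed_point(example_policy, example_env)
```

For this example the composed map is `g(p) = 0.26 + 0.56 * p`, so the fixed point is `0.26 / 0.44`. In general there can be several fixed points, and bisection returns only one of them.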

Let $P_{\text{argmax}}$ be the probability distribution induced by the policy that just does argmax after seeing the observation, while $P_{a^*}$ is the probability distribution induced by the policy that just takes some fixed action $a^*$. Then the final lines from the proof turn into:

$$E[U] = \sum_{o \in O} P_{a^*}(o) \sum_{s \in S} P(s \mid o)\,U(a^*,s)$$

$$E[U \mid SI] = \sum_{o \in O} P_{\text{argmax}}(o) \max_{a \in A} \sum_{s \in S} P(s \mid o)\,U(a,s)$$

and suddenly we have that the inner term $\max_{a \in A} \sum_{s \in S} P(s \mid o)\,U(a,s)$ increased or stayed the same for all $o$ (by the same argument as before), while at the same time $E[U]$ may be greater than $E[U \mid SI]$ because the probability distribution over observations is different. Parfit's Hitchhiker is a good example, where $o$ corresponds to your observation of whether you were taken into town. Selecting the best action conditional on your observations makes you better off in all situations, but because it affects the probability of the observations in the first place, it actually lowers expected utility.
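Here is a minimal sketch of Parfit's Hitchhiker along these lines. The utilities and the driver's accuracy are hypothetical. As a simplifying assumption, the driver predicts only what the agent would do once in town, so the probability of reaching town depends on the policy's in-town action.

```python
# Hypothetical numbers for Parfit's Hitchhiker.
LIFE, FARE = 1_000_000.0, 100.0
ACCURACY = 0.99                      # how reliably the driver predicts you
OBSERVATIONS = ["town", "desert"]
ACTIONS = ["refuse", "pay"]


def utility(obs, action):
    if obs == "desert":
        return 0.0                   # left in the desert; the action is moot
    return LIFE - FARE if action == "pay" else LIFE


def p_town(action_in_town):
    """The driver takes you to town if they predict that you will pay."""
    return ACCURACY if action_in_town == "pay" else 1 - ACCURACY


def expected_utility(policy):
    pt = p_town(policy["town"])
    return (pt * utility("town", policy["town"])
            + (1 - pt) * utility("desert", policy["desert"]))


def argmax_policy():
    """Choose the best action separately for each observation."""
    return {o: max(ACTIONS, key=lambda a: utility(o, a)) for o in OBSERVATIONS}


always_pay = {"town": "pay", "desert": "pay"}
greedy = argmax_policy()

# Greedy is at least as good in every single observation...
pointwise_better = all(
    utility(o, greedy[o]) >= utility(o, always_pay[o]) for o in OBSERVATIONS
)
# ...yet it does worse in expectation, because it changes P(o).
eu_greedy = expected_utility(greedy)
eu_always_pay = expected_utility(always_pay)
```

The greedy policy refuses to pay in town, which is better conditional on being in town, but it makes being in town much less likely, so `eu_greedy` ends up far below `eu_always_pay`.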

2 comments

This seems very similar to the issues that I've explored in my recent posts (Evil Genie Puzzle, Decision Theory with F@#!ed-Up Reference Classes).

"Selecting the best action conditional on your observations makes you better off in all situations, but because it affects the probability of the observations in the first place, it actually lowers expected utility" - I think there's truth in this, but I don't think that the class of people who should be averaged over is at all clear from the definition of UDT.

I think VOI in UDT is still nonnegative, meaning that if you take a UDT scenario and obscure some of the observations made by the agents, they can't achieve higher utility because of that. Though of course you're right that they should cooperate, instead of having each agent select the action that's best according to that agent's observations.
