Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(I've discussed these ideas in the past, but this idea was lacking a canonical reference post)

Suppose that you are an expected utility maximiser. This means that when you make a decision, you should calculate the expected utility of each counterfactual and choose the counterfactual with the highest expected utility. However, if you know yourself well enough to know that you are an expected utility maximiser, this knowledge can change your counterfactuals. For example, it can provide information about your likely future actions, the expected number of expected utility maximisers in the world, or how clones of you might behave. In general, if you are embedded in a problem and you have reflective knowledge of your own algorithm, then your decision theory provides both a system of evaluation and a data point about the world. We can call this the embedded view.

In contrast, we could also imagine a utility maximiser outside the problem description, analysing the decisions of various agents within the problem according to how well they maximise utility, where none of those agents need be utility maximisers themselves. In general, when we use a decision theory in this way, we will call this the external view of the problem. Normally, we'll be doing one or both of the following:

We can use these as tools for evaluating agents that aren't necessarily using these tools themselves. One case where this is useful is when knowing how the agent makes its decision would mean that only a single decision is consistent with the problem description, and hence that the decision theory problem is trivial in a particular sense. For example, an updateless agent will always one-box in Newcomb's Problem, so it kind of doesn't make sense to ask what it would have scored if it had two-boxed. On the other hand, if we just use updateless decision theory externally, to help evaluate the situation by constructing our counterfactuals, then we don't run into this issue and we end up with multiple consistent counterfactuals.
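A minimal sketch of this external view, using the standard Newcomb payoffs (the agent names and evaluation function here are my own illustration, not from the post): an external evaluator scores arbitrary agents, constructing a consistent counterfactual for each policy, even though any given agent only ever plays one of them.

```python
# External evaluation of agents in Newcomb's Problem.
# Standard payoffs: the predictor puts $1,000,000 in the opaque box
# iff it predicts one-boxing; the transparent box always holds $1,000.

def newcomb_payoff(action: str, prediction: str) -> int:
    opaque = 1_000_000 if prediction == "one-box" else 0
    transparent = 1_000
    return opaque if action == "one-box" else opaque + transparent

def evaluate(agent) -> int:
    """Score an agent under a perfect predictor: the prediction
    matches whatever policy the agent actually follows."""
    action = agent()
    return newcomb_payoff(action, prediction=action)

# The agents under evaluation need not share our decision theory;
# they are just policies.
one_boxer = lambda: "one-box"
two_boxer = lambda: "two-box"

# Both counterfactuals are well-defined from the outside:
print(evaluate(one_boxer))  # 1000000
print(evaluate(two_boxer))  # 1000
```

Internally, the always-one-boxing agent has no consistent "two-boxing" branch; externally, we simply evaluate a different agent that two-boxes, and both counterfactuals are consistent.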

Another situation where this is useful is the Smoking Lesion Problem, which has been criticised in a few places as inconsistent. As Abram Demski says, "It is assumed that those with a smoking lesion are more likely to smoke, but this is inconsistent with their being EDT agents" (such agents choose the action that maximises utility according to EDT counterfactuals, so there isn't any real choice unless both actions have equal utility). However, while it is inconsistent to assume that the agents in the Smoking Lesion Problem are all EDT agents, it is natural to imagine an EDT agent using EDT counterfactuals to evaluate agents who don't necessarily follow EDT.
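To make this concrete, here is a hedged sketch (the joint distribution and utilities are illustrative numbers of my own, not from the post) of an external EDT-style evaluation over a population of arbitrary agents, where the lesion makes smoking more likely but the agents need not be EDT agents:

```python
# External EDT evaluation in a Smoking Lesion setup.
# Utility: smoking is worth +10; the lesion (which causes cancer)
# is worth -100. Illustrative joint distribution over (lesion, smokes)
# for a population of arbitrary, not-necessarily-EDT agents.

P = {
    (True, True): 0.08,   # lesion, smokes
    (True, False): 0.02,  # lesion, doesn't smoke
    (False, True): 0.18,  # no lesion, smokes
    (False, False): 0.72, # no lesion, doesn't smoke
}

def utility(lesion: bool, smokes: bool) -> int:
    return (10 if smokes else 0) + (-100 if lesion else 0)

def edt_value(smokes: bool) -> float:
    """Expected utility conditional on the observed action, i.e. the
    EDT counterfactual, computed from outside the problem."""
    mass = sum(p for (l, s), p in P.items() if s == smokes)
    total = sum(p * utility(l, s) for (l, s), p in P.items() if s == smokes)
    return total / mass

# Because smoking correlates with the lesion in this population,
# conditioning on smoking drags down expected utility:
print(edt_value(True))   # roughly -20.8
print(edt_value(False))  # roughly -2.7
```

There is no inconsistency here: the correlation between the lesion and smoking lives in the population being evaluated, while EDT is only being used by the external evaluator to construct the conditional expectations.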

This post was written with the input of Davide Zagami and Pablo Moreno, and supported by the EA Hotel and the AI Safety Research Program.



4 comments
For example, [for] an updateless agent that always ["]one-boxes["] in Newcomb's, ... it kind of doesn't make sense to ask what that agent would have scored if that agent had two-boxed.

An updateless agent could (always) two-box.

The agent has code. It can only do what the code says. If the code will make it one box, there was a sense in which it never could have two-boxed.

But if the code will make it two box, there was a sense in which it never could have one-boxed.

That's true also.