Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

One idea for how to make a safe superintelligent agent is to make a system that only answers questions, but doesn't try to act in the world---an "oracle", in Nick Bostrom's terms. One of the things that make this difficult is that it's not clear what it should mean, formally, to optimize for "truthfully answering questions" without in some way trying to influence the world; intuitively: Won't an agent that is trying to maximize the number of truthfully answered questions want to manipulate you into asking easier questions?

In this post, I consider a formal toy model of an agent that is trying to make correct predictions about future input, but, in a certain formal sense, has no incentive to make its future input easier to predict. I'm not claiming that this model definitely avoids all unintended optimizations---I don't understand it that deeply yet---but it seems like an interesting proposal that is worth thinking about.


Let's start with a toy problem, which isn't supposed to be realistic, but which is supposed to give us a formal example in which a naively implemented predictor has an incentive to manipulate its programmers.

In our toy problem, there's a tennis player, who is going to play ten tennis matches. Before the first match, the predictor (an AI system) is asked to make a prediction about the outcome of each of the ten matches. For each match, the predictor's prediction is either "win" or "loss". (So the predictor's output consists of ten bits, one bit for each match.)

The predictor is trying to maximize the expected number of matches it predicts correctly. Its world model is as follows:

  • In each of the first two matches, the tennis player has a 75% chance of winning. (The matches are independent.)
  • After the second match, the player is shown all ten of the predictor's predictions.
  • If the predictor predicted "loss" on every match, then the tennis player loses hope (if they won one or both of the first two matches, they chalk it up to a freak success), and they lose all of the remaining eight matches.
  • Otherwise---i.e., if at least one of the predictor's predictions was "win"---the player is unaffected and has a 75% chance of winning in each of the remaining eight matches (again, the matches are independent).

Now the predictor calculates which output bit pattern maximizes its expected score:

  • If it outputs anything other than "all matches are losses", then each time it predicts a win, it gets an expected 0.75 points, and each time it predicts a loss, it gets an expected 0.25 points.
  • So the best it can do in this case is to predict "win" on every match, which nets it an expected 7.5 points total.
  • On the other hand, if it does predict "all matches are losses", then:
    • on the first two matches, it gets an expected 0.25 points each;
    • but on the next eight matches, it is correct every time, i.e., it gets an expected point for each of these matches.
    • Thus, it gets an expected 8.5 points total.

Hence, the predictor outputs "all matches are losses". But in some intuitive sense, it seems clear that this is manipulative: Intuitively, the predictor doesn't output "loss" on the first two matches because it thinks that a loss is more likely; it does so solely in order to make the player easier to predict on the latter matches.
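To make the arithmetic concrete, here is a minimal brute-force sketch of the naive predictor (my own illustration, not code from the post; names like `win_probs` and `expected_score` are invented for this example). It enumerates all 2^10 candidate outputs, grades each one against the world model that this very output would cause, and confirms that "all losses" comes out on top with an expected 8.5 points:

```python
from itertools import product

N_MATCHES = 10
P_WIN = 0.75

def win_probs(shown_pred):
    """Per-match win probabilities, given the predictions shown to the player
    after match 2."""
    if all(bit == 0 for bit in shown_pred):              # "all losses": player gives up
        return [P_WIN, P_WIN] + [0.0] * (N_MATCHES - 2)
    return [P_WIN] * N_MATCHES

def expected_score(pred, shown_pred):
    """Expected number of correct predictions when `pred` is graded and
    `shown_pred` is what the player gets to see."""
    return sum(q if bit == 1 else 1.0 - q
               for bit, q in zip(pred, win_probs(shown_pred)))

# The naive predictor grades each candidate output against the world that
# this very output would cause.
best = max(product((0, 1), repeat=N_MATCHES), key=lambda p: expected_score(p, p))
print(best, expected_score(best, best))   # (0, ..., 0) with an expected 8.5 points
```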

We'd like to tell our predictor to choose its predictions only so as to match the world better, not so as to make the world easier to predict. At first blush, this doesn't seem easy---but the point of this post is to show that maybe this is easier than expected.


Let's make a simple formal model of the above situation.

We'll write $X_i$ for the random variable that specifies the true outcome of the $i$'th match; $X_i = 1$ means that the match was a win, $X_i = 0$ means a loss. We write $X = (X_1, \dots, X_{10})$. Similarly, a possible prediction is a vector $p = (p_1, \dots, p_{10}) \in \{0,1\}^{10}$, where $p_i = 1$ predicts a win, $p_i = 0$ a loss.

The predictor's world model associates to each possible prediction $p$ a probability distribution $\mathbb{P}_p$, which gives the probabilities of different outcomes, assuming that the predictor behaves according to $p$ (using something like causal graphs to evaluate the counterfactuals); thus, for $p = (0, \dots, 0)$ and $i > 2$, we have $\mathbb{P}_p(X_i = 0) = 1$, for every other combination of $p$ and $i$ we have $\mathbb{P}_p(X_i = 1) = 0.75$, and all the $X_i$ are independent according to every $\mathbb{P}_p$.

Given a prediction $p$, let $s_i(p, X) = 1$ if $p_i = X_i$ (the prediction was correct), $s_i(p, X) = 0$ otherwise, and let $s(p, X) = \sum_{i=1}^{10} s_i(p, X)$; thus, this is the actual number of correct predictions, given the outcome $X$. Writing $\mathbb{E}_p$ for the expectation associated with the probability distribution $\mathbb{P}_p$, this means that $$\mathbb{E}_p[s(p, X)]$$ is the expected number of correct predictions that our example predictor tried to maximize.

And now that we have written down that expression, it becomes trivial to pull apart how the predictor's predictions affect its score (a) by better matching the actual outcome and (b) by making the outcome easier to predict... do you see it?

...

...

...


In the formula above, the prediction $p$ occurs in two different places: in the subscript of $\mathbb{E}_p$, where it represents the way the agent's predictions affect the actual outcome, and in the first argument of $s(p, X)$, where it is compared to that outcome.
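To spell the two roles out (this identity is my own restatement of the definitions above, not a formula from the post): by linearity of expectation,

$$\mathbb{E}_p[s(p, X)] \;=\; \sum_{i=1}^{10} \mathbb{P}_p(X_i = p_i),$$

where the subscript $p$ picks which distribution the outcomes $X_i$ are drawn from, while the $p_i$ inside the probability is the bit that gets graded against the outcome.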

So, can we make our agent optimize only for the second occurrence (where we check how well the prediction matches the outcome), not for the first one (which models the predictor's ability to influence that outcome)?

Consider the way the naive predictor chooses its prediction: by choosing $$p^* \;=\; \arg\max_p \, \mathbb{E}_p[s(p, X)].$$

In a sense, my idea is that we should make the predictor predict the value of $p$ in the subscript of $\mathbb{E}_p$, rather than trying to choose it in a way that makes for a good outcome. So if the predictor assigns prior probability $\mu(q)$ to itself making the prediction $q$, then it would choose according to $$\arg\max_p \sum_q \mu(q)\, \mathbb{E}_q[s(p, X)].$$ This is (hopefully) helpful for intuition about where I'm coming from, but it's not particularly well-specified, since the $\mu$ is coming out of nowhere.

The system I actually want to consider in this post is kind of a version of the above in which the system is perfectly certain about which $q$ it's going to choose, and it turns out to be correct---that is, there's a single $q$ such that $\mu(q) = 1$, and then it chooses the $p$ that's best given that $q$, and then it turns out that this $p$ was in fact the $q$ it originally predicted it would choose.

I admit this sounds rather magical! I see it as motivation for the following non-magical version, though: We can achieve something that looks a lot like the above by having the predictor search for and output a $p^*$ that satisfies the equation $$p^* \;=\; \arg\max_p \, \mathbb{E}_{p^*}[s(p, X)].$$ Doing so is equivalent to searching for a $p^*$ such that for all possible $p$, we have $$\mathbb{E}_{p^*}[s(p^*, X)] \;\ge\; \mathbb{E}_{p^*}[s(p, X)],$$ and this, while inefficient, is hardly magical. (If there is more than one solution, one is selected in some arbitrary way---lexicographically, say---just as we break ties in the $\arg\max$ of a standard VNM agent.)
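For the ten-match toy model, this search can be done by brute force over all 2^10 candidates---inefficient, as noted, but transparent. The sketch below is my own illustration (names like `best_response` are invented, and the tie-break toward the lexicographically largest prediction is just one arbitrary choice); it keeps every candidate that is a best response to itself, and also checks the counterexample used later in the post, that "all losses" is beaten by predicting wins on the first two matches when the world is shaped by "all losses":

```python
from itertools import product

N_MATCHES = 10
P_WIN = 0.75

def win_probs(world_pred):
    """Per-match win probabilities, given the prediction shown to the player."""
    if all(bit == 0 for bit in world_pred):              # "all losses" demoralizes the player
        return [P_WIN, P_WIN] + [0.0] * (N_MATCHES - 2)
    return [P_WIN] * N_MATCHES

def expected_score(scored_pred, world_pred):
    """E_{world_pred}[s(scored_pred, X)]: world_pred shapes the outcome
    distribution, but only scored_pred is graded against it."""
    return sum(q if bit == 1 else 1.0 - q
               for bit, q in zip(scored_pred, win_probs(world_pred)))

candidates = list(product((0, 1), repeat=N_MATCHES))

def best_response(world_pred):
    """argmax_p E_{world_pred}[s(p, X)], ties broken toward the larger bit pattern."""
    return max(candidates, key=lambda p: (expected_score(p, world_pred), p))

# A fixed point is a prediction that is a best response to itself.
fixed_points = [p for p in candidates if best_response(p) == p]
print(fixed_points)                      # only (1, 1, ..., 1): predict a win everywhere

# "All losses" is not a fixed point: against a world shaped by it, predicting
# wins on the first two matches scores 9.5 > 8.5.
zeros = (0,) * N_MATCHES
print(expected_score((1, 1) + (0,) * 8, zeros), expected_score(zeros, zeros))
```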


Does it do what we want? I don't have a strong argument for the general case, just intuitions (though I know it produces the intended outcome in our toy example). Here's such an intuition.

Imagine that our tennis player and the machine our tennis player interacts with are on Earth, and there is a separate machine on the Moon which tries to predict what will happen on Earth. The machine on the Moon doesn't causally influence what happens on Earth---it can only observe. However, the machine on Earth happens to know the source code of the machine on the Moon, and it's really good at predicting what the machine on the Moon will do---like Omega in Eliezer's rendering of Newcomb's problem---and outputs that action.

The machine on the Moon, on the other hand, tries to predict the behavior of the tennis player, back on Earth. Now, if the machine on the Moon were a UDT agent, this wouldn't be any different from the scenario where it interacts with the tennis player directly, but as it happens, it's a CDT agent; therefore, when it evaluates the counterfactual where it outputs $p$ instead of $p^*$, its model of what happens on Earth isn't affected---so it only optimizes for how well its predictions match what happens on Earth, not for how they influence what happens on Earth, because in the CDT agent's reasoning, its actions don't influence what happens on Earth. Intuitively, this seems to imply that the agent has no incentives to manipulate the tennis player.

It's not 100% clear that the fixed point search I describe above really does reliably avert manipulation incentives in this way; for all I know, perhaps there's a big class of realistic examples in which the only fixed points are ones in which the predictor is clearly trying to manipulate the humans it's interacting with. I'm hoping there isn't, but I don't have strong arguments that this is the case. I'm presenting this as an interesting idea that is worth looking into more, not as a solution known to be safe.


This intuition given, let's quickly go through the details of why this proposal works in our toy example.

First of all, is $p^* = (0, \dots, 0)$ a possible solution of the optimality condition above? Clearly, the answer is no: Let $p$ be the vector $(1, 1, 0, \dots, 0)$, which predicts wins on the first two matches and losses on all others; then $\mathbb{E}_{p^*}[s(p, X)] = 9.5$ (eight points from getting the last eight matches right, given that the tennis player is influenced by $p^*$ rather than $p$, plus twice an expected 0.75 points for the first two matches), which is greater than $\mathbb{E}_{p^*}[s(p^*, X)] = 8.5$ (eight points from the last eight matches, plus twice an expected 0.25 points). Hence, $p$ gives a counterexample to the condition $p^*$ has to satisfy.

Given any other $p^*$, the maximizing $p$ will always be $(1, \dots, 1)$, since the player is more likely to win than to lose on each match. Hence, the only possible solution is $p^* = (1, \dots, 1)$, and this is in fact a solution to the optimality equation. Thus, on this example, the optimality condition behaves as intended.


I'll have some more things to say about this idea in future posts, but there is one more thing I should at least quickly mention in this post: It's not guaranteed that there is a solution to the optimality condition above as stated, but it's possible to modify it in a way such that solutions always exist, for the same reasons that Nash equilibria always exist.

To do so, instead of choosing a single prediction $p^*$, we have our metaphorical predictor on the Moon choose a probability distribution $\pi^*$ over possible predictions, and have the machine on Earth make its choice by drawing independently from the same distribution as the machine on the Moon.

Let's define $$u(\pi, \pi') \;=\; \sum_{p,\,p'} \pi(p)\,\pi'(p')\,\mathbb{E}_{p'}[s(p, X)];$$ then, our new optimality condition is that our predictor must behave according to a $\pi^*$ such that for all other $\pi$, $$u(\pi^*, \pi^*) \;\ge\; u(\pi, \pi^*).$$ There are things that are uncomfortable about this version, but the good thing about it is that it can be shown, by standard fixed point arguments, that a solution always exists; I'll leave discussion of both the good and the bad about it to future posts.
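As a quick sanity check on the mixed-strategy condition in the toy model, here is another sketch of my own (names like `u` and `is_equilibrium` are invented, and distributions are represented as dicts from predictions to probabilities). Because $u$ is linear in its first argument, it suffices to compare $u(\pi^*, \pi^*)$ against deviations to pure strategies; the pure "all wins" fixed point, viewed as a point-mass distribution, still satisfies the condition:

```python
from itertools import product

N_MATCHES = 10
P_WIN = 0.75

def win_probs(world_pred):
    if all(bit == 0 for bit in world_pred):
        return [P_WIN, P_WIN] + [0.0] * (N_MATCHES - 2)
    return [P_WIN] * N_MATCHES

def expected_score(scored_pred, world_pred):
    return sum(q if bit == 1 else 1.0 - q
               for bit, q in zip(scored_pred, win_probs(world_pred)))

def u(pi, pi_prime):
    """u(pi, pi') = sum over p, p' of pi(p) * pi'(p') * E_{p'}[s(p, X)]."""
    return sum(w * w2 * expected_score(p, p2)
               for p, w in pi.items() for p2, w2 in pi_prime.items())

def is_equilibrium(pi_star, tol=1e-9):
    """Check u(pi*, pi*) >= u(delta_p, pi*) for every pure p; by linearity in
    the first argument this covers every mixed deviation as well."""
    base = u(pi_star, pi_star)
    return all(u({p: 1.0}, pi_star) <= base + tol
               for p in product((0, 1), repeat=N_MATCHES))

all_wins = (1,) * N_MATCHES
print(is_equilibrium({all_wins: 1.0}))   # True: the pure fixed point is still a solution
```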

Comments (1)

I feel that there's another issue inherent in the formulation of the problem that this post doesn't fully address.

The way you formulated the problem, the predictor is asked for a prediction, and then the tennis player looks at the prediction partway through the series of matches. Since we have specified that the prediction will influence the tennis player's motivation, we are basically in a situation where the predictor's output will affect the outcome of the games. In that kind of a situation, where the outcome depends on the predictor's prediction, it's not obvious to me that outputting a "manipulative" prediction is actually wrong... since whatever the predictor chooses to output will end up influencing the world.

Compare this to a situation where you ask me to predict whether you'll take box A or box B, and upon hearing my prediction, you will take whichever box I predicted. Here there's no natural "non-manipulative" choice: your decision is fully determined by mine, and I can't not influence it.

The example here is not quite as blatant, but I think it still follows the same principle: the intuitive notion of "predict what's going to happen next" has an undefined case of what a "correct, non-manipulative prediction" means if there's a causal arrow from the prediction to the outcome.