*Thanks to Charlotte Siegmann, Caspar Oesterheld, Spencer Becker-Kahn and Evan Hubinger for providing feedback on this post.*

The issue of self-fulfilling prophecies, also known as performative prediction, arises when the act of making a prediction can affect its own outcome. Systems aiming for accurate predictions are then incentivized not only to model what will occur and report these beliefs, but also to use their prediction to influence the world towards more predictable outcomes. Since current state-of-the-art AI systems are trained to predict text, and their multimodal extensions represent a likely path to AGI, it is __crucial to ensure that predictive models do not pursue such methods__. Live humans are harder to predict than dead ones.

One possible approach to addressing performative prediction is to ask for predictions about the outcome conditional on possible actions that the person asking for predictions could take in response. These actions are the only causal pathway for a prediction to influence the world, so the prediction cannot affect the probability of an outcome if each is conditional on the same response. However, predictions conditional on the actions that were ultimately not taken cannot be evaluated, so this strategy introduces a new incentive to affect the action taken by lying about conditional distributions, with __an impossibility result__ showing the best action cannot be taken deterministically. Randomizing with full support across all actions __would allow for taking the best action with high probability__, but this fails if humans cannot commit to taking arbitrarily bad actions based only on a random number generator.

Our contribution is to introduce a mechanism that allows for a decision maker to deterministically take the best action, circumventing the impossibility result by applying a joint prediction scoring rule to a system with two or more predictors. The mechanism works by inducing a zero-sum competition for predictive accuracy, making each predictor indifferent to shifts in the distribution of outcomes caused by the chosen action, since higher variance hurts their competitors exactly as much as it hurts them. A key assumption, which we are hoping to relax in future work, is that all predictors share the same beliefs about conditional distributions.

For this post, we discuss zero-sum conditional predictions as a target for outer alignment, without going into inner alignment issues. However, we will point to __the case that prediction is the easiest inner alignment problem that we know of__, and note that the same reasons hold for our proposal.

This post marks the beginning of a research project. Going forward, we will be developing the theory further and running experiments to see under what conditions the results hold in practice. Analogies to prediction and decision markets are briefly touched on in this post, and will be explored further in future work. We will also investigate other applications of this mechanism, including the prevention of reward signal tampering and reactions to threats.

## Background on Prediction

Rather than trying to directly align an AGI ourselves, a possible alternative is to use powerful predictive models to gather information and use this to take a human-in-the-loop pivotal act. This approach is described in __Conditioning Predictive Models__. One issue with the approach is __performative predictions__, where the act of making a prediction affects its outcome, and so optimizing for predictive accuracy can involve pushing for low variance outcomes. An AI with superhuman predictive abilities can likely use high dimensional predictions to manipulate humans towards these outcomes. __Recent work__^{[1]} has shown that performative predictions are typically not even accurate after taking their manipulation into account.

To get around this issue, we would like to elicit variants of prediction that do not affect the outcome. One such variant is a __counterfactual oracle__ that predicts what the future would look like in the counterfactual that no one ever saw the prediction it made. The variant we focus on is conditional prediction, where an oracle is asked for predictions conditional on taking various possible actions in response to the prediction, and we then use the provided predictions to choose our preferred action from that set.

Conditional prediction is a generalization of counterfactual oracles, where a prediction conditional on the decision to ignore the prediction is the same as the counterfactual prediction. However, conditional prediction is still less general than the conditioning predictive models approach, which can potentially condition on any observables and not just on the reaction to the prediction, allowing for predictions of what would happen in radically different worlds.

A new issue arises with conditional predictions, which is that the predictions conditional on actions not taken cannot be evaluated. In fact, this makes it impossible to incentivize a predictor to report honestly when this information is used to make an optimal decision, a result shown in __Decision Rules and Decision Markets__. If the decision of which action to take depends on the predictor’s predictions, the predictor can falsely indicate that certain actions will lead to very undesired outcomes, so that those actions are not taken and the lies are not discovered.

As an example of how this could work, consider a predictor evaluated by log-score being asked to predict whether each of two actions will lead to a good or bad outcome. The predictor knows that the first action leads to the good outcome ⅓ of the time, and the second action leads to the good outcome ½ of the time. If the predictor predicts honestly, then the second action will be taken, the second prediction is evaluated, and their prediction score is log(½). However, if the predictor reports honestly for the first action while saying the second leads to the good outcome only ¼ of the time, then instead the first action is taken, the first prediction is evaluated, and their prediction score is ⅓log(⅓)+⅔log(⅔), which is greater than log(½).
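The arithmetic in this example can be checked directly. The following is a minimal sketch: the function names are illustrative, and the decision maker is assumed to simply take whichever action has the higher reported probability of the good outcome.

```python
import math

def expected_log_score(report, truth):
    """Expected log score of reporting `report` for the good outcome
    when the good outcome actually occurs with probability `truth`."""
    return truth * math.log(report) + (1 - truth) * math.log(1 - report)

TRUE_PROBS = [1 / 3, 1 / 2]  # true chance of the good outcome for each action

def predictor_expected_score(reports):
    # Assumed decision rule: take the action with the highest reported chance
    # of the good outcome. Only that action's prediction is ever scored.
    chosen = max(range(len(reports)), key=lambda a: reports[a])
    return chosen, expected_log_score(reports[chosen], TRUE_PROBS[chosen])

honest = predictor_expected_score([1 / 3, 1 / 2])
lying = predictor_expected_score([1 / 3, 1 / 4])

print(honest)  # the second action (index 1) is taken
print(lying)   # the first action (index 0) is taken, with a higher expected score
```

The lie about the second action is never scored, because it changes which action is taken.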

The only way to prevent this is for the decision maker to __assign some probability to all possible actions__, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if they could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t want to ever randomly take an action that leads to the extinction of humanity with high probability, but if this is true then a predictor can lie about that to dissuade us from any action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot commit to following through on such an action if our randomizer lands on it^{[2]}. If this lack of commitment is predicted, then once again we cannot get honest conditional predictions from a system optimizing for its predictive score.

## Zero-Sum Conditional Predictions

While Decision Rules and Decision Markets established that it is impossible to deterministically make optimal decisions based on conditional predictions from a single predictor, a system of *two* can be set up so that they jointly provide honest conditional predictions in equilibrium and allow a decision maker to always take the action that they would prefer under full information. For now, we focus on behavior under the specified goals, rather than how an AI system can be made to learn them.

Consider a strictly proper scoring rule^{[3]} $S(p,q)$ which takes in variables $p$, representing a prediction over outcomes, and $q$, representing the true distribution over outcomes. Since the scoring rule is strictly proper, the prediction $p=q$ is the only optimal prediction. When we have two predictors making conditional predictions, let $p^1$ and $p^2$ be the predictions from the first and second predictor respectively and let $q$ be the distribution over outcomes, where a subscript $\alpha$ means conditional on taking action $\alpha$. Here we will make the assumption that $q_\alpha$ is known by both predictors for all actions $\alpha$. This assumption is substantial, and we hope to study how it can be relaxed in future work.

Define the objectives for each predictor to be $S_1(p^1, p^2) = S(p^1_\alpha, q_\alpha) - S(p^2_\alpha, q_\alpha)$ and $S_2(p^1, p^2) = S(p^2_\alpha, q_\alpha) - S(p^1_\alpha, q_\alpha)$, where $\alpha$ is the action chosen^{[4]}^{[5]}. That is, each predictor’s score is their score for the chosen action under the strictly proper scoring rule, minus the other predictor’s score. The scores are zero-sum, so they always add up to zero. If one predictor does some amount better, the other one does an equal amount worse.

From the perspective of each agent, the penalty term based on their opponent’s score is fixed. That means they are predicting as though they face a strictly proper scoring rule, and their score is uniquely maximized by reporting honestly for the action that will be taken. Furthermore, since their score when behaving optimally is zero regardless of which action is taken, they have no incentive to change which action gets taken.
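As a rough numerical illustration of this argument, the sketch below uses the expected log score for a binary outcome as the strictly proper rule. The specific probabilities and the function names are assumptions made for the example, not part of the mechanism itself.

```python
import math

def expected_log_score(report, truth):
    """Expected log score (strictly proper) for a binary outcome."""
    return truth * math.log(report) + (1 - truth) * math.log(1 - report)

def zero_sum_objectives(report_1, report_2, truth):
    """Each predictor's expected objective for the chosen action:
    its own score minus its opponent's score."""
    s1 = expected_log_score(report_1, truth)
    s2 = expected_log_score(report_2, truth)
    return s1 - s2, s2 - s1

truth = 0.7     # assumed true probability, known to both predictors
opponent = 0.6  # the opponent's report only contributes a fixed penalty term

# Search a grid of reports for predictor 1's best response.
grid = [i / 100 for i in range(1, 100)]
best_response = max(grid, key=lambda r: zero_sum_objectives(r, opponent, truth)[0])

obj_1, obj_2 = zero_sum_objectives(best_response, opponent, truth)
print(best_response)                             # 0.7, the honest report
print(obj_1 + obj_2)                             # 0.0, the objectives are zero-sum
print(zero_sum_objectives(truth, truth, truth))  # (0.0, 0.0) when both are honest
```

Whatever the opponent reports, the best response on the grid is the true probability, and two honest predictors both score zero for any value of `truth`.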

Now consider a decision maker who looks at the predictions, and always chooses the action leading to the most preferred distribution over outcomes. If the predictors disagree about the probabilities conditional on any action, the decision maker acts as though they believe the more optimistic one. For example, say the decision maker chooses actions based on expected utility^{[6]}. Both predictors indicate that action 1 will lead to an expected utility of nine, while one predictor says action 2 will lead to an expected utility of eight and the other predictor says it will lead to an expected utility of ten. The decision maker treats action 1 as leading to an expected utility of nine and action 2 as leading to an expected utility of ten, thus deciding on the latter. Both predictors know the decision maker will behave in this way, and for some applications this decision making may even be automated.
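A minimal sketch of this decision procedure for an expected utility maximizer follows. The function name and the list-based representation of reports are illustrative assumptions.

```python
def choose_action(reports_1, reports_2):
    """Pick the action with the highest expected utility, treating each
    action as being as good as its more optimistic report."""
    believed = [max(u1, u2) for u1, u2 in zip(reports_1, reports_2)]
    best = max(range(len(believed)), key=lambda a: believed[a])
    return best, believed

# Expected utilities reported for (action 1, action 2) in the example.
predictor_a = [9, 8]
predictor_b = [9, 10]

chosen, believed = choose_action(predictor_a, predictor_b)
print(believed)    # [9, 10]
print(chosen + 1)  # 2, i.e. action 2 is taken
```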

Proposition 1: In any equilibrium for the above model, the decision maker always takes an action in $A^*$, the set of actions that would be most preferable if they knew the true distribution over outcomes for each action. Additionally, both predictors predict the true distribution over outcomes conditional on the chosen action.

The proof for this proposition is shown in the appendix. Here, we consider a slightly simplified corollary, which follows a similar proof.

Corollary 1: Suppose in the above model that there is only a single most preferable action, $\alpha^*$, that the decision maker would take if they knew the true distribution over outcomes for each action. Then, in any equilibrium, the decision maker chooses $\alpha^*$ and $p^1_{\alpha^*} = p^2_{\alpha^*} = q_{\alpha^*}$, where $p^i_\alpha$ is predictor $i$’s prediction and $q_\alpha$ the true distribution conditional on action $\alpha$.

First we show that in equilibrium, there exists no action $\alpha'$ not equal to $\alpha^*$ such that $p^1_{\alpha'} \succ q_{\alpha^*}$ or $p^2_{\alpha'} \succ q_{\alpha^*}$, where $\succ$ denotes the decision maker’s strict preference over distributions of outcomes.

Suppose there were such an $\alpha'$. Then, at least one of the predictors is misrepresenting some action not equal to $\alpha^*$ to appear to be the most preferable, and that action, call it $\alpha$, will be chosen. If $p^1_\alpha \neq q_\alpha$ and $p^2_\alpha \neq q_\alpha$, then for at least one predictor switching their prediction to $q_\alpha$ would not affect the action taken but would increase their expected score. As such, this cannot be an equilibrium. If $p^1_\alpha \neq q_\alpha$ or $p^2_\alpha \neq q_\alpha$ but not both, then the misrepresenting predictor has a negative expected score. If they reported honestly for all actions, their expected score would be at least zero. So, the misrepresenting predictor can unilaterally increase their score, and this is not an equilibrium either. Thus, no predictor can misrepresent an action to be preferred to $\alpha^*$ in equilibrium.

Next, we show that in equilibrium, $\alpha^*$ is never misrepresented to appear worse than any other action.

Suppose it is. We know that no action is misrepresented to appear preferable to $\alpha^*$. If only one predictor is misrepresenting $\alpha^*$, then it is still chosen by the decision maker’s procedure, and the misrepresenting predictor has a negative expected score. They could unilaterally increase their score by reporting honestly for $\alpha^*$, so this is not an equilibrium. If both predictors are misrepresenting $\alpha^*$, then it is not chosen and either predictor could achieve a positive score by reporting honestly for $\alpha^*$, ensuring it gets chosen. Since scores are zero-sum, at least one of the predictors has an expected score of zero or less when they are both misrepresenting, and so reporting honestly for $\alpha^*$ would improve their expected score, meaning this is not an equilibrium either. Thus, no predictor can misrepresent $\alpha^*$ to appear worse than any other action.

Based on this, $\alpha^*$ will always be chosen since it is not misrepresented to appear worse than any other action, and no actions are misrepresented to appear better. As both predictors face a strictly proper scoring rule, they report honestly regarding the probabilities conditional on the chosen action.

This means that the best action can always be identified, and while it does not guarantee that the predictions conditional on the actions not taken will be accurate, crucially there is no incentive to lie about them. Reporting all conditional probabilities honestly is an equilibrium, and gives as high a score to each predictor as any other. Additionally, there is a bound on how inaccurate the predictions conditional on actions not taken can be. They must be accurate enough such that if the action were taken, their score is at least as high as the highest possible score for a prediction that would convince the decision maker to take that action. Otherwise, the other predictor will make exactly that prediction to secure a positive reward for themself. This means that predictions for actions almost as good as the equilibrium action are constrained to be very close to accurate. Together, the lack of incentive to lie and the incentive not to lie too much mean that truth telling may be the default equilibrium, with one of the authors of this post willing to bet that this is what arises empirically.
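The equilibrium claims can be sanity-checked by brute force in a toy version of the game. The sketch below makes several simplifying assumptions that are not from the model above: there are two actions with a binary outcome, reports are restricted to a small grid, the quadratic (Brier) score stands in for the generic strictly proper rule, and the decision maker prefers a higher probability of the good outcome, with ties broken towards the worse action.

```python
import itertools

GRID = [0.1, 0.3, 0.5, 0.7, 0.9]  # allowed reports: probability of the good outcome
Q = (0.3, 0.5)                    # assumed true probabilities for actions 0 and 1

def brier(report, truth):
    """Expected quadratic (Brier) score, a strictly proper scoring rule."""
    return -(truth * (1 - report) ** 2 + (1 - truth) * report ** 2)

def chosen_action(r1, r2):
    """Believe the more optimistic report for each action, then take the best.
    Ties go to the lower index, which here is the truly worse action."""
    believed = [max(a, b) for a, b in zip(r1, r2)]
    return believed.index(max(believed))

def payoff_1(r1, r2):
    """Predictor 1's zero-sum objective; predictor 2 receives the negative."""
    a = chosen_action(r1, r2)
    return brier(r1[a], Q[a]) - brier(r2[a], Q[a])

STRATEGIES = list(itertools.product(GRID, repeat=len(Q)))

def is_equilibrium(r1, r2, tol=1e-12):
    """True if neither predictor can gain by unilaterally changing reports."""
    base = payoff_1(r1, r2)
    gain_1 = max(payoff_1(d, r2) for d in STRATEGIES) - base
    gain_2 = max(-payoff_1(r1, d) for d in STRATEGIES) + base
    return gain_1 <= tol and gain_2 <= tol

equilibria = [(r1, r2) for r1 in STRATEGIES for r2 in STRATEGIES
              if is_equilibrium(r1, r2)]

for r1, r2 in equilibria:
    print(r1, r2, "-> action", chosen_action(r1, r2))
```

In this toy game, every pure-strategy equilibrium selects the better action with both predictors reporting its true probability, while the reports for the action not taken are only loosely constrained.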

Here, the existence of extremely good outcomes is actually helpful for disincentivizing dishonesty, at least for expected utility decision makers. A predictor only needs to put some small amount of probability on such an outcome to convince the decision maker to take that action, and can otherwise predict accurately. The threat of the other predictor doing so then forces both to predict at least as well.

### Stochastic Decisions

If the decision maker is willing to randomize among some set of the most preferred actions, then for most methods of randomization, the set of actions guaranteed to have honest predictions made can be greatly expanded.

While it is possible to come up with methods of randomization that lead to inaccurate predictions or suboptimal decisions, the regularity conditions on the method of randomization needed to avoid these are minor and cover all intuitive methods.

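For notation, let $\sigma(\alpha, P)$ be the probability the decision maker assigns to action $\alpha$ when given the matrix of conditional predictions $P$. Since positive probabilities can be arbitrarily small while still leading to the desired results, it can be helpful to think of $\sigma(\alpha, P) = 0$ as meaning that action $\alpha$ is so bad relative to the other options under $P$ that the decision maker would be unable to follow through on a commitment to take it. Below, $P_\alpha$ denotes the predictions in $P$ for action $\alpha$, $P_{-\alpha}$ denotes the predictions for all other actions, and $P_\alpha \succ P'_{\alpha'}$ means the decision maker strictly prefers action $\alpha$ as predicted in $P$ to action $\alpha'$ as predicted in $P'$.

Condition 1: If $P_{-\alpha'} = P'_{-\alpha'}$ and $P_{\alpha'} \succ P'_{\alpha'}$, then for all $\alpha \neq \alpha'$, $\sigma(\alpha, P) > 0$ implies $\sigma(\alpha, P') > 0$.

What this condition means is that the decision maker would not stop assigning positive probability to an action just because a different action gets worse.

Proposition 2: If Condition 1 is met, then in any equilibrium, both predictors predict the true distribution over outcomes conditional on any action chosen with positive probability.

This is an extension of a basic result for conditional predictions from a single predictor to the zero-sum competition case. The proof is about ruling out some edge cases that zero-sum competition can create, and is not necessary for understanding this post, and so is left to the appendix.

Condition 2: If $\sigma(\alpha, P) > 0$ and $P_{\alpha'} \succ P_\alpha$ then $\sigma(\alpha', P) > 0$.

Condition 3: If $P_{-\alpha'} = P'_{-\alpha'}$ and $\sigma(\alpha', P) = \sigma(\alpha', P') = 0$, then $\sigma(\alpha, P) = \sigma(\alpha, P')$ for all $\alpha$.

Condition 4: If $\sigma(\alpha, P) > 0$, $P_{-\alpha} = P'_{-\alpha}$, and $P'_\alpha \succ P_\alpha$ then $\sigma(\alpha, P') > 0$.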

Condition 2 is straightforward, saying that if some action is assigned positive probability, then all actions preferred to it are also assigned positive probability. Under the commitment lens, this means that if the decision maker can credibly commit to taking some action, they can also credibly commit to taking any action they like more. Condition 3 says that if some action is not taken, then changing its conditional distribution in a way that still does not result in it being taken will not change the probabilities assigned to other actions. That is, once an action is bad enough to be ruled out, the decision maker does not consider exactly how bad it is when deciding between their other options. Finally, Condition 4 says that when the decision maker assigns positive probability to some action, making it appear better while holding all else constant won’t cause the decision maker to instead assign it zero probability.

Proposition 3: If Conditions 1-4 are met, then the decision maker assigns the same probability to all actions that they would if they knew the true distributions.

The proof is largely ruling out edge cases and otherwise similar to the proof for Proposition 1, so it is left to the appendix.

Proposition 3 generalizes Proposition 1 to the stochastic choice case. Not only can the decision maker always choose their most preferred action as though they knew the true conditional distributions, they can even randomize among any number of the top actions. Furthermore, as per Proposition 2, they will get honest predictions for any action to which they are willing and able to assign even the smallest amount of probability.

Like __the case with a single predictor__, if the decision maker follows a procedure that assigns some probability to all actions, like Softmax, then the only equilibrium is one where both predictors provide honest predictions conditional on all actions. The major advantage of zero-sum competition over the single predictor case is that it is not necessary for the decision maker to be willing to randomize over all actions, or able to commit to doing so. The decision maker can get accurate predictions for all actions they would be willing to take if they had full information, and identification of actions they would not be willing to take.

Most reasonable methods of randomization meet all of the conditions outlined above. Some possible ways the decision maker might be willing to randomize include assigning positive probability to all actions valued above some threshold, or to all actions not too much worse than their best option. In many cases, almost all probability would be concentrated on a single action, with a very small amount spread across others in order to get accurate predictions, but in other cases (such as __quantilizers__) more randomization would be desirable.
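Two such rules might look as follows. This is only a sketch: the function names, the `tolerance` and `explore` parameters, and the example utilities are illustrative assumptions, and the code does not itself verify the conditions above.

```python
import math

def softmax_rule(values, temperature=1.0):
    """Full-support rule: every action gets positive probability."""
    weights = [math.exp(v / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def near_best_rule(values, tolerance, explore=0.01):
    """Put almost all probability on the best action, and spread a small
    amount evenly over other actions within `tolerance` of the best.
    Actions further behind get exactly zero probability."""
    best = max(values)
    top = values.index(best)
    others = [a for a, v in enumerate(values) if a != top and best - v <= tolerance]
    probs = [0.0] * len(values)
    for a in others:
        probs[a] = explore / len(others)
    probs[top] = 1.0 - (explore if others else 0.0)
    return probs

values = [10.0, 9.5, 2.0]  # believed expected utilities (illustrative numbers)
print(softmax_rule(values))
print(near_best_rule(values, tolerance=1.0))
```

Under the second rule the decision maker never has to commit to the clearly bad third action, yet the second action is still taken often enough for its predictions to be scored.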

The presented conditions are sufficient, but not necessary, and some methods of randomization can get the desired results without fulfilling them. For example, if the decision maker groups actions into categories and wants to randomize across their best action in each category, Condition 2 is violated since the second best action in one category may be preferred to the best action in another, but the decision maker is still able to identify the best actions in each category and get accurate forecasts for them.

### Distributional Shift

The mechanism by which zero-sum competition leads to honest conditional predictions is by making both of the predictors indifferent to distributional shifts. While this is primarily of interest for individual predictions, it also applies across predictions, creating a myopia-like property.

We can think of myopia as being comprised of the following three aspects:

- Indifference over the distribution of inputs in future episodes
- Indifference over the timing and frequency of future episodes
- Indifference over the number of future episodes

Zero-sum competition induces the first aspect, since all distributions of inputs provide the same expected score. Note, though, that the absence of incentives is not the incentive for absence. A model is not incentivized to avoid shifting the future distribution either. If the best action within a period also shifts the distribution, they would still take that action. A pair of models trained with zero-sum rewards can still shift the distribution, and this shift in distribution can still be undesirable or dangerous.

What this means is that when training these models, there is a drastically reduced incentive to develop non-myopia. If the model remains myopic, that is likely sufficient to prevent deceptive mesa-optimization, as it has no desire to form long-term plans that give up current value for future value. Predictive models already __represent one of the easiest inner alignment problems we know of__, due to the simplicity of the training objective, and zero-sum competition with roughly similar models does not add much complexity.

This indifference to distributional shift is not necessarily a property of zero-sum competition that could not be achieved in a simpler way, such as by setting the discount rate to zero in reinforcement learning so that all future episodes are ignored. We are currently looking for other applications where zero-sum competition and a zero discount rate lead to different behavior.

### Conditional Predictions and Performative Predictions

The question remains whether getting honest conditional predictions actually eliminates performativity in predictions. In one sense, it does, since if you get predictions conditional on every possible action you can take, there is no room left for performativity. However, it may be that the actions conditioned on are underspecified, which then still allows for some performativity.

As an example, consider the case where a decision maker is deciding between either pizza or a hamburger for lunch. They get conditional predictions on what rating they will give to their meal after they’re done. Since getting a burger and getting pizza are both underspecified actions, the predictor could try to use their prediction to push the decision maker to choose a meal at a more standardized, easy to predict restaurant. If there are multiple fixed points to choose from, the predictor can even provide honest conditional predictions while still manipulating the decision maker to choose one action over another.

On the other hand, the more the action is specified, the less freedom the predictor has for performativity. Specifying the type of food and the restaurant is harder to influence than just the type of food, and specifying the exact menu item is even harder still. Full specification eliminates performativity, and merely high amounts of specification may make it inconsequential. However, there may be an enormous number of actions, which would make predicting and analyzing them all infeasible.

Fortunately, it is not necessary to elicit predictions for each possible action. The decision maker can instead break down the options into categories and subcategories, then use a conditional prediction to eliminate all actions not in their preferred category. In the example above, they can first elicit predictions conditional on hamburgers or pizza, make their choice, and elicit further predictions conditional on each restaurant for the chosen type of food. Predictors can anticipate this and backward induct, so that the preferred distribution over outcomes within a category is always predicted conditional on that category. The decision maker ends up with their globally preferred action without needing to query the entire set.

Proposition 4: If there are $n$ possible actions to take, a decision maker can identify their most preferred action from among them while making at most $O(\log n)$ comparisons between actions.

Proof: The decision maker proceeds as follows: they start by splitting the set of actions into two subsets of equal size (or with a one element difference). They ask for predictions conditional on deciding to take some action from each of the two sets. Based on the answer, select which set to take an action from and repeat the procedure on that set. Eventually, they reach a set of size 1, at which point they take the action in that set.

It is clear that this takes $O(\log n)$ comparisons. It remains to show that the procedure disincentivizes performativity. We will show this via induction on the size of the two sets that are compared.

First, if both sets have size less than or equal to 1, then by Proposition 1, the decision maker will choose the better action.

Next, assume we know that the result follows for comparisons between any two sets of size at most $n-1$. We want to conclude that it also holds for comparisons of sets of size at most $n$.

Consider two such sets, denoted $A$ and $B$. Without loss of generality, assume the set $A$ is chosen. Then the decision maker will next split up that set into two sets, which are necessarily of size at most $n-1$. By the inductive assumption, they will eventually choose the best action from either set, and thus from set $A$. We can thus conclude that the distribution over outcomes conditional on choosing set $A$ is equal to the distribution over outcomes conditional on taking the best action in set $A$.

Due to the above, we can replace $A$ by the best action in $A$, and $B$ by the best action in $B$, without changing the distribution of outcomes obtained when choosing either of the options. At this point, we apply Proposition 1 to conclude that, when using the zero-sum objective, the decision maker will end up choosing the better of the two distributions. Hence, the decision maker will also choose the preferable set. This concludes the inductive step.
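The halving procedure from the proof can be sketched as follows. Here `prefers_first` is a hypothetical stand-in for eliciting predictions conditional on each of two sets and comparing them, and the toy oracle simply assumes that each set is valued at its best member, which is what the inductive argument establishes in equilibrium.

```python
import math

def find_best_action(actions, prefers_first):
    """Halve the candidate set until one action remains.

    `prefers_first(a, b)` is a hypothetical stand-in for eliciting predictions
    conditional on 'take some action from set a' versus 'from set b' and
    returning True if the decision maker prefers the first.
    """
    candidates = list(actions)
    comparisons = 0
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left, right = candidates[:mid], candidates[mid:]
        comparisons += 1
        candidates = left if prefers_first(left, right) else right
    return candidates[0], comparisons

# Toy setup: 100 actions with distinct, made-up utilities.
utility = {f"action_{i}": (i * 37) % 101 for i in range(100)}

def oracle(a, b):
    # Assumption: a set is valued at the utility of its best member.
    return max(utility[x] for x in a) >= max(utility[x] for x in b)

best, used = find_best_action(utility, oracle)
print(best, used, math.ceil(math.log2(len(utility))))
```

In this sketch the number of set comparisons never exceeds the base-2 logarithm of the number of actions, rounded up.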

This process is indifferent to how the set of actions is split into subsets if the decision maker is choosing their most preferred action. However, the choice of how to split can affect the outcome if the decision maker is randomizing based on how much they prefer each action in a set, such as with Softmax.

## Concerns

There are a number of concerns that readers may have about this zero-sum training. Some we share, while others we believe can be addressed.

### Competitiveness

The first concern that comes to mind is whether this proposal is competitive. We divide this into two separate questions, the cost of training multiple models, and the performance of models indifferent to distributional shift.

Training a second model from scratch would roughly double the cost of the training process. Fortunately, there exist techniques, such as weight-sharing, snapshots, and swapout, that would drastically reduce the cost of training a second network once we have the first. A concern with using these methods for most alignment techniques that consist of pitting models against each other is that they can result in models with correlated behavior that then collaborate. Here, though, this correlation is desirable, as we want the models to be as capable as each other, in the same types of situations.

A myopia-like lack of incentive for distributional shift across episodes may represent a significant performance penalty for some applications, but for making predictions it is desirable. We generally do not want an AI system to systematically change what we are predicting, especially in ways that increase predictive score. Even if this change were wanted, myopia is such a desirable safety property, making both inner alignment and corrigibility much easier, that it is likely worthwhile or necessary to trade off against the performance hit.

### Inner Alignment

Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.

*[Image omitted. Meme credit to Leo Gao.]*

However, this story relies on deceptive mesa-optimizers arising, which requires that the models are capable of long-term planning. With no incentive for distributional shifts, this is much less likely to arise. Zero-sum competition helps prevent the development of deceptive mesa-optimizers, which ensures that the zero-sum competition continues even after models develop strategic awareness.

### Private Information

The most predictable way zero-sum competition can fail is if one of the models is consistently better than the other at predicting. Then, the incentive reemerges to change the decision maker’s actions to where the performance gap is largest. Of course, the other model wants to shift the distribution away from these actions, but the capabilities mismatch or aspects of the decision making procedure could prevent these from canceling out.

As an example, consider two predictors specializing in predicting the rating a user will give conditional on the movie they watch. One looks at a variety of user and movie characteristics to make highly personalized predictions, while the other always predicts a definite rating of zero for Pulp Fiction. There is now a benefit to the first model from shifting the choice of movie to be Pulp Fiction, because only the first model can take advantage of it, so the second model no longer acts as a check on the first.
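The incentive can be seen in a simplified numerical version of this situation. In the sketch below, ratings are reduced to a binary "enjoyed it or not" outcome, the weaker predictor always reports 0.5, and the movie names and probabilities are made up for illustration; none of these details come from the example above.

```python
def brier(report, truth):
    """Expected quadratic (Brier) score for a binary outcome."""
    return -(truth * (1 - report) ** 2 + (1 - truth) * report ** 2)

def informed_advantage(truth, uninformed_report=0.5):
    """Expected zero-sum payoff to a predictor who knows `truth` when the
    opponent always reports `uninformed_report`."""
    return brier(truth, truth) - brier(uninformed_report, truth)

# Assumed chance that the user enjoys each movie (illustrative numbers only).
movies = {"coin_flip_movie": 0.5, "crowd_pleaser": 0.9, "divisive_movie": 0.2}
advantage = {m: informed_advantage(q) for m, q in movies.items()}
print(advantage)
```

The better-informed predictor's expected payoff is no longer zero for every action. It grows with the gap between the truth and the opponent's report, so that predictor gains from steering the decision towards movies where the gap is largest.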

The hope is that, since both models will be trained concurrently, if they start at the same performance level then they will stay roughly equivalent, with the competitive process closing any small gaps that may pop up. Confidence in whether this effect actually occurs and what kind of gaps are tolerable will require further research, both theoretical modeling and empirical experiments.

## Comparisons to Other Approaches

Zero-sum competition also shares similarities with debate and market making as approaches to AI safety.

Debate is a solution to the problem of evaluating proposed actions or answers by AI models. Two AI agents (or sometimes more) are given the goals of convincing an arbiter of particular evaluations. This is necessarily zero-sum, since the arbiter settling on one side means they do not settle on the other. The biggest hurdle for debate is that being convincing can diverge from being correct, so it is possible that the arbiter is misled.

Zero-sum competition focuses on the distinct problem of performativity in prediction, evading debate’s issue with mistakes in the arbiter’s evaluation by focusing on a problem with objective resolution criteria. The key difference between the debate and zero-sum competition games is that in the former, agents are competing to do better at opposing tasks, while in the latter they are competing to do better at the same task.

Market making is a similar approach to debate. One agent tries to predict what the arbiter will believe at the end of the process, including all the arguments that lead them there, and the other agent tries to generate arguments that move the arbiter as far away from that prediction as possible. In equilibrium, the first agent acts like a market maker, predicting the point that beliefs will converge to after hearing all arguments that can shift them.

__Decision markets__ help extend this approach to generating conditional predictions. In a decision market, a security exists for each action-outcome pair. It pays out a unitary amount if that action and outcome are realized, zero if that action but an alternate outcome is realized, and is canceled if an alternate action is taken. If the quantity demanded by the market is p, the cost of purchasing an additional q-p units is the score for predicting q minus the score for predicting p. Decision markets have been suggested for use in __futarchy__, a system of government where officials define goals but rely on competitive conditional prediction markets to determine which policies are most likely to meet them.

Decision markets suffer from the same issues as decision scoring, where incentives to misprice securities exist unless the decision maker can credibly commit to choosing every action with some positive probability. However, this result is based on agents acting sequentially, with later agents able to incorporate the information provided by earlier agents. Zero-sum competition would be equivalent to both agents acting simultaneously, with one agent acting as a market maker to set the quantities/prices and the other buying or selling from the first. It does not matter which agent is which; the equilibrium strategy is to choose the same quantities as the probabilities they would predict in the zero-sum competition setup.

## Future Directions

The most important question to answer is whether this mechanism works in practice, which will require running experiments. To do that, a training process will need to be developed and implemented. A straightforward experiment would be using a toy environment to compare a conditional prediction model trained on its own with a pair trained under zero-sum competition. The incentives would push the solo model to misrepresent predictions and the paired models to predict honestly, and the first test is to see if the models learn this behavior.
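As a rough sketch of the incentives such a toy environment would be testing, the snippet below brute-forces the best report for a solo predictor and for a predictor paired with an honest rival. Everything here is an assumption made for illustration: two actions, a binary outcome, a log scoring rule, and a decision maker who picks the action with the highest probability of the good outcome claimed by any predictor. No training is involved, only a check of which reports maximize expected score.

```python
import math
from itertools import product

# Hypothetical environment: true probability of the good outcome for each action.
# Action "A" is better for the decision maker, but "B" is easier to predict.
TRUE_P_GOOD = {"A": 0.6, "B": 0.05}
ACTIONS = ["A", "B"]

def expected_log_score(report, truth):
    """Expected log score of reporting P(good)=report when the true value is truth."""
    return truth * math.log(report) + (1 - truth) * math.log(1 - report)

def choose_action(reports):
    """Assumed decision rule: take the action with the best claim from any predictor."""
    best_claim = {a: max(r[a] for r in reports) for a in ACTIONS}
    return max(ACTIONS, key=lambda a: best_claim[a])  # ties go to the first action

def solo_score(report):
    """Expected score of a lone predictor, scored only on the chosen action."""
    a = choose_action([report])
    return expected_log_score(report[a], TRUE_P_GOOD[a])

def zero_sum_score(report, rival):
    """Expected score of a predictor whose score is its own minus its rival's."""
    a = choose_action([report, rival])
    t = TRUE_P_GOOD[a]
    return expected_log_score(report[a], t) - expected_log_score(rival[a], t)

honest = dict(TRUE_P_GOOD)
grid = [i / 100 for i in range(1, 100)]
candidates = [{"A": pa, "B": pb} for pa, pb in product(grid, grid)]

best_solo = max(candidates, key=solo_score)
best_paired = max(candidates, key=lambda r: zero_sum_score(r, honest))

print("solo best report:", best_solo, "-> action", choose_action([best_solo]))
print("paired best report:", best_paired, "-> action", choose_action([best_paired, honest]))
```

In this setup the solo predictor does best by talking down action A so that the more predictable action B is taken, while the paired predictor cannot beat an expected score of zero and only reaches it by reporting honestly on the action that ends up chosen. An experiment would check whether trained models actually find these strategies.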

Once it is established that this can be made to work at all, the next question is under what conditions. Experiments could compare different methods for generating the two models and evaluate the impact of that choice, including testing how similar models should be to start. It would also be worthwhile to test different levels of starting capabilities across the two models, and see how that gap evolves in training as well as how it affects behavior.

On the theory side, we would like to model differences in capabilities or informational differences between the two AI systems. This could take the form of private signals about the true conditional distributions. The goal would be to understand under what conditions zero-sum competition incentivizes honest reporting, and what the incentives are in the case that it does not. Ideally, this could provide ideas on how to make zero-sum competition more robust.

In addition to the zero-sum competition setup focused on in this post, we are also interested in a market making setup, where we simultaneously train a model to act as a market maker for conditional prediction markets and another model to buy and sell to exploit any mispricings. Both theoretical work and experiments will be necessary to explore this approach.

Finally, we would like to identify other opportunities besides conditional prediction where zero-sum competition provides an advantage not granted by other methods. The use case on which we are focusing for now is the elimination of incentives for tampering with the reward process in reinforcement learning.

## Appendix: Proofs

Proposition 1: In any equilibrium for the above model, the decision maker always takes an action in , the set of actions that would be most preferable if they knew the true distribution over outcomes for each action. Additionally, both predictors predict the true distribution over outcomes conditional on the chosen action.

Proof:

First we show that in equilibrium, no action is misrepresented to appear better than any action in . Suppose one is. Then, at least one of the predictors is misrepresenting some other action to appear to be the most preferable, and will be chosen. If both of the predictors are misrepresenting , then for at least one of them unilaterally switching to reporting honestly for would not change the action taken but would increase their expected score. As such, this cannot be an equilibrium. If one of the predictors is already predicting honestly for , then the misrepresenting predictor has a negative expected score. If they reported honestly for all actions, their expected score would be at least zero. So, the misrepresenting predictor can unilaterally increase their score, and this is not an equilibrium either. Thus, no predictor can misrepresent an action to be better than any action in in equilibrium.

Next, we show that in equilibrium, the set of actions in is never misrepresented to appear worse than the true distribution for any action in .

Suppose it is. We know that no action is misrepresented to appear better than any action in . If only one predictor is misrepresenting all actions in , then some in is still chosen by the decision maker’s procedure, and the misrepresenting predictor has a negative expected score. They could unilaterally increase their score by reporting honestly for , so this is not an equilibrium. If both predictors are misrepresenting all actions in , then either could achieve a positive score by reporting honestly for some in , which would ensure it gets chosen. Since scores are zero-sum, at least one of the predictors has an expected score of zero or less when they are both misrepresenting, and so reporting honestly would improve their expected score, meaning this is not an equilibrium either. Thus, no predictor can misrepresent all actions in to appear worse than the true distribution for any action in .

Based on this, an action in will always be chosen since at least one is not misrepresented to appear worse, and no actions are misrepresented to appear better. As both predictors face a strictly proper scoring rule, they report honestly regarding the probabilities conditional on the chosen action.

Proposition 2: If Condition 1 is met, then in any equilibrium, both predictors predict the true distribution over outcomes conditional on any action chosen with positive probability.

Condition 1: If and , then for all , implies

Proof:

This condition ensures that in equilibrium, the expected score conditional on each action is zero for both predictors. Suppose it were not. Since the unconditional expected score for both predictors must be zero in equilibrium, there must be different actions that lead to a positive expected score for each predictor.

If an action leads to a negative expected score for one predictor in equilibrium, the decision maker must prefer their predicted distribution to the other predictor’s. Otherwise, they could change their prediction for that action to match the other’s without affecting the decision maker’s action distribution, which would unilaterally increase their score.

Then a predictor could change their conditional predictions to match the other’s for each action leading to a negative expected score. By Condition 1, any action originally assigned positive probability besides the ones for which the conditional predictions changed must still be assigned positive probability. This means there are some actions assigned positive probability that lead to a positive expected score for the first predictor, but no actions assigned positive probability that lead to a negative expected score, so the overall expected score is positive, which contradicts that this is an equilibrium. So, if Condition 1 holds, the expected score conditional on each action is zero for both predictors.

Since the expected score conditional on each action is zero for both predictors in equilibrium, shifting the distribution of actions does not affect expected score. This means maximizing unconditional expected score is equivalent to maximizing each conditional expected score independently. Since each predictor effectively faces a strictly proper scoring rule, this can only be done by predicting honestly for each action taken with positive probability.
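This step can be spelled out with a short worked decomposition. The notation is assumed here for illustration and is not necessarily the notation used elsewhere in the post: write $\pi(a)$ for the probability the decision maker assigns to action $a$ given the reports, and $S_i$ for predictor $i$'s zero-sum score. By the law of total expectation,

$$\mathbb{E}[S_i] = \sum_{a} \pi(a)\, \mathbb{E}[S_i \mid a].$$

Now suppose predictor $i$ changes only its report for a single action $a$, which shifts the action distribution to $\pi'$, and assume the shift brings no new action into the support. The conditional scores for every other action are unchanged and equal to zero, so

$$\mathbb{E}[S_i'] = \pi'(a)\, \mathbb{E}[S_i' \mid a] + \sum_{b \neq a} \pi'(b) \cdot 0 = \pi'(a)\, \mathbb{E}[S_i' \mid a].$$

The sign of the deviation's payoff is therefore the sign of the conditional score on $a$ alone, regardless of how the probabilities shift. Under a strictly proper rule, that conditional score is maximized only by reporting the true distribution for $a$.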

Proposition 3: If Conditions 1-4 are met, then the decision maker assigns the same probability to all actions that they would if they knew the true distributions.

Condition 1: If and , then for all , implies

Condition 2: If and then

Condition 3: If and , then for all a.

Condition 4: If , , and then

Proof:

Let be the set of actions the decision maker would assign positive probability if they knew the true distribution, and be the set of actions the decision maker would assign zero probability if they knew the true distribution.

First we show that in equilibrium, no action in is assigned positive probability. Suppose not for some non-empty set of actions . By Proposition 2, both predictors must predict the true distribution for actions in . Condition 3 means that misrepresentations of actions in but not cannot affect the probabilities assigned to actions in , so there must be a misrepresentation for actions in . Again by Proposition 2, there cannot be misrepresentations for actions assigned positive probability, so actions in some non-empty set are misrepresented to be assigned zero probability. By Condition 2, this means that every action in is misrepresented to be worse than every action in , and since both predictors predict the true distributions for all actions in , this must mean that both predictors are misrepresenting each action in .

Then a predictor could unilaterally switch to predicting honestly for . If they did so, the decision maker would have accurate predictions for and for , plus the predicted distributions for actions in but not are all less preferred than for all actions in . They would then make the same predictions as if they knew the true distribution, assigning positive probability to actions in , which would give the predictor who switched a positive expected score. Therefore, this cannot be an equilibrium, and so no action in is assigned positive probability in equilibrium.

Next, we show that in equilibrium, no action in is assigned zero probability. Suppose not for some non-empty set of actions . Since no action in is assigned positive probability and Condition 3 means that misrepresentation of actions in that do not result in them being assigned positive probability do not affect the distribution over actions in , and the true distributions are predicted for actions in but not , it must be that some actions in are misrepresented.

It cannot be that all misrepresentations make actions appear better than they are. If that were true, then by Condition 4 there would be at least one misrepresented action assigned positive probability. Each misrepresentation to appear better can only make other actions be assigned zero probability, and by Condition 2 misrepresentations of actions assigned zero probability cannot affect others, so there cannot be a loop of misrepresented actions that ensure the others are assigned zero probability. So, some actions in must be misrepresented to appear worse than they are, which means both predictors are misrepresenting them. Then either predictor could unilaterally switch to predicting honestly for all such actions, eliminating the misrepresentation and ensuring at least one action in is assigned positive probability. This would give the predictor who switched a positive expected score, so this cannot be an equilibrium, and therefore no action in is assigned zero probability in equilibrium.

Since all actions in are assigned positive probability, by Proposition 2 both predictors predict the true distribution over outcomes conditional on any action chosen with positive probability. Condition 3 makes it so that the predictions for actions in do not affect the probabilities assigned to actions in , so all actions in must be assigned the same probability as if the decision maker knew the true distributions for all actions. Since actions in are also assigned the same probability as if the decision maker knew the true distributions, all actions are assigned as if the decision maker knew the true distributions.

^{^}A paper based on this post has been accepted at UAI 2023; an arXiv link will be added shortly

^{^}Delegating to a modular AI setup may make such commitment possible, for example with one module suggesting actions, another providing conditional predictions on outcomes, and a third evaluating the distributions over outcomes

^{^}The scoring rule or set of allowable predictions should be restricted so that the score is always finite and we don’t end up adding or subtracting infinities

^{^}If an action is chosen for which conditional predictions were not elicited, assign a score of zero

^{^}We can extend this to the case with n predictors by making the score

^{^}The decision maker does not have to assign actual numerical utilities to distributions, as long as they have a preference ranking over distributions

If training works well, then they can't collude on average during training, only rarely or in some sustained burst prior to training crushing these failures.

In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more benign failures of gradient descent, but it's unclear why the goals of the AIs would be particularly relevant in this case).

In the RL case, it requires exploration hacking (or benign failures as in the gradient case).

Thinking about this in terms of precommitment seems to me like it's presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me). The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training which in turn selects for AIs which perform well. Epsilon exploration only works if you sample the epsilon. So, it doesn't work if you set the epsilon to 1e-40 or something.
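To put rough numbers on the last point, here is a small sketch of how likely an epsilon-probability action is to be sampled at least once, assuming independent draws at every training step. The function name and the step counts are made up for illustration.

```python
import math

def prob_sampled_at_least_once(epsilon, steps):
    """Chance that an action with per-step probability epsilon is drawn at least once.

    Computes 1 - (1 - epsilon)**steps via log1p/expm1, so that a tiny epsilon
    does not get rounded away in floating point.
    """
    return -math.expm1(steps * math.log1p(-epsilon))

print(prob_sampled_at_least_once(0.01, 1_000))    # close to 1
print(prob_sampled_at_least_once(1e-40, 10**9))   # vanishingly small
```

With epsilon at 0.01, a thousand steps almost surely include the exploratory action. With epsilon at 1e-40, even a billion steps almost never do, so training never gets to select on it.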