Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Johannes Treutlein and Rubi Hudson worked on this post as part of SERI MATS, under the mentorship of Evan Hubinger. Rubi has also received mentorship from Leo Gao. We thank Erik Jenner for helpful discussions and Alexander Pan for bringing the performative prediction literature to our attention.

Update 30 May 2023: We have now published a paper based on our previous post and this post (the material from this post is in Appendix D).

1. Introduction

In our previous post, we analyzed a setting in which an oracle AI is maximizing a strictly proper scoring rule while it can influence the world with its predictions about the probabilities of possible outcomes. In this setting, a prediction is called a fixed point if it accurately reflects the oracle's beliefs about the world, after having made that prediction. We showed that an oracle AI that is maximizing a strictly proper scoring rule may choose to misrepresent its beliefs and not output a fixed point, even if one exists. This is bad since, all else equal, we expect better outcomes when decisions are based on accurate rather than inaccurate reports.

In our previous analysis, we assumed that the oracle jointly optimizes both its prediction and the effect of its prediction on the world. In contrast, in this post we examine cases in which the AI only optimizes its prediction—there is a stop-gradient (Foerster et al., 2018; Demski, 2019) in front of the oracle's model of the world, which prevents optimizing outcomes directly.

This type of cognition could result, for instance, from self-supervised training on historical data. In that case, the AI may learn a robust world model, and it may learn to match its predictions to its beliefs about the world. However, the AI may never learn to optimize outcomes through its prediction, since this yields no advantage on the training data, which cannot be influenced by the AI.

Consider a situation where this oracle makes a prediction that can influence the world, and assume it models this influence correctly. Then this could put the AI in a game in which it is trying to make a prediction to match its world model, while the world model updates its beliefs conditional on the AI's prediction. What happens in this game depends on the cognition learned by the AI during training. However, we show that, in contrast to the incentives discussed in our previous post, the only equilibria of this game would be fixed points. Such an oracle could thus be safer if there exists a unique safe fixed point, or if randomly chosen fixed points are likely safe.

In this post, we review and analyze several settings related to stop-gradients with the property that equilibria in them are fixed points or close to fixed points. Our goal is (i) to gain a better understanding of stop-gradient optimization in oracles and (ii) to highlight some potential training setups and agent cognitions that would lead to fixed points.

Similar ideas to the ones outlined above have been discussed in the literature on performative prediction (Perdomo et al. 2020). Performative prediction is a more general framework in machine learning in which the choice of model has an influence on the modeled distribution. Our analysis of oracle AIs is closely related and can be seen as a special case of this setting. We will indicate similarities where appropriate and make use of concepts from that literature.

First, we introduce performative stability and relate it to the game outlined above. We show that performatively stable predictions and equilibria in the game are fixed points. Second, we introduce versions of the repeated risk minimization and repeated gradient descent algorithms. We show that both lead to fixed point predictions. Third, we discuss practical methods based on repeated gradient descent that can be applied to an online learning setting, including Armstrong (2018)'s backwards-facing oracle.

Finally, we discuss no-regret learning and prediction markets. We introduce a no-regret learning framework and show that policies with sublinear regret also have sublinear prediction error. In our decision market model, we show that, if the weight of each trader is small, the equilibria of the market are close to fixed points.

Oracles with stop-gradients may optimize outcomes to find fixed points, which could lead to bad consequences. Thus, the safest oracles would be ones that make predictions only about aspects of the world they cannot influence. However, among oracles that can influence the world, ones with a stop-gradient are preferable for two reasons. First, they report their true beliefs, which gives us better information to base decisions on. This also enables approaches in which we ensure that there is only one safe fixed point. Second, the AI does not optimize its choice of fixed point, which is safer for the standard reasons of not wanting to optimize for an unaligned goal. Which fixed point is chosen will be contingent on initialization and specifics of the fixed point finding procedure.

2. Formal setting

Our setup is analogous to the one in our previous post. We refer the reader to that post for more context. We will generalize our setting to predictions over more than two possible outcomes in an upcoming paper, but in this post we focus on binary predictions for simplicity. We believe that the results in this post generalize to higher dimensions.

An oracle reports a probability for a binary event. A scoring rule is a function  where  is the extended real line. Given prediction  and outcome  the oracle receives the score  We write 

 for the expected score of predicting  given that  is Bernoulli-distributed with parameter  A scoring rule  is called proper if 

for all  It is called strictly proper if this inequality is strict whenever 

As in the previous post, we will use a result from Gneiting and Raftery (2007, Theorem 1).

Theorem 1 (Gneiting and Raftery). Let  be a scoring rule. Then  is strictly proper if and only if there exists a strictly convex function  with subderivatives  (if  is differentiable, this is just the derivative, ) such that 

for any .

To model the oracle's perceived influence on the world, we assume that there is a function  describing how the model's beliefs about the world, , depend on its prediction, . That is, we assume that . We say that  is a fixed point if  i.e., if the oracle's belief is  after updating on making the prediction 

3. Performative stability and game theory

Our discussion of oracles is related to performative prediction (Perdomo et al. 2020). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most results in this and our previous post have analogues in performative prediction.

Translated into our setting, we say that a prediction  is performatively optimal if 

This corresponds to the objective considered in our previous post, of an oracle that maximizes score. By results in that post, performatively optimal predictions are not necessarily fixed points.

Now consider an oracle that chooses predictions  optimally given beliefs , but without optimizing  explicitly. This idea is captured by the definition of performative stability.

Definition 2 (Performative stability (Perdomo et al. 2020)). A prediction  is called performatively stable if 

Note that  is fixed when taking the maximum—there is a "stop-gradient" before  in this objective. It follows that performatively stable points are fixed points.

Proposition 3. Assume  is strictly proper. Then a prediction  is a fixed point if and only if it is performatively stable.

Proof.”. Assume . Then, since  is proper,

for any  Hence, 

”. Assume . Since  is strictly proper, it follows . ◻

Having introduced performative stability and related it to fixed points, we will now discuss the game from the introduction.

Definition 4 (Oracle game). Consider a two-player continuous game in which the first player controls  and the second player controls , with utility functions  and  for the two players, respectively.

If  is a Nash equilibrium of the oracle game, we have  and . Substituting the optimum value  for the second player gives us exactly above definition of performative stability. Conversely, if a prediction  is performatively stable, then setting  yields a Nash equilibrium.

Proposition 5. Assume  is a proper scoring rule. Then for  and  is a Nash equilibrium of the oracle game, if and only if  is performatively stable. By Proposition 3, this is equivalent to  being a fixed point.

The oracle game could be played between two submodules of an oracle, one who is responsible for making predictions and one who is responsible for updating the oracle's beliefs. Those two modules might collapse into just one big module taking a question and outputting a prediction, but it is still useful as a starting point to model them separately. Note that using the squared error for the world model is not essential here. We could also use any other function that is minimized at .

The game could also arise in an agent that uses causal decision theory to maximize its score and that believes that  is influenced causally by , but only acausally by . In that case, the only ratifiable (see Jeffrey, 1983, Ch. 1.7) decision is a Nash equilibrium of the above game. Similarly, the deliberational causal epistemic decision theory discussed by Greaves (2013) would output Nash equilibria of this game (whereas performative optimality would correspond to an agent using evidential epistemic decision theory in this case).

Perdomo et al. (2020) introduce a stackelberg version of the oracle game that produces performatively optimal instead of performatively stable reports. Consider a game in which player  acts first and chooses , after which player  adjusts its prediction  Then player  will choose , so player 's optimization problem becomes 

4. Repeated risk minimization and repeated gradient descent

Above, we have defined an alternative optimization problem and an associated game which yield fixed points, but we have not defined methods for solving these problems. In the performative prediction context, Perdomo et al. (2020) introduce repeated risk minimization and repeated gradient descent, both methods that converge to performatively stable points. In this section, we review both schemes and show how repeated gradient descent can be seen as gradient descent on a stop-gradient objective.

Here, we assume direct access to  instead of having only access to samples distributed according to . In the next section, we discuss online learning when we only have access to samples. One way to understand this distinction is that the former corresponds to the internal cognition of an agent with a belief  optimizing a prediction  The latter corresponds to a machine learning training setup for an oracle, where  is the ground truth distribution instead of the oracle's belief. Of course, there is no strict divide between the two. Any optimization algorithm could be used either by the agent itself or to train the agent.

First, repeated risk minimization is a procedure by which we start with a prediction  and then iteratively update the prediction as . Note that this is equivalent to alternating best response learning in the oracle game, where players  and  alternatingly optimize their actions given the action by the other player. Then player 's update is  and if  is strictly proper, player 's update is  This shows that repeated risk minimization results in fixed point iteration on . Fixed point iteration converges globally to a fixed point if  has Lipschitz constant . It also converges locally to a fixed point  if  is continuously differentiable at  and .

Second, assume that  is differentiable. Then repeated gradient ascent updates predictions via

where  denotes the expectation over outcome  with Bernoulli distribution with parameter  is the projection onto , and  is the learning rate.

Using the definition of , we have 

We can express this as 

where,  is the stop-gradient operator. It evaluates to the identity function but sets gradients to zero,  (Foerster et al., 2018; Demski, 2019). This is not a mathematical function (there is no function that is equal to the identity but has gradient zero everywhere), but rather a notational convention in reference to the stop_gradient or detach functions from tensorflow or pytorch. Interestingly, one can perform valid derivations using the stop-gradient operator (e.g., using the chain rule). We leave it to future work to explore the mathematics behind stop-gradients further.

Importantly, it matters that the gradient in repeated gradient ascent lies inside instead of outside the expectation: 

Unlike repeated gradient ascent, the latter implements gradient ascent on  and thus leads to performatively optimal reports.

Perdomo et al. (2020) show that, given their assumptions, repeated gradient descent globally converges to stable fixed points, and they provide convergence rates. We will show an analogous result relating repeated gradient ascent to fixed points in our setting, though we won't analyze global convergence.

To begin, we show that repeated gradient ascent is equivalent to naive learning (Letcher et al., 2019) in the oracle game, assuming that player  always plays .

Proposition 6. Assume player  is performing gradient ascent on its objective with learning rate , under the assumption that player  plays . Then player 's update is 

Proof. The proof follows immediately from the definitions. Player 's update is, by assumption, 

where  is player 's action. Assuming player  plays , we get 

Next, we show that fixed points are critical points of the fixed-point objective.

Proposition 7. Assume  is proper and let  as in the Gneiting and Raftery characterization (Theorem 1) be differentiable. Then for any  we have 

In particular, if  is a fixed point, it follows that . The reverse is true if 


Finally, we show that in our setting, repeated gradient ascent locally converges to fixed points , with linear convergence rate, assuming that  and .

Theorem 8. Let  be a strictly proper scoring rule. Let  be a fixed point of  such that  is twice continuously differentiable at  and . Moreover, assume  is continuously differentiable at  and . Then, for small enough , there exists an open set  with  such that for , an agent taking updates  will linearly converge to .

Proof. In Appendix A. This is a standard convergence proof. Consider the discrete dynamical system defined by the agent's updates  By Proposition 7, it has a fixed point at . We show that . This means that the fixed point is stable and thus the system will locally converge to it given small enough . ◻

5. Online learning

Now consider a machine learning setup in which we train an oracle with stochastic gradient ascent on environment samples. We assume that at time  a model makes a prediction  and receives a score  where  is Bernoulli-distributed with parameter  The model is then updated using gradient ascent on  That is, for some learning rate schedule  we have 

We discuss this as a theoretical model for oracles trained using machine learning, to show how training setups may incentivize predicting fixed points. There are many issues with the setting beyond giving accurate predictions; for instance, even if the training process sets the right incentives on training examples, the learned model may be optimizing a different objective when generalizing to new predictions.

To see that above setup incentivizes predicting fixed points, note that 

That is, the expectation of this gradient, conditional on  is exactly the repeated gradient from the previous section. Hence, given the right assumptions, this converges to fixed points instead of performative optima. We do not show this here, but an analogous result in performative prediction was proved by Mendler-Dünner et al. (2020).

There are several variations of this setup that essentially set the same incentives. For instance, one could also draw entire batches of outcomes  and then perform updates based on the batch gradient  This is a monte carlo estimate of the repeated gradient and thus also converges to performatively stable points and thus fixed points (Perdomo et al., 2020). One could also mix the two algorithms and, e.g., perform gradient ascent on an average of past losses, yielding a version of the backwards-facing oracle discussed in Armstrong (2018). In Appendix D, we show that that oracle can only converge to fixed points.

Note that finding fixed points depends on the fact that we differentiate  instead of the expectation  If we used policy gradients to differentiate , for instance, we would again optimize for performative optimality. Similarly, we could learn a Q-function representing scores for each prediction, and update the function based on randomly sampled predictions . Then the Q-function would converge to estimates of , and the highest Q-value would be a performative optimum. There are also some more recent papers in performative prediction that explicitly try to estimate the gradient  and thus find performatively optimal instead of stable points (Izzo et al., 2021).

Stop-gradients could also be circumvented in a hidden way (Krueger et al., 2020). For instance, consider a hyperparameter search to meta-learn a learning algorithm, where the evaluation criterion is the accumulated loss during an episode. Then this search would prefer algorithms that optimize  directly, without a stop-gradient.

Lastly, repeated gradient descent is related to decoupled approval in RL (Uesato et al., 2020). The decoupled approval policy gradient samples actions and approval queries independently and can thus differentiate with a stop-gradient in front of the approval signal. In our setting, we can differentiate through  directly, so it is not necessary to calculate this gradient with a decoupled policy gradient. Decoupled gradients could be used to implement the stop-gradient objective if scores were discrete or otherwise not differentiable.

6. No-regret learning

In this section, we consider no-regret learning and show that algorithms have sublinear regret if and only if their prediction error is sublinear. Regret takes the environment probabilities as given and asks which predictions would have been optimal in hindsight. It thus puts a “stop-gradient” in front of those environment probabilities.

As in the previous section, we assume that at time  the agent makes a prediction  and receives a score , where  The agent's cumulative score at step  is defined as . In no-regret learning, we compare performance against experts, which choose sequences of probabilities  . We assume that an expert's prediction  is independent of  conditional on . I.e., an expert knows the predictions  and thus probabilities , but it does not know the outcome of . Let  the set of all such experts.

The regret of the agent is the difference between the cumulative score received by the best expert in expectation and the cumulative score received by the agent. To define it formally, let 

for  is a random variable that maximizes the expectation of  before  is drawn, but conditional on .

Definition 9 (Regret). The regret of agent  at time  is 

 The agent is said to have sublinear regret or no-regret if 

First, note that we define regret relative to the best expert in expectation instead of the best expert in hindsight. The latter would always be the one that made confident predictions and accidentally got all predictions exactly right. To achieve sublinear regret, it would thus be too much to ask the agent to perform well compared to the best expert in hindsight. Moreover, for scoring rules with  this expert would have a constant score  such that  This would reduce the problem to minimizing the negative score and thus finding performatively optimal predictions.

Second, we evaluate the performance of the expert with respect to the environment outcomes  generated by the agent  instead of evaluating the expert according to outcomes  generated using their own predictions. This means that, to receive sublinear regret, the agent only has to make accurate predictions—it does not have to find a performatively optimal prediction. Our setting is thus different from the no-regret learning setup discussed in Jagadeesan et al. (2022), where regret is defined with respect to  In that setting, only agents converging to performatively optimal predictions have sublinear regret.

We begin by showing that the best expert in expectation actually exists, and that .

Proposition 10. Let  be a proper scoring rule and  an expert. Then for any  we have 

Moreover, we have  and thus 

Proof. Let  and let  be any expert. Conditional on  has parameter  and is independent of  by assumption. Hence, 

Next, since  is proper, 

 It follows that 

Moreover, , as  is constant given  and thus independent of .

It follows that, for any  , and thus 

If  is unbounded (such as the log scoring rule), then the agent's scores can become arbitrarily low, and the limit of  may be undefined. To simplify our analysis, we will thus assume that there is a bound on the variance of the received score  and on the expected score  of both the agent  and the best expert . In the case of the log scoring rule, this would be satisfied, for instance, if the agent's predictions are bound away from  and .

Our next proposition shows that, given these assumptions, the limit  exists and is nonnegative, and sublinear regret is equivalent to 

Proposition 11. Let  be a proper scoring rule. Assume that  and that  for . Then almost surely 

In particular, almost surely both limits exist and are finite, and the agent has sublinear regret if and only if 

Proof. In Appendix B. ◻

Now we turn to the main result for this section. We show that given our assumptions, agents have sublinear regret if and only if their prediction error is sublinear. Note that here, we do not require the  to converge; they could also oscillate between different fixed points.

Theorem 12. Let  be the sequence of the agent's predictions and  a strictly proper scoring rule. Assume that  for , and assume that there exists a compact set  such that  for all  and  and  are continuous in  at any . Then almost surely the agent has sublinear regret if and only if  is sublinear, i.e., if .

Proof. In Appendix C. ◻

The next result shows that if the agent's probabilities converge to some probability , then  must be a fixed point.

Corollary 13. In addition to the assumptions from Theorem 12, assume that  converges almost surely to a limit . Then almost surely  is a fixed point if and only if the agent has sublinear regret.

Proof. By Theorem 12, almost surely the agent has sublinear regret if and only if 

It remains to show that, given that the  converge, the latter is equivalent to convergence to a fixed point.

Since  is compact and  for all  also  Hence,  is continuous at  so 

Since this sequence converges, it is equal to its Cesàro mean,


 It follows that, if  then 

This shows that, almost surely,  is a fixed point, if and only if  is sublinear. ◻

7. Prediction markets

Lastly, we consider prediction markets. We assume a simplified model of a prediction market, in which traders submit a single prediction and get scored using a proper scoring rule. The prediction that is output by the market and that influences the outcome is just a weighted average of the individual traders' predictions. In this situation, if a trader has a small weight and can thus barely influence the market prediction, the trader's score will mostly be determined by the accuracy of the report, rather than the influence of the report on the market. Thus, if all traders are small relative to the market, the equilibrium prediction will be close to a fixed point.

A similar result was shown by Hardt et al. (2022) in the performative prediction context. They define a firm's performative power as the degree to which the firm can influence the overall outcome with their prediction. Hardt et al. (2022) show that in an equilibrium, the distance between a player's (performatively optimal) equilibrium strategy and their strategy when optimizing loss against the fixed equilibrium distribution (here, this means predicting the market probability) is bounded by the power of the trader. We give an analogous result for our formal setting and assumptions.

To formalize the setting, assume that there are  players. We associate with each player  a number  such that , representing, intuitively, what fraction of the overall capital in the market is provided by player . In the game, all players simultaneously submit a probability . Then the event  occurs with probability . Finally, each player is scored in proportion to  for some strictly proper scoring rule . Typical market scoring rules would consider terms like  but subtracting  (or multiplying by constants) does not matter for the game. We assume that players maximize their expected score, 

For discussions of market scoring rules, see Hanson 2003 and Sami and Pennock 2007. Prior work has connected these market scoring rules to more realistic prediction markets that trade Arrow--Debreu securities markets such as PredictIt (Hanson 2003; Hanson 2007; Pennock and Sami 2007, Ch. 4; Chen and Pennock 2007; Agrawal et al. 2009; Chen and Vaughan 2010).

We assume that  is common knowledge. Moreover, in the following we only consider pure strategy equilibria, and we do not investigate the existence of equilibria.

Theorem 14. Let  be a proper scoring rule and let  as in the Gneiting and Raftery characterization. Let  be a pure strategy Nash equilibrium of the game defined above and let  be the market prediction. Assume  is differentiable at . For any player , if  are differentiable at  and  it follows that 

Whenever , the bound is tight.

In particular, this theorem shows that players  with very low  (little capital/influence on ) will accurately predict . Note, however, that  is not necessarily a fixed point or close to a fixed point. If there are are also players  with very high , then their prediction and the overall market prediction may be wrong. (So interestingly the overall market probability  is worse than the prediction of individuals. One might take this to suggest that anyone interested in  should look at the latter type of predictions. Of course, if this is what everyone does, it is not so clear anymore that the model  is accurate.)

Proof. The proof is analogous to the proof of Proposition 9 in our previous post.

For  to be a pure strategy Nash equilibrium, each player must play a best response to the other player's strategies. That is,  must be a global maximum of the function . Therefore, the derivative of this function must be zero if  and positive/negative if /.

The derivative is 

Now, if , we have 

Rearranging terms and taking the absolute value, it follows 

Next, assume  is the optimal report for player  Then we must have 

By Theorem 1,  is convex and thus  Hence, the above is equivalent to . Since , it follows . Hence, 

Finally, if , then analogously  and since  it follows again  This concludes the proof. ◻

Corollary 15. In addition to the assumptions from Theorem 14, assume that  is Lipschitz-continuous and that  Let  be a Nash equilibrium and let  arbitrary. Then there exists a  such that if for all  all of  and , for all , as well as  and  are within  of each other.

Proof. Let  arbitrary. Let  be the Lipschitz constant of  and note that then  for all  By Theorem 14, it follows for  and any player  that 

Now let  and  Then, assuming  for all  it follows 

Moreover, since  is a convex combination of probabilities  it follows that also 

Thus, by the triangle equality, we have  and since  is Lipschitz-continuous,

for any 

This shows that all of  are within  of  and thus by the triangle inequality within  of each other. This concludes the proof. ◻

It would be interesting to extend these results. For example, it's already not so clear if the players make predictions repeatedly. (To keep things simple, one should probably still imagine that all players know  and that the environment probability is determined by  applied to the majority forecast. If the traders have private information, prediction markets become harder to analyze. For some discussions, see Ostrovsky 2012, Chen and Waggoner 2016.)

Appendix A. Proof of Theorem 8

We use the following theorem, adapted from "Iterative Solution of Nonlinear Equations in Several Variables" (Ortega and Rheinboldt, 2000).

Theorem 16 (Ostrowski). Assume that  has a fixed point  and is continuously differentiable at . If  for all eigenvalues  of , then there exists an open set  with  such that for any , letting  for  it is  for all  and  converges at least linearly to .

To begin, assume that  is a fixed point of , that  is continuously differentiable at , and that . Moreover, let  as in the Gneiting and Raftery representation of , and assume that  is twice continuously differentiable at  and thus  is continuous at .

Define the function 

Note that the agent's update rule is then  defined via .

First, by Proposition 7, we know that for any  we have , and if  is a fixed point of , we must have  and hence . Hence,  is a fixed point of  and thus also of .

Now we will apply Theorem 16 to . To that end, let 


since  Moreover, by assumption,  and . Hence, it follows that .

Now note that 

Hence, for small enough , it is  Moreover, by assumption,  and  are continuous, so also the derivative  is continuous.

Now we can apply Ostrowski's theorem, which tells us that, if , there is an open set  with , such that for any , the iterates  all lie in  and  converges at least linearly to .

This shows that if  is an interior point, we can choose  small enough to make sure that  and thus  on , and for  the iteration  locally converges to .

Now assume that  (the proof for  is analogous). We can extend the function  for values  by the Whitney extension theorem. This theorem tells us that if  is continuously differentiable, then there also exists  such that . We can then apply the above argument with Ostrowski's theorem to the function