Johannes Treutlein and Rubi Hudson worked on this post while participating in SERI MATS, under Evan Hubinger's and Leo Gao's mentorship respectively. We are grateful to Marius Hobbahn, Erik Jenner, and Adam Jermyn for useful discussions and feedback, and to Bastian Stern for pointing us to relevant related work.

Update 30 May 2023: We have now published a paper based on this post. In this paper, we also discuss in detail the relationship to the related literature on performative prediction.

Introduction

One issue with oracle AIs is that they might be able to influence the world with their predictions. For example, an AI predicting stock market prices might be able to influence whether people buy or sell stocks, and thus influence the outcome of its prediction. In such a situation, there is not one fixed ground truth distribution against which the AI's predictions may be evaluated. Instead, the chosen prediction can influence what the model believes about the world. We say that a prediction is a self-fulfilling prophecy or a fixed point if it is equal to the model's beliefs about the world, after the model makes that prediction.

If an AI has a fixed belief about the world, then optimizing a strictly proper scoring rule incentivizes it to output this belief (assuming the AI is inner aligned to this objective). In contrast, if the AI can influence the world with its predictions, this opens up the possibility for it to manipulate the world to receive a higher score. For instance, if the AI optimizes the world to make it more predictable, this would be dangerous, since the most predictable worlds are lower entropy ones in which humans are more likely dead or controlled by a misaligned AI. Optimizing in the other direction and making the world as unpredictable as possible would presumably also not be desirable. If, instead, the AI selects one fixed point (of potentially many) at random, this would still involve some non-aligned optimization to find a fixed point, but it may be overall less dangerous.

In this post, we seek to better understand what optimizing a strictly proper scoring rule incentivizes, given that the AI's prediction can influence the world. Is it possible for such a rule to make the AI indifferent between different fixed points? Can we at least guarantee that the AI reports a fixed point if one exists, or provide some bounds on the distance of a prediction from a fixed point?

First, we show that strictly proper scoring rules incentivize choosing more extreme fixed points, i.e., fixed points that push probabilities to extreme values. In the case of symmetric scoring rules, this amounts to making the world more predictable and choosing the lowest entropy fixed point. However, in general, scoring rules can incentivize making specific outcomes more likely, even if this does not lead to the lowest entropy distribution.

Second, we show that an AI maximizing a strictly proper scoring rule will in general not report a fixed point, even if one exists and is unique. In fact, we show that under some reasonable distributions over worlds, optimal predictions are almost never fixed points. We then provide upper bounds for the inaccuracy of reported beliefs, and for the distance of predictions from fixed points. Based on these bounds, we develop scoring rules that make the bounds arbitrarily small.

Third, we perform numerical simulations using the quadratic and the logarithmic scoring rule, to show how the inaccuracy of predictions and the distance of predictions from fixed points depend on the oracle's perceived influence on the world via its prediction. If its beliefs about the world depend on its predictions affine-linearly with slope , then our bounds are tight, and inaccuracy under the log scoring rule scales roughly as . E.g., at , predictions can be off by up to .

Our results support the prior idea that an oracle will try to make the world more predictable to get a higher score. However, importantly, we show that it is not even clear whether an oracle would output a self-fulfilling prophecy in the first place: how close the oracle's prediction will be to a fixed point depends on facts about the world and about the scoring rule. The problem thus persist even if there is a unique fixed point or no fixed point at all. It would not be sufficient for safety, for instance, to build an oracle with a unique and safe fixed point.

In general, fixed points can lead to arbitrarily bad outcomes (e.g., Armstrong, 2019). Most prior work has thus tried to alleviate problems with fixed points, e.g., by making the oracle predict counterfactual worlds that it cannot influence (Armstrong and O'Rorke, 2018). We are not aware of any prior work on specifically the question of whether oracles would be incentivized to output fixed points at all. We would be interested in pointers to any related work.

Work on decision markets (e.g., Chen et al., 2011; Oesterheld and Conitzer, 2021) tries to set the right incentives for predictions that inform decisions. However, in the decision market setup, predictions only affect the world by influencing the principal's final decision, whereas we consider an arbitrary relationship between world and prediction. Unlike in our setup, in the decision market setup, there exist scoring rules that do incentivize perfectly accurate reports. There is also some literature on scoring rules discussing agents manipulating the world after making a prediction (e.g., Oka et al., 2014; Shi, Conitzer, and Guo, 2009), though to our knowledge the cases discussed do not involve agents influencing the world directly through their predictions.

A related topic in philosophy is epistemic decision theory (e.g., Greaves 2013). Greaves (2013) introduces several cases in which the world depends on the agent's credences and compares the verdicts of different epistemic decision theories (such as an evidential and a causal version). While some of Greaves' examples involve agents knowably adopting incorrect beliefs, they require joint beliefs over several propositions, and Greaves only considers individual examples. We instead consider only a single binary prediction and prove results for arbitrary scoring rules and relationships between predictions and beliefs.

Formal setup

Binary prediction setup

We focus on a simple case in which the oracle makes a prediction by outputting a probability  for a binary event. A scoring rule  takes in a prediction  and an outcome  and produces a score . For a given scoring rule, define , the expectation of  given that  is Bernoulli distributed with parameter . Here,  represents the model's actual belief,  its stated prediction, and  the model's expected score, which the model is trying to maximize.  is strictly proper iff for any given  has a unique maximum at .

Example 1 (Logarithmic scoring rule). The logarithmic scoring rule is defined as 

and . This is also the negative of the cross-entropy loss employed in training current large language models. One can show that it is strictly proper.

Example 2 (Quadratic scoring rule). Another strictly proper scoring rule is the quadratic score, defined as