[UPDATE: I have concluded that the argument in this post is wrong. In particular, consider a generative model. Say the generative model has a choice between two possible 'fixed point' predictions A and B, and currently assigns A 70% probability and B 30% probability. Then the target distribution it is trying to match is A 70% / B 30%, so it will just stay like that forever (or drift, likely increasing the proportion of A). This is true even if B is easier to obtain good predictions for -- the model will shift from "70% crappy model of A / 30% good model of B" --> "70% slightly better model of A / 30% good model of B". It won't increase the fraction of B.

In general this means that the model should converge to a distribution of fixed points corresponding to the learning bias of the model -- 'simpler' fixed points will have higher weight. This might end up looking kind of weird anyway, but it won't perform optimization in the sense I described below.]
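One way to see why the mixture weights stay put is the following sketch. The notation is an assumption made for illustration: $p_\theta(i)$ is the probability the model assigns to fixed point $i \in \{A, B\}$, and the training targets $q$ are samples from the model's own current distribution, treated as fixed data with no gradient flowing through the sampling step. Under those assumptions the expected gradient of the log loss with respect to the mixture weights vanishes:

$$
\nabla_\theta \Big( -\sum_i q_i \log p_\theta(i) \Big) \Big|_{q = p_\theta}
= -\sum_i p_\theta(i) \, \frac{\nabla_\theta p_\theta(i)}{p_\theta(i)}
= -\nabla_\theta \sum_i p_\theta(i)
= 0 .
$$

The only remaining gradient signal is the one that improves the model's predictions within each branch, which matches the "slightly better model of A" picture: branch quality changes, branch proportions do not (apart from drift due to sampling noise).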

In machine learning, we can make the distinction between predictive and agent-like systems. Predictive systems include classifiers or language models. Agent-like systems are the domain of reinforcement learning and include AlphaZero, OpenAI Five, etc. While predictive systems are passive, modelling a relationship between input and output, agent-like systems perform optimization and planning.

It's well-known around here that agent-like systems trained to optimize a given objective can exhibit behavior unexpected by the agent's creators. It is also known that systems trained to optimize one objective can end up optimizing for another, because optimization can spawn sub-agents with different objectives. Here I present another type of unexpected optimization: in realistic settings, systems that are trained purely for prediction can end up behaving like agents.

Here's how it works. Say we are training a predictive model on video input. Our system is connected to a video camera in some rich environment, such as an AI lab. It receives inputs from this camera, and outputs a probability distribution over future inputs (using something like a VAE, for instance). We train it to minimize the divergence between its predictions and the actual future inputs, exponentially decaying the loss for inputs farther in the future.
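As a rough sketch of what such an objective could look like (the symbols are assumptions, not part of the setup above: $x_t$ is the camera input at time $t$, $p_\theta$ is the model's predictive distribution, $\gamma \in (0, 1)$ is the decay rate, and log loss stands in for the divergence):

$$
\mathcal{L}_t(\theta) = \sum_{k=1}^{\infty} \gamma^{\,k-1} \Big( -\log p_\theta\big(x_{t+k} \mid x_{\le t}\big) \Big) .
$$

Smaller values of $\gamma$ make the predictor care mostly about the next few frames; values close to 1 put more weight on inputs far in the future.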

Because the predictor is embedded in the environment, its predictions are not just predictions; they affect the future dynamics of the environment. For example, if the predictor is very powerful, the AI researchers could use it to predict how a given research direction or hiring decision will turn out, by conditioning the model on making that decision. Then their future actions will depend on what the model predicts.

If the AI system is powerful enough, it will learn this; its model of the environment will include its own predictions. For it to obtain an accurate prediction, it must output a fixed-point: a prediction about future inputs which, when instantiated in the environment, causes that very prediction to come about. The theory of reflective oracles implies that such (randomized) fixed points must exist; if our model is powerful enough, it will be able to find them.

The capacity for agency arises because, in a complex environment, there will be multiple possible fixed-points. It's quite likely that these fixed-points will differ in how the predictor is scored, either due to inherent randomness, logical uncertainty, or computational intractability (predictors could be powerfully superhuman while still being logically uncertain and computationally limited). Then the predictor will output the fixed-point on which it scores the best.

As a simple example, imagine a dispute between two coworkers, Alice and Bob, in the AI lab; each suspects the other of plotting against them. If Bob is paranoid, he could check the predictions of the AI system to see what Alice will do. If the AI system predicts that Alice will publicly denounce Bob, this could confirm his suspicions and cause him to spread rumors about Alice, leading her to publicly denounce him. Or, if the AI system predicts that Alice will support Bob on a key issue, Bob could conclude he was wrong all along and try to reconcile with Alice, leading her to support him on a key issue. The AI system will prefer to output whichever branch of the prediction is simpler to predict.

A more extreme example. If the predictor is VERY superhuman, it could learn an adversarial fixed-point for humans: a speech which, when humans hear it, causes them to repeat the speech, then commit suicide, or otherwise act in a very predictable manner. It's not clear if such a speech exists; but more broadly, in a complex environment, the set of fixed-points is probably very large. Optimizing over that set can produce extreme outcomes.

These same problems could arise within AI systems which use predictors as a component, like this system, which contains a predictive model optimized for predictive accuracy, and a policy network optimizing for some objective. The policy network's decisions will depend on what the predictive model says is likely to happen, influencing what the predictive model ends up seeing. The predictor could then steer the overall system towards more predictable fixed-points, even if those fixed-points obtain less value on the objective the policy network is supposed to be optimizing for.

To some extent, this seems to undermine the orthogonality thesis; an arbitrary predictor can't just be plugged into an arbitrary objective and be counted on to optimize well. From the perspective of the system, it will be trying to optimize its objective as well as it can given its beliefs; but those 'beliefs' may themselves be optimized for something quite different. In self-referential systems, beliefs and decisions can't be easily separated.

Whether or not this happens depends on the learning algorithm. Let's assume an IID setting. Then an algorithm that evaluates many random parameter settings and chooses the one that gives the best performance would have this effect. But a gradient-based learning algorithm wouldn't necessarily, since it only aims to improve its predictions locally (so what you say in the ETA is more accurate, **in this case**, I think).
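A toy illustration of the difference, under heavy simplifying assumptions: the predictor has a single parameter `w` (the probability of announcing fixed point A), whatever it announces comes true, and the per-branch losses `LOSS_A` and `LOSS_B` are made-up constants. Random search scores whole parameter settings by realised loss, while the supervised gradient treats the sampled branch as fixed data.

```python
import random

LOSS_A, LOSS_B = 1.0, 0.2   # hypothetical per-branch prediction losses

def expected_loss(w):
    """Expected loss when A is announced with probability w and
    whichever branch is announced comes true (self-fulfilling)."""
    return w * LOSS_A + (1 - w) * LOSS_B

def random_search(n_candidates, rng):
    """Evaluate random parameter settings and keep the best-scoring one."""
    candidates = [rng.random() for _ in range(n_candidates)]
    return min(candidates, key=expected_loss)

def mean_supervised_gradient(w, n_samples, rng):
    """Average gradient of -log p(y) w.r.t. the logit of w, with y ~ Bernoulli(w).
    The branch label y is treated as data: no gradient through the sampling."""
    total = 0.0
    for _ in range(n_samples):
        y = 1.0 if rng.random() < w else 0.0
        total += w - y   # derivative of Bernoulli cross-entropy w.r.t. the logit
    return total / n_samples

rng = random.Random(0)
w_search = random_search(1000, rng)                  # pushed towards the easier branch B
g_supervised = mean_supervised_gradient(0.7, 100_000, rng)  # close to zero on average
```

In this sketch random search selects a `w` near 0 (almost always announcing the easier branch B), whereas the averaged supervised gradient at `w = 0.7` is approximately zero, so the mixture weight has no systematic pull in either direction.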

Also, I just wanted to mention that Stuart Armstrong's paper "Good and safe uses of AI oracles" discusses self-fulfilling prophecies as well; Stuart provides a way of training a predictor that won't fall victim to such effects (just don't reveal its predictions when training). But then it also fails to account for the effect its predictions actually have, which can be a source of irreducible error... The example is a (future) stock-price predictor: making its predictions public makes them self-refuting to some extent, as they influence market actors' decisions.

Yeah, if you train the algorithm by random sampling, the effect I described will take place. The same thing will happen if you use an RL algorithm to update the parameters instead of an unsupervised learning algorithm (though it seems willfully perverse to do so -- you're throwing away a lot of the structure of the problem by doing this, so training will be much slower).

I also just found an old comment which makes the exact same argument I made here. (Though it now seems to me that argument is not necessarily correct!)

Reflective oracles won't automatically do this. They won't minimize log loss or any other cost function. For a given situation, there can be multiple reflective oracles; for example, in a universe M:=O(M,1/2) (i.e. the universe asks the reflective oracle if it equals 1 with probability greater or less than 50%), there are three reflective oracles: P(M)∈{0,1/2,1}. There isn't any defined procedure for selecting which of these reflective oracles is the real one. A reflective oracle that says P(M)∈{0,1} will get a lower average log loss than one that says P(M)=1/2, however these are all considered to be reflective oracles.
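A small sketch that checks this example directly. It assumes the usual reading of the oracle query: O(M, p) answers 1 if P(M=1) > p, answers 0 if P(M=1) < p, and may randomize when the two are equal. The function names are illustrative only.

```python
from fractions import Fraction
import math

def is_fixed_point(r, threshold=Fraction(1, 2)):
    """Is r = P(M = 1) self-consistent for the universe M := O(M, threshold)?"""
    if r > threshold:    # oracle must answer 1, so M = 1 with certainty
        return r == 1
    if r < threshold:    # oracle must answer 0, so M = 0 with certainty
        return r == 0
    return True          # at equality the oracle may randomize, so it can output 1 with probability r

def expected_log_loss(r):
    """Expected log loss (in nats) of predicting P(M = 1) = r when that is the true probability."""
    probs = (float(r), 1.0 - float(r))
    return -sum(p * math.log(p) for p in probs if p > 0)

candidates = [Fraction(k, 10) for k in range(11)]
fixed_points = [r for r in candidates if is_fixed_point(r)]
losses = {r: expected_log_loss(r) for r in fixed_points}
```

On this grid the consistent values are 0, 1/2 and 1. The two deterministic oracles have zero expected log loss, while the 1/2 oracle pays log 2 nats per prediction, yet all three satisfy the reflective-oracle condition.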

Is there a reason you think a reflective oracle (or equivalent) can't just be selected "arbitrarily", and will likely be selected to maximize some score? (In this example there's an issue in that the 1/2 reflective oracle is an unstable equilibrium, so natural ways of finding reflective oracles using gradient descent will be unlikely to find it, however it is possible to set up situations where gradient descent leads to reflective oracles with suboptimal Bayes score.)

My sense is that the simplest methods for finding a reflective oracle will do something similar to finding a correlated equilibrium using gradient descent on each player's strategy individually. This certainly does a kind of optimization, though since it's similar to a multiplayer game it won't correspond to global optimization like finding the reflective oracle with the lowest expected log loss. The kind of optimization it does more resembles "given my current reflective oracle, and the expected future states resulting from this, how should I adjust this oracle to better match this distribution of future states?"

(For more on natural methods for finding (correlated) reflective oracles, I recommend looking at lectures 17-18 of this course and this post on correlated reflective oracles.)

The gradient descent is not being done over the reflective oracles, it's being done over some general computational model like a neural net. Any highly-performing solution will necessarily look like a fixed-point-finding computation of some kind, due to the self-referential nature of the predictions. Then, since this fixed-point-finder is *internal* to the model, it will be optimized for log loss just like everything else in the model.

That is, the global optimization of the model is distinct from whatever internal optimization the fixed-point-finder uses to choose the reflective oracle. The global optimization will favor internal optimizers that produce fixed-points with good score. So while fixed-point-finders in general won't optimize for anything in particular, the one this model uses will.

I think the fixed point finder won't optimize the fixed point for minimizing expected log loss. I'm going to give a concrete algorithm and show that it doesn't exhibit this behavior. If you disagree, can you present an alternative algorithm?

Here's the algorithm. Start with some oracle (not a reflective oracle). Sample ~1000000 universes based on this oracle, getting 1000000 data points for what the reflective oracle outputs. Move the oracle 1% of the way from its current position towards the oracle that would answer queries correctly given the distribution over universes implied by the data points. Repeat this procedure a lot of times (~10,000). This procedure is similar to gradient descent.

Here's an example universe:

M:=if O(M,0.3)=1 then flip(0.9) else 0

Note the presence of two reflective oracles that are stable equilibria: one where P(O(M,0.3)=1)=0, and one where P(O(M,0.3)=1)=1. Notice that the first has lower expected log loss than the second.

Let's parameterize oracles by numbers in [0,1] representing P(O(M,0.3)=1) (since this is the only relevant query). Start with oracle 0.5. If we sample 1000000 universes, about 45% of them have outcome 1. So, based on these data points, P(M())=0.45, so the oracle based on these data points will say P(O(M,0.3)=1)=1, i.e. it is parameterized by 1. So we move our current oracle (0.5) 1% of the way towards the oracle 1, yielding oracle 0.505. We repeat this a bunch of times, eventually getting an oracle parameterized by a number very close to 1.
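The procedure can be simulated in a few lines. This is a sketch, not the exact experiment: the sample size and number of steps are reduced from the figures above so that it runs quickly, ties at exactly 0.3 are broken towards 0 (an arbitrary choice), and the function names are made up for illustration.

```python
import random

def sample_outcome_rate(q, n_universes, rng):
    """Fraction of sampled universes with M = 1 when P(O(M, 0.3) = 1) = q."""
    hits = 0
    for _ in range(n_universes):
        oracle_says_one = rng.random() < q
        if oracle_says_one and rng.random() < 0.9:   # flip(0.9)
            hits += 1
    return hits / n_universes

def correct_oracle(p_m, threshold=0.3):
    """Oracle parameter that answers O(M, 0.3) correctly if P(M = 1) = p_m."""
    return 1.0 if p_m > threshold else 0.0

def run(q0, steps=600, n_universes=2000, step_size=0.01, seed=0):
    """Repeatedly move the oracle 1% of the way towards the empirically correct oracle."""
    rng = random.Random(seed)
    q = q0
    for _ in range(steps):
        p_m = sample_outcome_rate(q, n_universes, rng)
        q += step_size * (correct_oracle(p_m) - q)
    return q

q_from_half = run(0.5)    # climbs towards 1
q_from_low = run(0.2)     # falls towards 0
```

Starting from 0.5 the parameter climbs towards 1, the equilibrium with the higher expected log loss. Starting below roughly 1/3 (where 0.9 times the parameter crosses 0.3) it falls towards 0 instead, so which equilibrium is reached depends on the starting point rather than on the score.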

So, this procedure yields an oracle with suboptimal expected log loss. It is not the case that the fixed point finder minimizes expected log loss. The neural net case is different, but not that much; it would give the same answer in this particular case, since the model can just be parameterized by a single real number.

Reflective oracles are a bit of a weird case because their 'loss' is more like a 0/1 loss than a log loss, so all of the minima are exactly the same (if we take a sample of 100000 universes to score them, the difference is merely incredibly small instead of 0). I was being a bit glib referencing them in the article; I had in mind something more like a model parameterizing a distribution over outputs, whose only influence on the world is via a random sample from this distribution. I think that such models should in general have fixed points for similar reasons, but am not sure. Regardless, these models will, I believe, favour fixed points whose distributions are easy to compute (but not fixed points with low entropy; that is, they will punish logical uncertainty but not intrinsic uncertainty). I'm planning to run some experiments with VAEs and post the results later.