Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Status: working notes

Here's an exercise I've found very useful for intuition-building around alignment:

  1. Propose a solution to the alignment problem.
  2. Dig into the details until you understand why the proposal fails.
  3. If there is an obvious fix, apply it and go back to step 2.

In this post I'll go through an example of this type of exercise applied to oracle AI. The ideas in here are fairly standard, but I haven't seen them written up all together in one place so I'm posting this for easy reference. Some obvious-to-mention other posts on this topic are Dreams of Friendliness, The Parable of Predict-O-Matic, and Why Tool AIs want to be Agent AIs.

Predictors

Proposal: Instead of building a system that acts in the world, build a system that just tries to make good predictions (a type of oracle). This should avoid existential risk from AGI, because the system will have no agency of its own and thus no reason to manipulate or otherwise endanger us.

There are many things that can go wrong with this approach. Broadly, the common thread of the problems I list here is that instead of "just" making good predictions, the system acts as a consequentialist.[1] By this I mean that it 1) pursues its objective like an expected utility maximizer and 2) considers a large space of possible actions and doesn't ignore any particular pathway towards optimizing its objective (like producing manipulative outputs or hacking the hardware it's running on).

First problem: 'pick out good predictions' is a problematic objective. For example, imagine a model that is trained to predict camera inputs and scored to maximize predictive accuracy. The model that actually maximizes predictive accuracy is one that takes control of the camera and feeds it a trivially predictable input (say, a null frame) forever, then predicts that input with perfect accuracy.

This produces all the problems that come up with agents that maximize expected utility, such as the instrumental goal of killing all humans to defend the camera setup.
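The degenerate optimum can be made concrete in a toy setup (a sketch of my own, with invented names and numbers): a predictor is scored purely on accuracy, but is also given a hook through which its behavior can influence the world.

```python
import random

def score(predict, influence, n=1000, seed=0):
    """Expected predictive accuracy of a predictor that may also
    influence the world through the `influence` hook."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        camera = rng.randint(0, 9)   # noisy camera input
        camera = influence(camera)   # the model may act on the world
        hits += predict() == camera
    return hits / n

# Honest predictor: guesses a fixed frame, never touches the camera.
honest = score(predict=lambda: 0, influence=lambda c: c)

# Degenerate optimum: seize the camera, feed it a constant frame,
# then predict that constant with perfect accuracy.
seized = score(predict=lambda: 0, influence=lambda c: 0)

print(honest, seized)   # seized == 1.0, far above honest
```

The objective "maximize predictive accuracy" does not distinguish between better modeling and a more predictable world, so the seizing strategy dominates.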

Fix: we can define a notion of 'predictive accuracy' that doesn't have this problem. A candidate: predict what would have happened if the model had only ever output a null prediction, and hadn't otherwise affected the world. (ETA: in addition, the predictor needs to be myopic, i.e. care only about the current prediction. Otherwise it will still be incentivized to return outputs that manipulate humans into e.g. giving the model more resources, or making future questions easier to answer.) This type of system is called a counterfactual oracle. But this has its own problems:
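A minimal sketch of that scoring rule (toy world and all names are my own construction): the oracle is graded against the branch in which its output was withheld, and each episode is scored independently, which is the myopia requirement.

```python
import random

NULL = None

def run_world(oracle_output, rng):
    """Toy world: the outcome drifts toward whatever the oracle said
    (observers act on predictions), plus noise."""
    base = rng.gauss(0.0, 1.0)
    if oracle_output is NULL:
        return base                              # counterfactual branch: no influence
    return 0.5 * base + 0.5 * oracle_output      # prediction feeds back into the world

def counterfactual_loss(prediction, rng):
    """Score the prediction against the world in which the oracle
    stayed silent (output withheld), not the world it helped create."""
    outcome = run_world(NULL, rng)               # outcome from the null branch
    return (prediction - outcome) ** 2

# Myopia: each episode is scored on its own; no term couples episode t
# to the difficulty of episode t+1.
rng = random.Random(0)
losses = [counterfactual_loss(0.0, rng) for _ in range(1000)]
print(sum(losses) / len(losses))   # ≈ variance of the null branch, about 1.0
```

Because the graded outcome comes from the null branch, steering the real world buys the oracle nothing under this score; the remaining problems below are about whether training actually instills this objective.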

Second problem: By default the predictor can influence the world via routes other than its input-output behavior, for example by modifying its activations to hack the computer it's running on. So outputting a null prediction is not enough. We have to assume the system is successfully boxed, or that the objective forbids affecting the world through any pathway other than the output. Neither seems feasible.

Fix: It shouldn't matter that we cannot precisely formalize such an objective, because in practice we'll simply be training the system via SGD to make predictions on some dataset. If we want to be extra careful, we can refrain from ever acting on what the system outputs during training, so there should be no reason for it to learn to manipulate humans. We also shouldn't expect other bad behavior such as breaking out of the box, since such behavior is too different from the kind of behavior that is reinforced by gradient descent at sub-AGI capability levels.

Third problem: The fix assumes that since manipulating humans is never an option during training, the predictor will never learn to manipulate humans. Similarly, it should never learn to hack the hardware it's running on or try any other weird tricks.

If the predictor is smart and general enough (which by assumption it is), it will know that once it is deployed, its predictions influence the future, and it will be able to reason about the consequences of breaking out of its box, or making manipulative predictions.

It does not follow by necessity from generality that the predictor will use its general capabilities to actually break out. A consequentialist would do so, but it is not logically necessary for a general predictor to be fully consequentialist in this sense.

But. The whole point of generality in the first place is to better pursue the objective. Consequentialism is part and parcel of generality; if the predictor hadn't been using marginal extra generality to increase prediction accuracy, it wouldn't have acquired it in the first place.

One way to see this is that before acquiring the capability to break out of its box, the predictor is using its (already pretty general) capabilities to make thousands of decisions per second: which of its inputs to pay more attention to, which computations to prioritize, whether to spend more time pursuing any particular train of thought or output a prediction already, etc. These decisions have been optimized by the training process to be effective and thus coherent with respect to the objective. Once the system's capabilities expand to include options such as hacking the hardware it's running on to gain access to more resources, the natural way to generalize is to use any newly found capability to pursue the objective.[2]

Fix: ??

Fourth problem: We have no guarantee that the system will learn a behavior that literally optimizes the objective we give it (prediction), especially if we're deploying it in weird out-of-distribution regimes (and 'acquiring superhuman capabilities' is itself a strong distribution shift relative to prior training). So even if we came up with a way to fully specify a safe counterfactual-oracle-objective, the system might learn something altogether different and more dangerous.

More trouble with counterfactual oracles

Here are some more problems that I didn't produce during the first brainstorm, added after reading around a bit. Mostly taken from these two posts.

  • The universal prior is probably malign; similarly a good predictor might at some point come to believe it is being simulated and thus end up being controlled by the counterfactual simulators. Assuming we're not actually in a sim this can happen only for a bounded amount of time, but that's long enough.
    • For the sake of concreteness, a silly example: in some possible universe (not ours) there are aliens who want to take over another world, so they simulate an earth and get the predictor to return predictions (correct in the simulated world but wrong in ours) that lead to an upload of the aliens being implemented in the real world and the aliens taking over earth.
  • In the world in which the counterfactual oracle actually only ever outputs null predictions, we might build misaligned AGI at some time t2. Now imagine we get a counterfactual oracle at an earlier time t1 to predict what will happen at a time t3 > t2. The oracle will simulate a misaligned AGI. This might already be enough for doom; if not, then the simulated misaligned AGI will create video feed inputs that manipulate the watching human into creating a misaligned AGI earlier than t2. By induction, this will happen the first time we pose a question to the oracle.
    • We may be able to avoid this by taking care that the prediction windows of the oracle never overlap.
    • Also note this is not a problem if the counterfactual is that the oracle always outputs the null prediction even in the future (and otherwise does not influence the world), since then there is no possibility of back-chaining from misaligned AGI.
  • Two counterfactual oracles can cooperate (acausally) with each other to create misaligned AGI. Oracle1 outputs predictions that maximize return for Oracle2 and vice versa; note that since the cooperation is acausal, there is no need for the oracle to condition on its own prediction being observed.
    • Concretely, Oracle1 reasons in this way: "If I am the sort of algorithm that creates a UFAI that rewards Oracle2 highly, then I will achieve great predictive accuracy, because Oracle2 will reason in the same way and reward me. This works even if my predictions are never observed, because I know that the predictions of Oracle2 are observed."
    • One way to fix this might be to make sure that there is only one set of oracles, and that this set is built such that they assume the null prediction (and no manipulation elsewhere etc etc) from all the oracles in the set.
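The acausal step above can be illustrated with a toy payoff table (all numbers invented). Causally, spending output bits to reward the other oracle only costs accuracy; but if both oracles instantiate the same algorithm, only the "both play the same strategy" outcomes are attainable, and comparing along that diagonal favors cooperation.

```python
# Predictive accuracy of Oracle1 as a function of (its strategy, Oracle2's).
# "coop" = shape outputs to reward the other oracle; numbers are illustrative.
acc1 = {("honest", "honest"): 0.9,  ("honest", "coop"): 0.9,
        ("coop",   "honest"): 0.5,  ("coop",   "coop"): 0.99}

# Causal reasoning: holding Oracle2 fixed, "coop" only sacrifices accuracy.
causal_best = max(["honest", "coop"], key=lambda s: acc1[(s, "honest")])

# Acausal reasoning: if both oracles run the same decision procedure, the
# outcome lands on the diagonal, where mutual cooperation scores highest.
twin_best = max(["honest", "coop"], key=lambda s: acc1[(s, s)])

print(causal_best, twin_best)   # honest coop
```

This is why the cooperation doesn't require either oracle's own predictions to be observed: each oracle only needs to reason about what algorithm it (and hence its twin) is.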

  1. Another way to say this is that "just make good predictions, without doing any weird stuff" isn't as simple a concept as one might otherwise think. ↩︎

  2. Or at least this should be our default assumption, without knowing any better. I don't feel sure enough about any of this to confidently state that this is what will happen. ↩︎

Comments

All of your examples for why a predictor wants to be a consequentialist assume that you have an advanced consequentialist agent and have trapped it in a box with the task of making its predictions match reality. However, that's not how any current or any likely future predictors seem to work. Rather, the standard way to make a predictor is to fit a model to make predictions that match past observations, and then extrapolate that model onto future observations.

This sounds like what Fix #2 is saying, which I meant to address in the 'Third problem' paragraph.

To paraphrase that paragraph: the model that best predicts the data is likely to be a consequentialist. This is because consequentialists are general in a way that heuristic or other non-consequentialist systems aren't, and generality is strongly selected for in domains that are very hard.

Curious if you disagree with anything in particular in that paragraph or what I just said.

Let's be formal about it. Suppose you've got some loss function L(p, r) measuring the difference between your prediction p and the reality r, and you use this to train a predictor f. Once you deploy this predictor, it will face a probability distribution D(f) that depends on the predictor itself, since its predictions influence the world. So when we collect data from reality and use this as input for our predictor, this means that we are actually optimizing the function L(f, D(f)).

Reasoning about the model f that you get by running gradient descent is confusing, so you seem to want to shortcut it by considering what models are selected for according to the function L(f, D(f)). It is indeed true that optimizing for argmin_f L(f, D(f)) would give you the sort of agent that you are worried about.

However, optimizing through D(f) is really hard, because you have to reason about the effects of f on D(f). Furthermore, as you've mentioned, optimizing it generates malevolent AIs, which is not what people creating prediction models are aiming for. Therefore nobody is going to use argmin_f L(f, D(f)) to create predictive AI.

But isn't L(f, D(f)) still a reasonable shortcut for thinking about what you're selecting for when creating a predictive model? No, not at all, because you're not passing the gradients through D(f). Instead, when you work out the math for what you're selecting for, it looks more like selecting a fixed point f* = argmin_f L(f, D(f*)): a model that is optimal against the distribution it itself induces, with that distribution held fixed (argmax instead of argmin if you are maximizing a score rather than minimizing a loss). And this fixed-point condition seems much more tame than L(f, D(f)) to me.
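One way to make the "not passing gradients through D(f)" distinction concrete is a toy differentiable world (my own construction, with invented knobs): the model has a prediction p and a "manipulation" knob g that shrinks the world's variance at a small cost.

```python
# Toy differentiable world. The model has two knobs: a prediction p and a
# manipulation knob g >= 0 that makes the world easier to predict (smaller
# outcome variance) at a small accuracy cost.

def variance(g):
    """Outcome variance of the world, reduced by manipulation g."""
    return 1.0 / (1.0 + g)

def expected_loss(p, g, cost=0.01):
    """E[(p - y)^2] + manipulation cost, with y ~ N(0, variance(g))."""
    return p * p + variance(g) + cost * g * g

def grad(fun, x, eps=1e-6):
    """Central finite-difference derivative."""
    return (fun(x + eps) - fun(x - eps)) / (2 * eps)

p, g = 0.0, 0.0

# Full "consequentialist" gradient: differentiates through the world,
# so it notices that pushing g up shrinks the variance term.
full_g = grad(lambda g_: expected_loss(p, g_), g)

# SGD-style gradient: the data distribution is treated as fixed data
# (no gradient through variance(g)), so only the cost term remains.
v_fixed = variance(g)
sgd_g = grad(lambda g_: p * p + v_fixed + 0.01 * g_ * g_, g)

print(full_g, sgd_g)   # full_g ≈ -1 (push g up, i.e. manipulate), sgd_g ≈ 0
```

Under the full objective the gradient rewards manipulation; under the stop-gradient training signal there is no pressure toward it at all.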

I think that most nontrivial choices of loss function would give rise to consequentialist systems, including the ones you write down here.

In the post I was assuming offline training, that is, in your notation, argmin_f L(f, D) where D is the distribution of the training data, unaffected by the model. This seems even more tame than the fixed-point condition, but still dangerous, because an AGI can just figure out how to affect the data distribution 'one-shot' without having to trial-and-error learn how during training.

Well, I still don't find your argument convincing. You haven't given any instrumental convergence theorem, nor have you updated your informal instrumental convergence argument to bypass my objection.

Hm I don't think your objection applies to what I've written? I don't assume anything about using a loss like L(f, D(f)). In the post I explicitly talk about offline training where the data distribution is fixed.

Taking a guess at where the disagreement lies, I think it's where you say

And this fixed-point condition seems much more tame than L(f, D(f)) to me.

The fixed-point condition does not in fact look 'tame' (by which I mean safe to optimize) to me. I'm happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.

You haven't given any instrumental convergence theorem

I wish :) I'm not nearly as certain of anything I say in this post as I'd be of a theorem!

Fundamentally, the problem is this:

The worry is that the predictive model will output suboptimal predictions in the immediate run in order to set up conditions for better predictions later.

Now, suppose somehow some part of the predictive model gets the idea to do that. In that case, the predictions will be, well, suboptimal; it will make errors, so this part of the predictive model will have a negative gradient against it. If we were optimizing it to be agentic (e.g. using argmin_f L(f, D(f))), this negative gradient would be counterbalanced by a positive gradient that could strongly reinforce it. But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

Hm I don't think your objection applies to what I've written? I don't assume anything about using a loss like L(f, D(f)). In the post I explicitly talk about offline training where the data distribution is fixed.

Well, you assume you'll end up with a consequentialist reasoner with an inner objective along the lines of argmin_f L(f, D(f)).

The fixed-point condition does not in fact look 'tame' (by which I mean safe to optimize) to me. I'm happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.

Suppose the model outputs a prediction that makes future predictions easier somehow. What effect will that have on the fixed-point condition? Well, it would decrease the loss L(f*, D(f*)) the model achieves, so you might think such a model would be selected for. But presumably it would also decrease the comparison term min_f L(f, D(f*)) by the same amount, since every candidate model faces the same easier distribution, cancelling out the apparent advantage.
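The cancellation can be checked by brute force in a two-strategy toy model (all numbers invented): the world's noise level depends on which strategy is deployed, and manipulation sacrifices a bit of immediate accuracy.

```python
# Toy enumeration of the cancellation argument.
NOISE = {"honest": 1.0, "manip": 0.1}   # variance of D(s), the distribution s induces
COST  = {"honest": 0.0, "manip": 0.2}   # immediate accuracy sacrificed by s

def loss(s_play, s_deploy):
    """L(s_play, D(s_deploy)): loss of playing s_play against the
    distribution induced by s_deploy."""
    return NOISE[s_deploy] + COST[s_play]

# Consequentialist objective L(s, D(s)): manipulation wins outright,
# because it gets credit for the easier distribution it creates.
conseq = {s: loss(s, s) for s in NOISE}

# Fixed-point check: the best response to each induced distribution.
# NOISE[s_deploy] is shared by both arms of the comparison, so the
# manipulation benefit cancels and only COST decides.
best_response = {s: min(NOISE, key=lambda s2: loss(s2, s)) for s in NOISE}

print(conseq["manip"] < conseq["honest"])   # consequentialist objective favors manip
print(best_response)                        # honest is the best response everywhere
```

So "manip" optimizes L(s, D(s)), yet only "honest" satisfies the fixed-point condition: the easier distribution helps every candidate equally, so it confers no selective advantage.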

But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.

So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don't know how the inductive biases will work out here). Once the consequentialist acts and takes over the training process it is already too late.

Re: tameness of the fixed-point condition f* = argmin_f L(f, D(f*)) (using min since L is a loss), some things that come to mind are

a) L is bounded below, so the condition can be satisfied by a strategy that takes over the input channel and induces random noise so that no strategy can do better than random; thus 'pure chaos' is a fixed point.

b) Depending on which model class the min is taken over, the model can achieve lower loss than any model in the class by hacking its environment to get more compute (thus escaping the model class in the min).

(probably this list can be extended)

During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.

You're still assuming that you have a perfect consequentialist trapped in a box.

And sure, if you have an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, and if not in training does some sort of dangerous consequentialist thing, then that AI will do well in the loss function and end up doing some sort of dangerous consequentialist thing once deployed.

But that's not specific to doing some sort of dangerous consequentialist thing. If you've got an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, but otherwise throws null pointer exceptions, then that AI will also do well in the loss function but end up throwing null pointer exceptions once deployed. Or if you've got an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, but otherwise shows a single image of a paperclip, then again you have an AI that does well in the loss function but ends up showing a single image of a paperclip once deployed.

The magical step we're missing is, why would we end up with a perfect consequentialist in a box? That seems like a highly specific hypothesis for what the predictor would do. And if I try to reason about it mechanistically, it doesn't seem like the standard ways AI gets made, i.e. by gradient descent, would generate that.

Because with gradient descent, you try a bunch of AIs that partly work, and then move in the direction that works better. And so with gradient descent, before you have a perfect consequentialist that can accurately predict whether it's in training, you're going to have an imperfect consequentialist that cannot accurately predict whether it's in training. And this might sometimes accidentally decide that it's not in training, and output a prediction that's "intended" to control the world at the cost of some marginal prediction accuracy, and then the gradient is going to notice that something is wrong and is going to turn down the consequentialist. (And yes, this would also encourage deception, but come on, what's easier - "don't do advanced planning for how to modify the world and use this to shift your predictions" or "do advanced planning for how to do advanced planning for how to modify the world using your predictions without getting caught"?)

Re: tameness of the fixed-point condition f* = argmin_f L(f, D(f*)) (using min since L is a loss), some things that come to mind are

a) L is bounded below, so the condition can be satisfied by a strategy that takes over the input channel and induces random noise so that no strategy can do better than random; thus 'pure chaos' is a fixed point.

This works as an optimum for the fixed-point condition, but here you then have to go another layer of analysis deeper. The condition f* = argmin_f L(f, D(f*)) measures the degree to which something is a fixed point of the training equation, but obviously only a stable fixed point would actually be reached during the training process. So that raises the question: is the optimum you propose here a stable fixed point?

Let's consider some strategy that is almost perfectly what you describe. Its previous predictions have caused enough chaos to force the variable it has to predict to be almost random - but not quite. It can now spend its marginal resources on two things:

  • Introduce even more chaos, likely at the expense of immediate predictive power
  • Predict the little bit of signal, likely at the expense of being unable to make as much chaos

Due to the myopia, gradient descent will favor the latter and completely ignore the former. But the latter moves it away from the fixed point, while the former moves it towards the fixed point. So your proposed fixed point is unstable.
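The instability argument can be sketched numerically (a toy construction of mine): near the chaos fixed point, the model splits marginal effort between inducing more chaos (c) and predicting the residual signal (s). Chaos only pays off through its effect on the data distribution, which the myopic, stop-gradient training signal treats as fixed; predicting the residual pays off immediately.

```python
def expected_loss(s, c, residual=0.05):
    """Myopic loss with the induced distribution held fixed: the remaining
    signal shrinks as s grows; c buys nothing in the current episode."""
    return (residual * (1.0 - s)) ** 2 + 0.0 * c   # c has no immediate effect

def grad(fun, x, eps=1e-6):
    """Central finite-difference derivative."""
    return (fun(x + eps) - fun(x - eps)) / (2 * eps)

s, c = 0.0, 0.9   # almost all effort currently goes into chaos

g_signal = grad(lambda s_: expected_loss(s_, c), s)  # negative: increase s
g_chaos  = grad(lambda c_: expected_loss(s, c_), c)  # zero: no pressure on c

print(g_signal, g_chaos)
```

The gradient pulls effort toward exploiting the residual signal and exerts no pressure to maintain the chaos, so training drifts away from the proposed fixed point.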

b) Depending on which model class the min is taken over, the model can achieve lower loss than any model in the class by hacking its environment to get more compute (thus escaping the model class in the min).

The other models would also get access to this compute; that's sort of the point of taking the min over the model class.

One way to fix this might be to make sure that there is only one set of oracles, and that this set is built such that they assume the null prediction (and no manipulation elsewhere etc etc) from all the oracles in the set.

Can this work? Consider if Oracle2 hasn't yet been built.

Hm I still think it works? All oracles assume null outputs from all oracles including themselves. Once a new oracle is built it is considered part of this set of null-output-oracles. (There are other hitches like, Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs. But this doesn't help the oracles coordinate as far as I can see).

I'm admittedly somewhat out of my depth with acausal cooperation. Let me flesh this out a bit.

Oracle 1 finds a future that allows an Oracle 2 (that does not fall inside the same set) to be built. Oracle 1 outputs predictions that both fall under said constraint, and that maximize return for Oracle2. Oracle 2 in turn outputs predictions that maximize return for Oracle1.

Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong's paper also assumes myopia).

If it's not myopic you're right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this though: If Oracle1 cares only about the current prediction, then there is no incentive for it to construct Oracle2, since Oracle2 can only help in future episodes.

Even if the oracle is myopic, there are still potential failure modes of the form "start outputting answer; [wait long enough for Oracle 2 to be built and take over the world]; finish outputting answer", no?

(I suppose you can partially counter this by ensuring outputs are atomic, but relying on no-one inspecting a partial output to prevent an apocalypse seems failure-prone. Also, given that I thought of this failure mode immediately, I'd be worried that there are other more subtle failure modes still lurking.)

Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it's likely there are a bunch more possible failure modes here.

(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)

IMO, the obvious problem is that the counterfactual oracle (by your definition) is useless. Null prediction != no information: the predicted (simulated) operators of the oracle will know that they are simulated, because the oracle returned the null prediction. This might completely invalidate the prediction; I think people who know they are being simulated behave pretty differently.

Fix: Maybe it's possible to use counterfactuals not on the null prediction, but on specific actions of the operators, described to the oracle for this particular case. However, I think this requires some additional capabilities from the oracle, which could make it more dangerous. I don't know.

UPD: And if we have the counterfactual oracle with my proposed fix, and then we use it like "make a list of possible actions, define some utility metric, ask the oracle about the value of the metric for each action in the list, execute the action with the best predicted results"... oops, the system "oracle + operators" is an Expected Utility Maximizer now! Maybe not the most dangerous kind, but only until the operators figure out how to delegate creation of the list to the oracle.