Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


All of your examples for why a predictor wants to be a consequentialist assume that you have an advanced consequentialist agent and have trapped it in a box with the task to make its predictions match reality. However, that's not how any current or any likely future predictors seem to work. Rather, the standard way to make a predictor is to fit a model to make predictions that match past observations, and then extrapolate that model onto future observations.

This sounds like what Fix #2 is saying, meant to be addressed in the paragraph 'Third Problem'.

To paraphrase that paragraph: the model that best predicts the data is likely to be a consequentialist. This is because consequentialists are general in a way that heuristic or other non-consequentialist systems aren't, and generality is strongly selected for in domains that are very hard.

Curious if you disagree with anything in particular in that paragraph or what I just said.

Let's be formal about it. Suppose you've got some loss function $L(\hat{y}, y)$ measuring the difference between your prediction $\hat{y}$ and the reality $y$, and you use this to train a predictor $f_\theta$. Once you deploy this predictor, it will face a probability distribution $D_\theta$ that depends on the predictor itself. So when we collect data from reality and use this as input for our predictor, this means that at step $t$ we are actually optimizing the function $\theta \mapsto \mathbb{E}_{(x, y) \sim D_{\theta_t}}[L(f_\theta(x), y)]$ to obtain $\theta_{t+1}$.

Reasoning about the model $\theta_t$ that you get by increasing $t$ is confusing, so you seem to want to shortcut it by considering what models are selected for according to the function $L^*(\theta) = \mathbb{E}_{(x, y) \sim D_\theta}[L(f_\theta(x), y)]$. It is indeed true that optimizing for $L^*$ would give you the sort of agent that you are worried about.

However, optimizing through $L^*$ is really hard, because you have to reason about the effects of $\theta$ on $D_\theta$. Furthermore, as you've mentioned, optimizing it generates malevolent AIs, which is not what people creating prediction models are aiming for. Therefore nobody is going to use $L^*$ to create predictive AI.

But isn't $L^*$ still a reasonable shortcut for thinking about what you're selecting for when creating a predictive model? No, not at all, because you're not passing the gradients through $D_\theta$. Instead, when you work out the math for what you're selecting for, then it looks more like optimizing the loss function $\tilde{L}(\theta) = L^*(\theta) - \min_{\theta'} \mathbb{E}_{(x, y) \sim D_\theta}[L(f_{\theta'}(x), y)]$ (or the same with $\max_{\theta'}$, depending on whether you are maximizing or minimizing). And $\tilde{L}$ seems much more tame than $L^*$ to me.
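To make the distinction concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration (the scalar predictor, the constants `A`, `B`, `C`, and the particular way the prediction shifts the outcome); none of it comes from the post. It compares the loss that includes the distribution's dependence on the predictor with the "retraining gap" that repeated refitting on collected data drives to zero.

```python
# Toy performative-prediction setup (illustrative assumptions throughout).
# The "predictor" is a single number theta; publishing it shifts the outcome:
#   y ~ Normal(mean = A + B * theta, variance = 1 + C * theta**2)
A, B, C = 1.0, 0.5, 1.0

def mean_y(theta):
    return A + B * theta

def var_y(theta):
    return 1.0 + C * theta ** 2

def expected_loss(pred, dist_theta):
    """E[(pred - y)^2] when y is drawn from the distribution induced by dist_theta."""
    return (pred - mean_y(dist_theta)) ** 2 + var_y(dist_theta)

def agentic_loss(theta):
    """Loss with the distribution's dependence on theta included."""
    return expected_loss(theta, theta)

def retraining_gap(theta):
    """How much a freshly fitted predictor would beat theta on theta's own distribution."""
    best_response = mean_y(theta)  # minimiser of squared error on a fixed distribution
    return expected_loss(theta, theta) - expected_loss(best_response, theta)

def retrain(theta, steps=50):
    """Repeatedly refit on data collected under the current predictor."""
    for _ in range(steps):
        theta = mean_y(theta)
    return theta

grid = [i / 1000 for i in range(-3000, 3001)]
theta_fixed = retrain(0.0)                   # where repeated retraining settles
theta_agentic = min(grid, key=agentic_loss)  # where the through-the-distribution loss is lowest
```

In this toy, retraining settles where the gap is zero, while the through-the-distribution loss is lowest at a different prediction that makes the outcome less noisy. The two objectives select for different predictors, which is the point at issue.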

I think that most nontrivial choices of loss function would give rise to consequentialist systems, including the ones you write down here.

In the post I was assuming offline training, that is, in your notation, minimizing $\mathbb{E}_{(x, y) \sim D}[L(f_\theta(x), y)]$, where $D$ is the distribution of the training data unaffected by the model. This seems even more tame than $\tilde{L}$, but still dangerous because an AGI can just figure out how to affect the data distribution 'one-shot' without having to learn how by trial and error during training.

Well, I still don't find your argument convincing. You haven't given any instrumental convergence theorem, nor have you updated your informal instrumental convergence argument to bypass my objection.

Hm I don't think your objection applies to what I've written? I don't assume anything about using a loss like $L^*$. In the post I explicitly talk about offline training where the data distribution is fixed.

Taking a guess at where the disagreement lies, I think it's where you say

And $\tilde{L}$ seems much more tame than $L^*$ to me.

$\tilde{L}$ does not in fact look 'tame' (by which I mean safe to optimize) to me. I'm happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.

You haven't given any instrumental convergence theorem

I wish :) I'm not nearly as certain of anything I say in this post as I'd be of a theorem!

Fundamentally, the problem is this:

The worry is that the predictive model will output suboptimal predictions in the immediate run in order to set up conditions for better predictions later.

Now, suppose somehow some part of the predictive model gets the idea to do that. In that case, the predictions will be, well, suboptimal; it will make errors, so this part of the predictive model will have a negative gradient against it. If we were optimizing it to be agentic (e.g. using $L^*$), this negative gradient would be counterbalanced by a positive gradient that could strongly reinforce it. But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

Hm I don't think your objection applies to what I've written? I don't assume anything about using a loss like $L^*$. In the post I explicitly talk about offline training where the data distribution is fixed.

Well, you assume you'll end up with a consequentialist reasoner with an inner objective along the lines of $L^*$.

$\tilde{L}$ does not in fact look 'tame' (by which I mean safe to optimize) to me. I'm happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.

Suppose the model outputs a prediction that makes future predictions easier somehow. What effect will that have on $\tilde{L}$? Well, $\tilde{L}(\theta) = L^*(\theta) - \min_{\theta'} \mathbb{E}_{(x, y) \sim D_\theta}[L(f_{\theta'}(x), y)]$, and it may improve $L^*(\theta)$, so you might think it would be expected to improve $\tilde{L}(\theta)$. But presumably it would also improve $\min_{\theta'} \mathbb{E}_{(x, y) \sim D_\theta}[L(f_{\theta'}(x), y)]$ by a similar amount, cancelling out the improvement in $L^*(\theta)$.
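One way to spell out the cancellation, as a sketch under an added assumption rather than a derivation from the post: write $L^*(\theta)$ for the expected loss of the predictor on the distribution $D_\theta$ that it induces, and $\tilde{L}(\theta)$ for the gap between that and the best loss achievable on the same distribution. Assume (this is the hypothetical part) that an "easier" distribution lowers every predictor's expected loss by the same amount $\delta$. Then:

$$
\tilde{L}_{\text{new}}(\theta) = \bigl(L^*(\theta) - \delta\bigr) - \Bigl(\min_{\theta'} \mathbb{E}_{(x, y) \sim D_\theta}[L(f_{\theta'}(x), y)] - \delta\Bigr) = \tilde{L}(\theta)
$$

If the shift helps some predictors more than others, the cancellation is only approximate.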

But since we're not doing that, there's nothing to counteract the negative gradient that removes the inner optimizer.

During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.

So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don't know how the inductive biases will work out here). Once the consequentialist acts and takes over the training process it is already too late.

Re: tameness of $\tilde{L}$ (using min cause L is a loss), some things that come to mind are

a) $\tilde{L}$ is always at least zero, so it can be minimized by a strategy that takes over the input channel and induces random noise so that no strategy can do better than random, thus $\tilde{L} = 0$.

b) Depending on which model class the min is taken over, the model can get less than zero loss by hacking its environment to get more compute (thus escaping the model class in the min)

(probably this list can be extended)

During training, the inner optimizer has the same behavior as the benign model: while it's still dumb it just doesn't know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.

You're still assuming that you have a perfect consequentialist trapped in a box.

And sure, if you have an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, and if not in training does some sort of dangerous consequentialist thing, then that AI will do well in the loss function and end up doing some sort of dangerous consequentialist thing once deployed.

But that's not specific to doing some sort of dangerous consequentialist thing. If you've got an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, but otherwise throws null pointer exceptions, then that AI will also do well in the loss function but end up throwing null pointer exceptions once deployed. Or if you've got an AI that accurately guesses whether it's in training or not, and if in training performs predictions as intended, but otherwise shows a single image of a paperclip, then again you have an AI that does well in the loss function but ends up showing a single image of a paperclip once deployed.
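The point can be sketched in a few lines. This is a hypothetical illustration only: the policies, the `thinks_in_training` flag (which assumes perfect detection of training), and the stand-in prediction task are all made up for the example.

```python
# Three hypothetical policies that agree whenever they believe they are in training.
def honest_prediction(x):
    return 2 * x  # stand-in for "predict as intended"

def make_policy(deployed_behaviour):
    def policy(x, thinks_in_training):
        # Assumption: the policy is simply told whether it is in training.
        if thinks_in_training:
            return honest_prediction(x)
        return deployed_behaviour(x)
    return policy

policies = {
    "consequentialist": make_policy(lambda x: "manipulate the world"),
    "crasher": make_policy(lambda x: "NullPointerException"),
    "paperclip": make_policy(lambda x: "image of a paperclip"),
}

def training_loss(policy, data):
    """Squared error on the training set, where every policy predicts honestly."""
    return sum((policy(x, True) - y) ** 2 for x, y in data)

data = [(x, 2 * x) for x in range(5)]
losses = {name: training_loss(p, data) for name, p in policies.items()}
deployed = {name: p(3, False) for name, p in policies.items()}
```

All three policies get the same training loss, so the loss alone cannot tell them apart; what differs is only what they do once deployed.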

The magical step we're missing is, why would we end up with a perfect consequentialist in a box? That seems like a highly specific hypothesis for what the predictor would do. And if I try to reason about it mechanistically, it doesn't seem like the standard ways AI gets made, i.e. by gradient descent, would generate that.

Because with gradient descent, you try a bunch of AIs that partly work, and then move in the direction that works better. And so with gradient descent, before you have a perfect consequentialist that can accurately predict whether it's in training, you're going to have an imperfect consequentialist that cannot accurately predict whether it's in training. And this might sometimes accidentally decide that it's not in training, and output a prediction that's "intended" to control the world at the cost of some marginal prediction accuracy, and then the gradient is going to notice that something is wrong and is going to turn down the consequentialist. (And yes, this would also encourage deception, but come on, what's easier - "don't do advanced planning for how to modify the world and use this to shift your predictions" or "do advanced planning for how to do advanced planning for how to modify the world using your predictions without getting caught"?)

Re: tameness of $\tilde{L}$ (using min cause L is a loss), some things that come to mind are

a) $\tilde{L}$ is always at least zero, so it can be minimized by a strategy that takes over the input channel and induces random noise so that no strategy can do better than random, thus $\tilde{L} = 0$.

This works as an optimum for $\tilde{L}$, but here you then have to go for another layer of analysis. $\tilde{L}$ measures the degree to which something is a fixed point for the training equation, but obviously only a stable fixed point would actually be reached during the training process. So that raises the question, is the optimum you propose here a stable fixed point?

Let's consider some strategy that is almost perfectly what you describe. Its previous predictions have caused enough chaos to force the variable it has to predict to be almost random - but not quite. It can now spend its marginal resources on two things:

  • Introduce even more chaos, likely at the expense of immediate predictive power
  • Predict the little bit of signal, likely at the expense of being unable to make as much chaos

Due to the myopia, gradient descent will favor the latter and completely ignore the former. But the latter moves it away from the fixed point, while the former moves it towards the fixed point. So your proposed fixed point is unstable.
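A toy sketch of this stability argument follows. The functional forms are assumptions chosen for illustration: `r` is the share of marginal resources spent on predicting (the rest goes to chaos), the remaining signal is taken to equal `r`, and the per-episode loss is `s**2 * (1 - r)**2`. It shows the qualitative difference between a myopic update and one that credits the effect on the world, not anything about real training runs.

```python
def signal(r):
    """Assumed link: the less effort spent on chaos (larger r), the more signal remains."""
    return r

def myopic_grad(r):
    """d/dr of s^2 * (1 - r)^2 with the signal s held fixed (no gradient through the world)."""
    s = signal(r)
    return -2.0 * s ** 2 * (1.0 - r)

def full_grad(r):
    """d/dr of r^2 * (1 - r)^2, i.e. also crediting the effect of r on the signal."""
    return 2.0 * r * (1.0 - r) * (1.0 - 2.0 * r)

def descend(grad, r, lr=0.5, steps=2000):
    """Plain gradient descent on r, clipped to the interval [0, 1]."""
    for _ in range(steps):
        r = min(1.0, max(0.0, r - lr * grad(r)))
    return r

r_start = 0.05  # almost, but not quite, at the all-chaos point r = 0
r_myopic = descend(myopic_grad, r_start)
r_full = descend(full_grad, r_start)
```

Starting near the all-chaos point, the myopic update drifts away from it toward pure prediction, while the update that passes gradients through the world falls back into it. In this toy the chaos point is a fixed point of both, but only stable for the non-myopic one.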

b) Depending on which model class the min is taken over, the model can get less than zero loss by hacking its environment to get more compute (thus escaping the model class in the min)

The other models would also get access to this compute, that's sort of the point of the model.

TLW (2y)

One way to fix this might be to make sure that there is only one set of oracles, and that this set is built such that they assume the null prediction (and no manipulation elsewhere etc etc) from all the oracles in the set.

Can this work? Consider if Oracle2 hasn't yet been built.

Hm I still think it works? All oracles assume null outputs from all oracles including themselves. Once a new oracle is built it is considered part of this set of null-output-oracles. (There are other hitches like, Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs. But this doesn't help the oracles coordinate as far as I can see).

TLW (2y)

I'm admittedly somewhat out of my depth with acausal cooperation. Let me flesh this out a bit.

Oracle 1 finds a future that allows an Oracle 2 (that does not fall inside the same set) to be built. Oracle 1 outputs predictions that both fall under said constraint, and that maximize return for Oracle2. Oracle 2 in turn outputs predictions that maximize return for Oracle1.

Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong's paper also assumes myopia).

If it's not myopic you're right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this though: If Oracle1 cares only about the current prediction, then there is no incentive for it to construct Oracle2, since Oracle2 can only help in future episodes.

TLW (2y)

Even if the oracle is myopic, there are still potential failure modes of the form "start outputting answer; [wait long enough for Oracle 2 to be built and take over the world]; finish outputting answer", no?

(I suppose you can partially counter this by ensuring outputs are atomic, but relying on no-one inspecting a partial output to prevent an apocalypse seems failure-prone. Also, given that I thought of this failure mode immediately, I'd be worried that there are other more subtle failure modes still lurking.)

Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it's likely there are a bunch more possible failure modes here.

(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)

IMO, the obvious problem is that the counterfactual oracle (by your definition) is useless. Null prediction != no information. The predicted (simulated) operators of the oracle will know that they are simulated, because the oracle returned the null prediction. This might completely invalidate the prediction. I think people who know they are being simulated behave pretty differently.

Fix: Maybe it's possible to use the counterfactuals not with a null prediction, but with specific actions of the operators, described to the oracle for this particular case. However, I think this requires some additional capabilities from the oracle, which can make it more dangerous. I don't know.

UPD: And if we have the counterfactual oracle with my proposed fix, and then we use it like "make the list of possible actions, make some utility metric, ask the oracle about the value of the metric for each action in the list, execute the action with the best predicted results"... oops, the system "oracle + operators" is an Expected Utility Maximizer now! Maybe not the most dangerous one, but only until the operators figure out how to delegate creation of the list to the oracle.
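The loop described here is short enough to write down. This is a hypothetical sketch: `choose_action`, `toy_oracle`, and the table of predicted utilities are made-up names and numbers, standing in for whatever the operators and the oracle would actually do.

```python
from typing import Callable, Iterable

def choose_action(actions: Iterable[str], oracle: Callable[[str], float]) -> str:
    """Operators ask the oracle for the predicted metric of each action, then take the best."""
    return max(actions, key=oracle)

# Hypothetical oracle answers: predicted value of the operators' utility metric per action.
predicted_utility = {"do nothing": 0.0, "publish report": 1.5, "build factory": 4.2}

def toy_oracle(action: str) -> float:
    return predicted_utility[action]

chosen = choose_action(predicted_utility.keys(), toy_oracle)
```

Nothing in `choose_action` is agentic on its own, but the combined system picks whichever listed action has the highest predicted utility, which is what an expected utility maximizer over that list would do.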