Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

So a few posts ago I looked at the problem of not being able to anticipate all consequences of an action as being related to deceptive mesaoptimization, but also outer alignment too. This post digs more into some of the things I only touched on briefly in that post.

Editor’s note: I’m experimenting with having a lower quality threshold for just posting things even while I’m still confused and unconfident about my conclusions, but with this disclaimer at the top. Thanks to Kyle and AI_WAIFU for discussions.

Last time, we narrowed down the problem to [certain actions we can take now] leading to [a world such that if [RSA2048 or some other thing we can’t simulate] happens, then [bad things will happen]] (brackets added for easier parsing of this sentence). This is because we can't plan for a future where RSA-2048 is solved but if we can already see RSA2048 in the world, then we can plan forward from that and see when things blow up. 

So if we could somehow figure out how to ensure that no action puts us into an RSA2048-vulnerable world, then we can prevent this failure case. Unfortunately, knowing precisely whether a world is vulnerable would essentially require us to simulate RSA2048, which gets us nowhere. However, one observation is we don't have to know precisely, we just need to know whether a world could be vulnerable — like for example if the agent removes its own safeguard, then we wouldn’t want to allow that even if the model is in fact safe and wouldn’t destroy the world when it sees RSA2048.

Telling whether side effects are dangerous is hard 

One thing we might think to do is instead simulate putting a dummy AI in the “do rollouts to find out if these actions destroy the world” box (the word “box” here has no relation to the AI box experiment), such that this dummy AI tries to kill everyone when it sees something we can actually simulate, like the string "chariots chariots" or something. Then, we can execute the action in the simulation, and then simulate seeing that string, and see if the box successfully prevented the action from being executed in the real world. This way, we can test whether the box is broken by the action, right?

Unfortunately there's a huge counterexample to this. So suppose the AI proposes building a really complicated machine in the world that we don't understand. We can still simulate the consequences of having this machine exist, of course, but we can't simulate it past the RSA2048 barrier, and so this machine could do evil stuff after seeing RSA2048 and we'd never know. The more general lesson from this is that if the model proposes doing something with side effects in the real world that we can't understand, that allows it to build up stuff "outside of the box" and then that stuff could do bad things. If we try to understand everything the model does, we end up bottlenecked by the whole "doing science" thing again, and if we try to bring the things the model does back inside the box (i.e only allow the machine it builds to also propose actions within the box), then you end up never being able to modify the real world at all because any changes to the real world could actually be setting up something that blows up after RSA2048, which is totally useless. 

So, in general, it seems hard to know whether an action sets up something evil in the world that only activates after RSA2048. 

It’s (partial) ELK reduction time, my dudes

There is a partial ELK reduction here in that if we can solve ELK then we can ask the model whether it thinks its actions are putting the world into a vulnerable state. It's worth noting that this wouldn't work if we just asked the model whether RSA2048 would cause it to kill everyone, because a) the model could always just remove our outer loop with this whole ELK thing before ever seeing RSA2048, and trying to prevent that collapses to just asking it whether it puts the world into a vulnerable state, and b) this actually depends on a slightly stronger version of ELK, where not only do we generalize to action sequences we can't see, but also to questions we can't see (because we can never simulate question answer pairs of the form "if you see RSA2048, then do you do x?", so the prior has to do all the work here).

However, this isn’t necessarily enough to prevent putting the world into a vulnerable state. For instance, suppose that it’s actually very easy to stumble on actions that put the world into a vulnerable state in ways humans can’t understand. Then, even if you just have a weak optimizer (say, just best-of-n based on some value model) that isn’t really maliciously “trying” to create a vulnerable state, and/or doesn't even realize that its actions will lead to a vulnerable state, you still have a high chance of ending up in one anyways. (Update: I no longer endorse this, because to be able to do best-of-n, the value model would need to be able to model action consequences, which brings us back to ELK. In general, it seems difficult/impossible to reliably take actions which optimize for RSA2048-vulnerability without having some model of whether actions lead to RSA2048-vulnerability. Possible forthcoming post on this)

Stumbling into RSA2048-vulnerable states

One important thing here is that our abstraction where we pretend that RSA2048 is the only thing we can't model starts leaking here, because the more inaccurate our world model is, the easier it is to stumble into a vulnerable state, because there will simply be a lot more vulnerable states. So my intuition is that even though in these examples RSA2048-triggered optimization is a pretty unnatural thing to accidentally stumble upon, in practice this weak optimizer case matters a lot, because lots of things we might not be able to anticipate could trigger some optimizer we build in the world.

One reason we might expect it to be easy to stumble into these states is that lots of these states correspond to your AI building another AI to help it, which seems like a pretty convergently instrumental thing to do, and maybe your AI just fails at making the second AI inner aligned (but since the inner alignment failures happen behind the RSA2048 barrier, your world model rollouts think that this second AI is inner aligned!)

Another angle to look at the same thing is that this weak optimizer case is the case where even if the model isn’t “trying very hard” to get around your box, by default actions that are successful tend to be the ones that violate the integrity of the box. This framing makes weak optimizer failures seem pretty reasonable - the box is a pretty big constraint on the model, and it could get a lot more done a lot more efficiently without that constraint. 

What do we do?

Unfortunately I don't know if there's a solid solution to this problem. Our biggest advantage here is that the optimizer is restricted and so can't put the full weight of its optimization behind optimizing against us here, though even then you can still apply quite a bit of pressure. We might be able to put in this constraint with some kind of hack like a learned reward model to detect the creation of additional agents, which is obviously bottlenecked on science-doing, but because the optimization pressure isn't that strong it might be fine, and would probably allow us to stretch further than we could with just doing science on the original problem directly. And of course this all depends on ELK; if we can't solve that then this becomes even more of a problem.

I also feel like the idea of weak optimization behaving in ways that mislead us even if there’s no “intent” in some sense in the system is something that could be useful to think about more generally, not just in the context of mesaoptimization/world modeling. I think this is an important crux and I'm still confused about it.

New Comment