(Thanks to Evan Hubinger for suggesting this problem and for comments on these ideas. Feedback is very welcome.)
Suppose we have a prior pgood(model) that we think is non-deceptive, such that we can sample from this prior and get good models that are inner-aligned.
These models may, for instrumental purposes, need to themselves produce models (via searches/optimization/meta-learning). For instance we might ask an outer model to play poker. Along the way the model may find it instrumentally useful to build an inner model of its opponent. We refer to these additional models as inner models, and the ones that produced them we call outer models.
A key challenge is that inner models may be deceptive even if outer models are not. In the poker example above, the inner model may in fact do a good job of predicting the opponent’s plays for purposes of deceiving both the outer model and the outer model’s overseer. Then, on deployment, the inner model could take over and pursue its own (misaligned) values.
This challenge is part of the argument that minimal circuits are deceptive. Even though minimal circuits in some sense are non-deceptive (to the extent that deception imposes a runtime performance overhead), on some tasks the fastest program is one which performs searches for policies (inner models) that themselves perform the task, and there is no reason to expect those inner models to be aligned with the outer model.
The basic worry here is that while we (by assumption) know how to search for non-deceptive models, our model is not so clever or so careful, and could well land on deceptive inner models (an “inner inner alignment” failure). In other words: we might have a good prior pgood that we use to search for non-deceptive models, but the models we find don’t know to use pgood in their own searches, and that’s dangerous.
It seems hard to reliably detect and disfavor models that run internal search using bad priors. So instead let’s encourage models to only search with good priors by giving the AI safe tools.
To do this we’ll assume that pgood favors simpler and faster models over more complex or slower ones. That is, in addition to whatever else it’s doing, the good prior has some elements of the speed and simplicity priors in it.
Next, we’ll make one of the basic building blocks of our models a ‘Safe Search Function’ (SSF). The SSF lets our outer model search over programs using the good prior pgood. So long as the outer model only ever searches using the SSF, it should only find non-deceptive inner models, which addresses our concern.
How can we get the outer model to exclusively use the SSF for searches? Even though the SSF is a complex and slow module, we can pretend that it is simple and fast for purposes of simplicity and speed priors. Maybe we count it as costing only as much as one boolean gate. That way, pgood favors models that use the SSF over models that implement their own searches.
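As a rough illustration of this discounted accounting, here is a minimal sketch. The primitive names, the cost numbers, and the "each unit of cost halves the weight" penalty are all assumptions made up for the example; the only idea taken from the text is that the prior bills the SSF as a single gate regardless of its true cost.

```python
import math

# Hypothetical "true" costs in gate equivalents (illustrative numbers only).
TRUE_COST = {"AND": 1, "OR": 1, "NOT": 1, "SSF": 10_000, "CUSTOM_SEARCH": 2_000}

# Costs as seen by the prior: identical, except the SSF is billed as one gate.
PRIOR_COST = {**TRUE_COST, "SSF": 1}


def prior_cost(program):
    """Total cost of a program (a list of primitive names) under the discounted accounting."""
    return sum(PRIOR_COST[op] for op in program)


def log_prior(program):
    """Unnormalized log-prior: every unit of cost halves the weight (assumed penalty form)."""
    return -prior_cost(program) * math.log(2)


uses_ssf = ["AND", "NOT", "SSF"]
rolls_own_search = ["AND", "NOT", "CUSTOM_SEARCH"]

# The SSF-based program is favored even though its true cost is higher.
print(prior_cost(uses_ssf), prior_cost(rolls_own_search))
print(log_prior(uses_ssf) > log_prior(rolls_own_search))
```

In this toy setup the custom search is truly cheaper than the SSF, yet the prior still prefers the program that calls the SSF, which is the incentive the discount is meant to create.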
Now what happens if the inner model wants to construct its own model (an inner inner model) for instrumental purposes? Can we make searches use the SSF all the way down?
I think so! Remember that the SSF searches using pgood, and that pgood is defined over the model space that contains the outer model. This means that the models SSF returns can themselves use the SSF (and indeed favor doing so over custom-made searches). So it’s SSF all the way down.
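The recursion can be made concrete with a toy sketch. Everything here is hypothetical: the primitive set, the brute-force enumeration, and the scoring rule stand in for whatever pgood and the real search would be. The point it illustrates is only that the SSF searches the same program space the outer model lives in, so the programs it returns can themselves contain SSF calls.

```python
from itertools import product

# The SSF is itself one of the primitives of the model space (assumption: a tiny toy language).
PRIMITIVES = ["AND", "OR", "NOT", "SSF"]
COST = {"AND": 1, "OR": 1, "NOT": 1, "SSF": 1}  # SSF billed as one gate


def model_space(max_len=2):
    """Enumerate every program of up to max_len primitives; the same space is used at every level."""
    for n in range(1, max_len + 1):
        yield from product(PRIMITIVES, repeat=n)


def safe_search(task_score, max_len=2):
    """Toy SSF: return the program with the best task score minus prior cost."""
    return max(
        model_space(max_len),
        key=lambda prog: task_score(prog) - sum(COST[op] for op in prog),
    )


def needs_search(prog):
    """Hypothetical task on which a program that performs a search scores higher."""
    return 5 if "SSF" in prog else 0


inner_model = safe_search(needs_search)
print(inner_model)  # the returned inner model itself calls the SSF
```

Because `safe_search` draws from `model_space`, and `model_space` includes the `SSF` primitive, an inner model that needs to search has the same cheap, safe option available to it as the outer model did.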
The idea here is that by adding a Safe Search Function (SSF) to the language our models are built out of, and by favoring its use in our prior, we make it likely that the outer models we obtain use the SSF to search safely rather than implementing their own searches. And by making sure that the SSF returns models that themselves have access to the SSF, we ensure that recursively-constructed inner models also search safely.
The main failure modes I am aware of, detailed below, have to do with cases where the SSF is not competitive. In these cases the model doesn’t rely on the SSF when it builds inner models and we're back to square one. I think these are pretty serious challenges, because they rely on us producing a better search than a superhuman model would, but there may be ways to overcome them, e.g. by giving the SSF a large compute advantage.
(Thanks to Nicholas Schiefer and Evan Hubinger for pointing these out, and to Pranav Gade for thoughts on making the SSF competitive.)
The model might be able to write its own search module that’s much better than the SSF. If it can do that then it might still prefer its own search to the SSF. Maybe the SSF uses SGD and the model has come up with something way better than SGD. Then it might just not use the SSF.
This one should be possible to spot during training. The sign to look for is powerful models using the SSF less than weak models. That would be a very bad sign because it means the powerful models have discovered their own, more powerful, optimizers and are using those instead.
I don’t think the right move here is to regularize based on the number of SSF calls, because that mostly incentivizes spurious calls. Rather, if you see in scaling tests that larger/more powerful models use the SSF less than smaller/weaker ones you should stop scaling, figure out what optimizer the model has discovered, and use that to improve the SSF.
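One way such a scaling check might look, as a sketch only: assume we can measure, for each trained model, the fraction of its internal searches that go through the SSF (how to measure that is not addressed here and is itself an assumption). The function names and the example numbers are hypothetical. The check fits a least-squares slope of SSF usage against log model size and flags a negative trend as a reason to pause scaling, rather than feeding the number back into training as a regularizer.

```python
import math


def ssf_usage_trend(runs):
    """Least-squares slope of SSF usage fraction against log(model size).

    runs: list of (model_size, fraction_of_searches_via_ssf) pairs.
    """
    xs = [math.log(size) for size, _ in runs]
    ys = [frac for _, frac in runs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct model sizes")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x


def should_pause_scaling(runs, tolerance=0.0):
    """Flag the warning sign from the text: larger models using the SSF less than smaller ones."""
    return ssf_usage_trend(runs) < -tolerance


# Made-up measurements for illustration only.
worrying = [(1e6, 0.9), (1e7, 0.7), (1e8, 0.3)]
reassuring = [(1e6, 0.5), (1e7, 0.6), (1e8, 0.8)]
print(should_pause_scaling(worrying), should_pause_scaling(reassuring))
```

The result is meant to be read by humans as a diagnostic between scaling steps; using it as a training signal would reintroduce the incentive for spurious SSF calls described above.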
It’s possible that the model can write a better search module than the SSF but that the SSF is useful as a component of that search (especially if the prior makes SSF calls cheap). In that case the model might decide to build its search module using SSF calls, in which case we never get any warning that it’s searching over dangerous program spaces.
For instance the model could use SSF calls to come up with ideas for how to write the program it wants, then write a program using building blocks from the SSF’s output. This no longer has any safety guarantees, because the program being run in the end is not directly produced by the SSF.
The only way I can think to get around this is to ensure that the SSF is competitive. I think there are a few ways to do this other than “be good at writing a performant model builder”:
These approaches have two drawbacks:
This failure mode seems like the most problematic one for the SSF because it comes with no warning and arises from our failure to be as smart as the outer model at building models (which is likely to come up at some point on the way to superhuman AGI).
The model needs to be able to cheaply translate into the SSF’s input format and out of the SSF’s output format. This needs to be cheaper than writing and running its own search/optimizer.
I’m skeptical that this is a big concern. My intuition is that in complex settings it’s almost always easier to frame/prepare inputs for a search problem than to execute the search, but it’s definitely possible.
This should be possible to spot during training. If models never use the SSF that’s a strong sign that the SSF was too hard to use relative to the model rolling its own search.
As before, I don’t think ‘Does it use the SSF?’ is a good regularizer though. That just gets you a model that uses the SSF spuriously and still rolls its own search. Rather, the model never using the SSF should be taken as a sign that the SSF needs to be refactored to be easier to use.