
TLDR: We might want to use some sort of oversight technique to avoid inner misalignment failures. Models will be too large and complicated for a human to understand alone, so we will use models to oversee models (or to help humans oversee models). In many proposals this overseer model is an ‘amplified’ version of the overseen model. Ideally you do this oversight throughout all of training so that the model never becomes even slightly misaligned without you catching it.
You can’t do oversight on a model close to initialization because it’s just a random soup of tensors. You also can’t use this close-to-initialized model to help you do oversight because it’s too dumb.
We will probably need to do some amount of pretraining to make our models good enough to be interpreted and also good enough to help with this interpreting. We need to ensure that this pretraining doesn’t itself make the model capably misaligned.


When we train powerful AI models, we want them to be both outer aligned and inner aligned; that is, we want to train them on the correct objective and for them to actually learn that objective. Many proposals for achieving both outer and inner alignment look like an outer alignment proposal with some kind of oversight strapped on to deal with inner alignment. Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if its behavior on the training distribution looks fine. In An overview of 11 proposals for building safe advanced AI, all but two of the proposals basically look like this, as does AI safety via market making.

Examples of oversight techniques include:

  • Transparency tools (either used by a human, an AI, or a human assisted by an AI)
  • Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
  • Relaxed adversarial training (which could be seen as an extension of adversarial inputs)

Oversight loss

We can use these oversight techniques to add an ‘oversight loss’ term to the loss function, which will hopefully steer training towards aligned models and away from misaligned ones. My model here is that we want to be doing oversight from very close to the start of training, to prevent the model from ever becoming misaligned. If we start doing oversight too late, then the model may already be deceptively misaligned, and our oversight tools are then much less likely to work (either because the model can obscure its cognition or because the oversight tools are themselves misaligned). I think of this as steering the training process away from the ‘deceptive regions’ of parameter space (parameters which correspond to deceptive models), without ever having to enter those dangerous regions.

Alternatively, rather than deception, we can think about regions where the model ‘behaves unacceptably’ on some reasonable inputs. If taking a gradient descent step in a direction would increase the size of the set of reasonable inputs on which the model behaves unacceptably, then hopefully the oversight loss would provide some pressure away from stepping in that direction.
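As a very rough sketch of what a single training step with an oversight loss term could look like, here is some illustrative PyTorch-style code. Everything in it is an assumption for illustration: `model` is assumed to return both logits and hidden states (HuggingFace-style), and `overseer` is a hypothetical module that maps the model’s internals to an acceptability score in [0, 1]. The specific log-penalty form is just one way to turn that score into a loss, not a claim about how any existing proposal does it.

```python
import torch

def training_step(model, overseer, batch, optimizer, oversight_weight=1.0):
    """One gradient step mixing the ordinary task loss with an 'oversight
    loss' that penalizes internals the overseer flags as unacceptable.
    (Illustrative sketch only; `overseer` and its interface are hypothetical,
    and in practice the overseer might be frozen or trained separately.)"""
    optimizer.zero_grad()

    # Standard predictive/task loss on the training batch.
    outputs = model(batch["inputs"])
    task_loss = torch.nn.functional.cross_entropy(
        outputs.logits.flatten(0, 1), batch["targets"].flatten()
    )

    # The overseer inspects the model's activations (a stand-in for "access
    # to the internals") and returns an acceptability score in [0, 1],
    # where 1 means "looks fine" and 0 means "looks unacceptable".
    acceptability = overseer(outputs.hidden_states)
    oversight_loss = -torch.log(acceptability + 1e-8).mean()

    # Gradient descent is now pushed away from directions that make the
    # overseer's acceptability score drop.
    loss = task_loss + oversight_weight * oversight_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the log penalty is just that it grows rapidly as the acceptability score approaches zero, so steps towards regions the overseer dislikes become increasingly costly.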

Actually doing oversight

I expect powerful AI models will be much too large and complicated for a human to understand or oversee alone. So we will need other models to help with this task; this could be via models assisting the humans, or we could hand the oversight process off entirely to an oversight model.

Here I’m taking a very prosaic view: the model being overseen is a large language model, and the overseer model is also an LLM, but trained to output some ‘acceptability score’ rather than to generate text.

In many proposals this oversight model is some amplified version of the model being overseen. This should hopefully make the job of the overseer easier, as it implicitly has access to all the knowledge and cognition of the model it is overseeing. This could also be implemented as having the overseer be an extra head on the original model.
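For concreteness, one hypothetical way to set up the ‘extra head’ variant is sketched below in PyTorch. All the module names are made up, and this is not a claim about how any specific proposal wires this up; it just shows an acceptability head sharing the same trunk (and hence the same internal computation) as the language-modeling head.

```python
import torch
import torch.nn as nn

class OverseenModelWithAcceptabilityHead(nn.Module):
    """A language model with an extra 'overseer' head that reads the same
    hidden states as the language-modeling head and outputs an acceptability
    score. (Illustrative sketch; `base_transformer` is any decoder-only LM
    body returning final hidden states of shape [batch, seq, d_model].)"""

    def __init__(self, base_transformer, d_model, vocab_size):
        super().__init__()
        self.body = base_transformer                      # shared trunk
        self.lm_head = nn.Linear(d_model, vocab_size)     # ordinary next-token head
        self.acceptability_head = nn.Sequential(          # overseer head
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),                                 # score in [0, 1]
        )

    def forward(self, input_ids):
        hidden = self.body(input_ids)                     # [batch, seq, d_model]
        logits = self.lm_head(hidden)                     # used for the task loss
        # The overseer head sees everything the shared trunk computes, which
        # is the sense in which it has access to the model's cognition.
        acceptability = self.acceptability_head(hidden[:, -1, :])  # [batch, 1]
        return logits, acceptability
```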

You can’t oversee models close to initialization

We can’t really do oversight on a model at the very start of training, because it will basically just be a random soup of tensors. We can’t use interpretability tools to look for coherent features or algorithms because there won’t be any. We can’t use adversarial inputs to test for bad behavior because the model’s outputs will just be random strings of tokens. The model needs to have reached some minimal level of capability for oversight (and even ‘alignment’) to make sense. Even if a human is more ‘capable’ than a freshly initialized model, this doesn’t at all mean that the human can do oversight on it.

Models close to initialization can’t help you do oversight

A dumb model at the start of training obviously can’t help us do oversight. At a minimum, the model needs to be able to understand and generate coherent text, and also to have an adequate understanding of what is ‘acceptable’. Both of these seem like the sort of capabilities that come from standard pretraining. Obviously pretraining on text helps a model learn to generate text. And pretraining on a lot of text seems like the kind of thing that would give an AI a model of what humans find ‘acceptable’ (not that this implies the AI cares what humans want, only that it has a reasonable model of it). We will then also need to do something further to train the model to be a useful overseer.

So it seems like we will need to do some minimal amount of pretraining for the model to be capable enough to help with oversight (and to be overseen). This takes us away from the ideal case where we apply oversight from the very beginning of training. For these kinds of oversight techniques to work, we need to ensure that the pretraining doesn’t itself introduce misalignment.
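Putting the pieces together, the resulting training schedule might look roughly like the sketch below: plain pretraining with no oversight term until the model clears some minimal capability bar, then the oversight loss switched on for the rest of training. Every name and threshold here is an assumption for illustration, and `training_step` refers to the hypothetical sketch in the ‘Oversight loss’ section above.

```python
def train_with_delayed_oversight(model, overseer, data_stream, optimizer,
                                 capability_eval, capability_threshold,
                                 oversight_weight=1.0, eval_every=1000):
    """Hypothetical schedule: ordinary pretraining until the model is capable
    enough to be overseen (and to help oversee), then the oversight loss is
    switched on for the rest of training. `capability_eval` is a stand-in for
    whatever evaluation decides when oversight becomes meaningful."""
    oversight_on = False
    for step, batch in enumerate(data_stream):
        # Periodically check whether the model has cleared the minimal
        # capability bar at which oversight starts to make sense.
        if not oversight_on and step % eval_every == 0:
            oversight_on = capability_eval(model) >= capability_threshold

        # Before that point this is just standard pretraining (weight 0);
        # after it, every gradient step also carries the oversight penalty,
        # so the model hopefully never drifts into misaligned regions of
        # parameter space unnoticed.
        weight = oversight_weight if oversight_on else 0.0
        training_step(model, overseer, batch, optimizer, oversight_weight=weight)
```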

3 comments

It strikes me that the kind of self-supervision you describe is suspiciously similar to trying to incorporate meta-preferences into the outer objective by self-modeling. When the model understands humans differently, it changes its notion of what it is to be misaligned or deceptive, which gets used to form a loss term against that kind of behavior, which then impacts how it understands humans.

This analogy:

  • makes me significantly more optimistic about interpretability inside the training loop.
  • makes me slightly more pessimistic about meta-preferences incorporated as fine-tuning based on self-modeling.
  • clarifies preference-related arguments for why getting the former to work means fulfilling more desiderata than current basic schemes do.

Maybe there's some way of doing oversight that verifies that it is either a random soup or some super sophisticated hidden attack? And maybe we can rule out the latter if the model hasn't undergone nearly enough training yet?

IMO scaling laws make this actually pretty easy. The curves are smooth, the relationship between the capabilities of a smaller model and a less-trained larger model seems fairly reliable, and obviously you can just use your previous or smaller model to bootstrap and/or crosscheck the new overseer. Bootstrapping trust at all is hard to ground out (humans with interpretability tools?), and I expect oversight to be difficult, but I don't expect doing it from the start of training to present any particular challenge.

(opinions my own, you know the drill)