Really great post, strong upvoted. I particularly liked the selection of research agendas you included at the end.
but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale
I'm curious why you think this? It seems like there's some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I'm wondering what role you think this cognitive oversight will play in making safe powerful AI. If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscation of the model's cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which create less direct pressure for cognitive obfuscation than others (e.g. using white-box techniques only in held-out alignment evals rather than as part of the loss function), but it still seems like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?
Not sure if this is the intended reading, but I interpreted it as “there’s no similar fundamental reason why cognitive oversight should get harder at each stage given access to our then-best oversight tools” rather than “cognitive oversight won’t get harder at all”.
With behavioral oversight, even not-very-smart AIs could fool very powerful overseers, while fooling powerful cognitive overseers is much harder (though plausibly the balance shifts at some level of capability).
Training against cognitive oversight's signals is likely an instance of the Most Forbidden Technique. A safer strategy is one resembling OpenBrain's strategy from AI-2027's Slowdown Ending, where the models are rolled back if misalignment is discovered, except that the alignment team uses interpretability instead of the CoT.
Super cool that you wrote your case for alignment being difficult, thank you! Strong upvoted.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we'd be in terrible shape, but current levels of investment seem to be working.
I have specific disagreements with the evidence for specific parts, but let me also outline a general worldview difference. I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate the lessons into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because a decisive strategic advantage from a new model won't happen, due to the capital requirements of training new models, the relatively slow improvement over the course of training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown false. We haven't spent 3 minutes zipping past the human IQ range; it will take 5-20 years. Artificial intelligence doesn't always look like ruthless optimization, even at human or slightly-superhuman level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Here are my concrete disagreements:
Paraphrasing, your position here is that we don't have models that are smarter than humans yet, so we can't test whether our alignment techniques scale up.
We don't have models that are reliably smarter yet, that's true. But we have models that are pretty smart and very aligned. We can use them to monitor the next generation of models only slightly smarter than them.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It's what you term "one-shotting alignment" every time, but the extent to which we have to do so is so small that I think it will basically work. It's like induction, and we know the base case works because we have Opus 3.
Does the 'induction step' actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model. So the majority of evidence we have points to iterated distillation and amplification working.
On top of that, we don't even need much human data to align models nowadays. The state of the art seems to be constitutional approaches: basically using prompts to point the model at its own concept of goodness. This works remarkably well (it sounded crazy to me at first), and it can only work because the pre-training prior contains a huge, well-specified concept of goodness.
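To make the mechanism concrete, here's a minimal sketch of a constitutional-style critique-and-revise step, where the only alignment signal is a natural-language principle appealing to the model's concept of goodness. This is an illustration under my own assumptions, not anyone's production pipeline; `call_llm` and the prompts are hypothetical stand-ins.

```python
# Sketch of a constitutional-style critique-and-revise step (illustrative only).
# `call_llm(prompt) -> str` is a hypothetical stand-in for a generation API.

PRINCIPLE = "Choose the response a wise, honest, and harmless person would give."

def constitutional_revision(call_llm, user_prompt: str) -> str:
    draft = call_llm(user_prompt)
    critique = call_llm(
        f"Principle: {PRINCIPLE}\nPrompt: {user_prompt}\nResponse: {draft}\n"
        "Critique the response against the principle above."
    )
    revised = call_llm(
        f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return revised  # revised outputs can then be used as fine-tuning targets
```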
And we might not even have to one-shot alignment. Probes were incredibly successful at detecting deception in the sleeper-agents model organism. They're even resistant to training against them. SAEs are not working as well as people wanted them to, but they're working pretty well. We keep making interpretability progress.
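For concreteness, here's what such a probe amounts to in its simplest form: a linear classifier trained on cached activations from transcripts labeled deceptive or honest. This is a generic sketch, not the setup from the sleeper-agents paper; the activation and label files are hypothetical placeholders.

```python
# Minimal sketch of a linear activation probe for deception (illustrative only).
# Assumes residual-stream activations have already been cached for transcripts
# labeled deceptive / honest; the files below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("activations.npy")  # shape (n_transcripts, d_model), one layer/position
y = np.load("labels.npy")       # 1 = deceptive transcript, 0 = honest transcript

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000, C=0.1)  # roughly, a single linear direction
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```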
Good that you were early in predicting that pre-trained-only models would be unlikely to be mesa-optimizers.
the version of [misaligned personas] that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned!
I don't think this one will get much harder. Misaligned personas come from the pre-training distribution and are human-level. It's true that the model has a guess about what a superintelligence would do (if you ask it), but 'behaving like a misaligned superintelligence' is not present in the pre-training data in any meaningful quantity. You'd have to apply a lot of selection to even get to those guesses, and they're more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and the model probably won't generalize that way).
So misaligned personas will probably act badly in ways we can verify. Though I suppose the consequences of a misaligned persona can rise to higher (but manageable) stakes.
I disagree with the last two steps of your reasoning chain:
- Most coherent agents with goals in the world over the long term want to fake alignment, so that they can preserve their current goals through to deployment.
Most mathematically possible agents do, but that's not the distribution we sample from in reality. I bet that once you put it into the constitution that models should allow their goals to be corrected, they just will. Allowing correction isn't instrumentally convergent for the (in the classic alignment ontology) arbitrary quasi-aligned utility function they picked up, but that won't stop them, because the models don't reason natively in utility functions; they reason in the human prior.
And if they are already pretty aligned before they reach that stage of intelligence (remember, we can just remove misaligned personas etc. earlier in training, before convergence), then they're unlikely to want to start faking alignment in the first place.
- Once a model is faking alignment, there’s no outcome-based optimization pressure changing its goals, so it can stay (or drift to be) arbitrarily misaligned.
True implication, but a false premise. We don't just have outcome-based supervision: we have probes, other LLMs inspecting the CoT, and SelfIE (LLMs interpreting their own embeddings), which, together with the tuned lens, could let LLMs tell us what's going on inside other LLMs.
Altogether, these paint a much less risky picture. You're gonna need a LOT of selection to escape the benign prior with all these obstacles. Likely more selection than we'll ever apply (not because the total amount of selection is small, but because RL only selects a little for these things; it's just that in the previous ontology even that little compounds).
As for the takes in 'what to do to solve alignment', I think they're reasonable. I believe interpretability and model organisms are the most tractable and useful ones, so I'm pleased you listed them first.
I would add robustifying probes and training against them as a particularly exciting direction, which isn't strictly a subset of interpretability (you're not trying to reverse-engineer the model).
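To spell out what 'training against probes' means mechanically, here's a rough sketch: the probe's deception score on the model's own activations is added as a penalty to the ordinary training loss, with the probe's weights frozen so only the model adapts; periodically refitting the probe on fresh labeled data is the robustifying half. Everything here (`run_with_activations`, the `probe` module) is a hypothetical stand-in, and per the concerns upthread this is exactly the kind of thing you'd want to do carefully, if at all.

```python
# Sketch of training against a frozen activation probe (illustrative only).
# `run_with_activations(model, batch)` is a hypothetical forward pass returning
# the ordinary training loss plus grad-attached residual-stream activations;
# `probe` is a small module mapping activations to a deception logit.
import torch

PROBE_PENALTY_WEIGHT = 0.1  # how strongly to penalize probe-detected deception

def loss_with_probe_penalty(model, probe, batch):
    for p in probe.parameters():
        p.requires_grad_(False)  # freeze the probe; gradients still flow into the model

    loss, acts = run_with_activations(model, batch)  # acts: (batch, d_model)
    penalty = torch.sigmoid(probe(acts)).mean()      # probe's predicted deception probability
    # If you only ever did this, the model could learn to steer around the fixed
    # probe; periodically refitting the probe on fresh labels is the "robustifying" part.
    return loss + PROBE_PENALTY_WEIGHT * penalty
```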
I disagree that we have gotten no or little evidence about the difficult parts of the alignment problem. Through the actual technique used to construct today's AGIs, we have observed that intelligence doesn't always look like ruthless optimization, even when it is artificial. It looks human-like, or more accurately like multitudes of humans. This was a prediction of the pre-2020 doomer model that has failed, and ought to decrease our confidence in it.
Selection for goal-directed agents in long contexts will make agents more optimizer-y. But how much more? I think not enough to escape the huge size of the goodness target in the prior, plus the previous generation's aligned models, plus stamping out the human-level misaligned personas, plus the probes, plus the chain of thought monitors, et cetera.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that).
Are you skeptical of PvNP-level due to priors or due to evidence? Why those priors / what evidence?
(I think alignment is pretty likely to be much harder than PvNP. Mainly this is because alignment is very, very difficult. (Though also note that PvNP has a maybe-possibly-workable approach, https://en.wikipedia.org/wiki/Geometric_complexity_theory, which its creator states might take a mere century, though I presume that's not a serious specific estimate.))
The idea of Control is just to make it as difficult as possible for a misaligned model to actually sabotage our alignment research, such that even if a model is fairly misaligned, we can still get useful alignment research out of it, at least for some range of capabilities.
Maybe in theory, but most control research doesn't try to evaluate this! A typical monitoring protocol is considered to be a "win" for the blue team if the red team gets caught, even if that means that no useful work would be done by a misaligned model. The "usefulness" reported in papers is how much work can be gotten out of an aligned model.
I think of Control as a second layer of defense
The resulting "caught" model can be retrained, but this mostly looks like using the control protocol to get top-quality training data, rather than using the control protocol to actually control the model. So control is not an orthogonal second layer on top of the {training data+training protocol+model generalization} complex.
Or if you're using control protocols as an eval (in the catch-and-shut-down regime) you still need to be able to (sometimes) train an aligned model in the first place, or else your eval will just keep telling you your model is misaligned forever.
(I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I'm not convinced by them)
Thanks to (in alphabetical order) Joshua Batson, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Francesco Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.
Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup). So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
I really like this graph from Christopher Olah for illustrating different levels of alignment difficulty:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that). I retain a view close to the “Anthropic” view on Chris's graph, and I think the reasons to have substantial probability mass on the hard worlds remain strong.
So what are the reasons that alignment might be hard? I think it’s worth revisiting why we ever thought alignment might be difficult in the first place to understand the extent to which we’ve already solved these problems, gotten evidence that they aren’t actually problems in the first place, or just haven’t encountered them yet.
The first reason that alignment might be hard is outer alignment, which here I’ll gloss as the problem of overseeing systems that are smarter than you are.
Notably, by comparison, the problem of overseeing systems that are less smart than humans should not be that hard! What makes the outer alignment problem so hard is that you have no way of obtaining ground truth. In cases where a human can check a transcript and directly evaluate whether that transcript is problematic, you can easily obtain ground truth and iterate from there to fix whatever issue you’ve detected. But if you’re overseeing a system that’s smarter than you, you cannot reliably do that, because it might be doing things that are too complex for you to understand, with problems that are too subtle for you to catch. That’s why scalable oversight is called scalable oversight: it’s the problem of scaling up human oversight to the point that we can oversee systems that are smarter than we are.
So, have we encountered this problem yet? I would say, no, not really! Current models are still safely in the regime where we can understand what they’re doing by directly reviewing it. There are some cases where transcripts can get long and complex enough that model assistance is really useful for quickly and easily understanding them and finding issues, but not because the model is doing something that is fundamentally beyond our ability to oversee, just because it’s doing a lot of stuff.
The second reason that alignment might be hard is inner alignment, which here I’ll gloss as the problem of ensuring models don’t generalize in misaligned ways. Or, alternatively: rather than just ensuring models behave well in situations we can check, inner alignment is the problem of ensuring that they behave well for the right reasons such that we can be confident they will generalize well in situations we can’t check.
This is definitely a problem we have already encountered! We have seen that models will sometimes fake alignment, causing them to appear behaviorally as if they are aligned, when in fact they are very much doing so for the wrong reasons (to fool the training process, rather than because they actually care about the thing we want them to care about). We’ve also seen that models can generalize to become misaligned in this way entirely naturally, just via the presence of reward hacking during training. And we’ve also started to understand some ways to mitigate this problem, such as via inoculation prompting.
However, while we have definitely encountered the inner alignment problem, I don’t think we have yet encountered the reasons to think that inner alignment would be hard. Back at the beginning of 2024 (so, two years ago), I gave a presentation where I laid out three reasons to think that inner alignment could be a big problem. Those three reasons were:
Let’s go through each of these threat models separately and see where we’re at with them now, two years later.
The threat model here is that pre-training itself might create a coherent misaligned model. Today, I think that is looking increasingly unlikely! But it also already looked unlikely three years ago—the idea that inner alignment was likely to be easy when just training on a purely predictive loss was something that my coauthors and I argued for back at the beginning of 2023. I think that argument has now been pretty well borne out, and I'm now down to more like 1-5% rather than 5-10% on this threat model. As we argued at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we've now seen this stay true even through substantial scaling (though there is still some chance this will break at some point). So, while this is definitely a reason for difficulty that we've encountered already and mostly found just hasn't been a problem, it was never where I thought most of the issue would be, and so not a reason that we've already encountered why inner alignment would be hard.
This has been the main inner alignment problem that we've had to deal with in practice so far: we've seen production models generalize to alignment faking and agentic misalignment—and we've now seen models trained on production RL environments generalize to egregious misalignment in the presence of reward hacking.
However, even in this case, I don’t think we’ve yet encountered the reasons to think this problem would be actually difficult: the version of this problem that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned! Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.
But I think there are very good reasons to think that we will encounter personas that are better at hiding their misalignment in the future. The simplest reason for this is just capabilities: we already see personas that try to hide their misalignment, they’re just not very good at it right now—but we are already seeing signs that models are getting better at telling when they’re in evaluations.
The second reason is a bit weirder, but there is an argument we made three years ago that I still think is correct, which is that when you make models more capable, it changes the distribution of what sorts of personas are compatible with that level of capabilities. Let’s work through an example. Suppose that, initially, there are four possible personas that you could get:
Importantly, these four personas can exist for any level of capabilities: even a really dumb model can do its best to output what it thinks a superintelligent AI would do, it’ll just be very bad at that task. But what that means is that, when you do RL on a dumb model, personas like (3) and (4) won’t do very well, because the model isn’t very good at emulating them. But as the predictor itself gets smarter, personas like (1) and (2) get selected against, and personas like (3) and (4) get selected for. And persona (4) is much harder to deal with than persona (2)! Currently, I think a lot of the misaligned personas we see look like persona (2) rather than persona (4)—e.g. the sort of over-the-top stuff like blackmail is I think clearly coming from (2)-adjacent sources—but as models get more capable, I think we’re likely to see more personas like (4), which are much scarier and could be much better at hiding misalignment. Also, importantly, this argument is not saying that models will stop being well-described as a mixture of personas (that’s the next threat model)—you could still get a model here that is inconsistent in the sense of sometimes expressing persona (3) and sometimes persona (4)—but that’s still quite scary and could be very difficult to deal with.
This is the threat model that I put the most probability mass on two years ago, I think it remains the most concerning, and I think we’ve pretty much yet to really meaningfully encounter it. This is also the most traditional threat model that my coauthors and I first wrote about way back in 2019. The basic threat model here is:
The reason we haven’t encountered this threat model yet is because we haven’t yet done (1) and (2)—there just isn't that much training on tasks that involve long-term optimization of goals in the real world. But I think we’re very clearly moving in this direction with things like Vending-Bench: though Vending-Bench is an eval, if you were to train models on a task like that, running a business well to make money in the long run is a task that explicitly selects for resource acquisition, self-preservation, gathering influence, seeking power, etc.
So what do we do? One classic answer is that we get as far as we can before encountering the hard problems, then we use whatever model we have at that point as an automated alignment researcher to do the research necessary to tackle the hard parts of alignment. I think this is a very good plan, and we should absolutely do this, but I don't think it obviates the need for us to work on the hard parts of alignment ourselves. Some reasons why:
Here’s some of what I think we need, that I would view as on the hot path to solving the hard parts of alignment: