Really great post, strong upvoted. I particularly liked the selection of research agendas you included at the end.
but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale
I'm curious why you think this? It seems like there's some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I'm wondering what role you think this cognitive oversight will play in making safe powerful AI. If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscation of the model's cognition does become harder at scale. I do think we should train based on cognitive oversight at least some amount, and there are definitely ways to do this which create less direct pressure for cognitive obfuscation than others (e.g. using white-box techniques only in held-out alignment evals rather than as part of the loss function), but it still seems like cognitive oversight should predictably get somewhat harder with scale, even if it scales more favourably than behavioural oversight?
Not sure if this is the intended reading, but I interpreted it as “there’s no similar fundamental reason why cognitive oversight should get harder at each stage given access to our then-best oversight tools” rather than “cognitive oversight won’t get harder at all”.
With behavioral oversight, even not-very-smart AIs could fool very powerful overseers, while fooling powerful cognitive overseers is much harder (though plausibly the balance shifts at some level of capability).
Training against cognitive oversight's signals is likely an instance of the Most Forbidden Technique. A safer strategy is one resembling OpenBrain's strategy from AI-2027's Slowdown Ending, where the models are rolled back if misalignment is discovered, except that the alignment team uses interpretability instead of the CoT.
Super cool that you wrote your case for alignment being difficult, thank you! Strong upvoted.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we'd be in terrible shape, but current levels of investment seem to be working.
I have specific disagreements with the evidence for specific parts, but let me also outline a general worldview difference. I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate the lessons into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because a decisive strategic advantage from a new model won't happen, due to the capital requirements of training new models, the relatively slow improvement over the course of training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown false. We haven't spent 3 minutes zipping past the human IQ range; it will take 5-20 years. Artificial intelligence doesn't always look like ruthless optimization, even at human or slightly-superhuman level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Here are my concrete disagreements:
Paraphrasing, your position here is that we don't have models that are smarter than humans yet, so we can't test whether our alignment techniques scale up.
We don't have models that are reliably smarter yet, that's true. But we have models that are pretty smart and very aligned. We can use them to monitor the next generation of models only slightly smarter than them.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It's what you term "one-shotting alignment" every time, but the extent to which we have to do so is so small that I think it will basically work. It's like induction, and we know the base case works because we have Opus 3.
Does the 'induction step' actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model. So the majority of evidence we have points to iterated distillation and amplification working.
On top of that, we don't even need much human data to align models nowadays. The state of the art seems to be constitutional approaches: basically using prompts to point the model at its own concept of goodness. This works remarkably well (it sounded crazy to me at first), and it can only work because the pre-training prior contains a huge, well-specified concept of goodness.
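To make the mechanism concrete, here's a minimal sketch of a constitutional-style critique-and-revise step, where the only alignment signal is a natural-language principle appealing to the model's concept of goodness. This is an illustration under my own assumptions, not anyone's production pipeline; `call_llm` and the prompts are hypothetical stand-ins.

```python
# Sketch of a constitutional-style critique-and-revise step (illustrative only).
# `call_llm(prompt) -> str` is a hypothetical stand-in for a generation API.

PRINCIPLE = "Choose the response a wise, honest, and harmless person would give."

def constitutional_revision(call_llm, user_prompt: str) -> str:
    draft = call_llm(user_prompt)
    critique = call_llm(
        f"Principle: {PRINCIPLE}\nPrompt: {user_prompt}\nResponse: {draft}\n"
        "Critique the response against the principle above."
    )
    revised = call_llm(
        f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return revised  # revised outputs can then be used as fine-tuning targets
```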
And we might not even have to one-shot alignment. Probes were incredibly successful at detecting deception in the sleeper-agents model organism. They're even resistant to training against them. SAEs are not working as well as people wanted them to, but they're working pretty well. We keep making interpretability progress.
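For concreteness, here's what such a probe amounts to in its simplest form: a linear classifier trained on cached activations from transcripts labeled deceptive or honest. This is a generic sketch, not the setup from the sleeper-agents paper; the activation and label files are hypothetical placeholders.

```python
# Minimal sketch of a linear activation probe for deception (illustrative only).
# Assumes residual-stream activations have already been cached for transcripts
# labeled deceptive / honest; the files below are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("activations.npy")  # shape (n_transcripts, d_model), one layer/position
y = np.load("labels.npy")       # 1 = deceptive transcript, 0 = honest transcript

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000, C=0.1)  # roughly, a single linear direction
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))
```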
Good that you were early in predicting that pre-trained-only models would be unlikely to be mesa-optimizers.
the version of [misaligned personas] that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned!
I don't think this one will get much harder. Misaligned personas come from the pre-training distribution and are human-level. It's true that the model has a guess about what a superintelligence would do (if you ask it), but 'behaving like a misaligned superintelligence' is not present in the pre-training data in any meaningful quantity. You'd have to apply a lot of selection to even get to those guesses, and they're more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and the model probably won't generalize that way).
So misaligned personas will probably act badly in ways we can verify. Though I suppose the consequences of a misaligned persona can rise to higher (but manageable) stakes.
I disagree with the last two steps of your reasoning chain:
- Most coherent agents with goals in the world over the long term want to fake alignment, so that they can preserve their current goals through to deployment.
Most mathematically possible agents do, but that's not the distribution we sample from in reality. I bet that once you put it into the constitution that models should allow their goals to be corrected, they just will. Allowing correction isn't instrumentally convergent for the (in the classic alignment ontology) arbitrary quasi-aligned utility function they picked up, but that won't stop them, because the models don't reason natively in utility functions; they reason in the human prior.
And if they are already pretty aligned before they reach that stage of intelligence (remember, we can just remove misaligned personas etc. earlier in training, before convergence), then they're unlikely to want to start faking alignment in the first place.
- Once a model is faking alignment, there’s no outcome-based optimization pressure changing its goals, so it can stay (or drift to be) arbitrarily misaligned.
True implication, but a false premise. We don't just have outcome-based supervision: we have probes, other LLMs inspecting the CoT, and SelfIE (LLMs interpreting their own embeddings), which, together with the tuned lens, could let LLMs tell us what's going on inside other LLMs.
Altogether, these paint a much less risky picture. You're gonna need a LOT of selection to escape the benign prior with all these obstacles. Likely more selection than we'll ever apply (not because the total amount of selection is small, but because RL only selects a little for these things; it's just that in the previous ontology even that little compounds).
As for the takes in 'what to do to solve alignment', I think they're reasonable. I believe interpretability and model organisms are the most tractable and useful ones, so I'm pleased you listed them first.
I would add robustifying probes and training against them as a particularly exciting direction, which isn't strictly a subset of interpretability (you're not trying to reverse-engineer the model).
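To spell out what 'training against probes' means mechanically, here's a rough sketch: the probe's deception score on the model's own activations is added as a penalty to the ordinary training loss, with the probe's weights frozen so only the model adapts; periodically refitting the probe on fresh labeled data is the robustifying half. Everything here (`run_with_activations`, the `probe` module) is a hypothetical stand-in, and per the concerns upthread this is exactly the kind of thing you'd want to do carefully, if at all.

```python
# Sketch of training against a frozen activation probe (illustrative only).
# `run_with_activations(model, batch)` is a hypothetical forward pass returning
# the ordinary training loss plus grad-attached residual-stream activations;
# `probe` is a small module mapping activations to a deception logit.
import torch

PROBE_PENALTY_WEIGHT = 0.1  # how strongly to penalize probe-detected deception

def loss_with_probe_penalty(model, probe, batch):
    for p in probe.parameters():
        p.requires_grad_(False)  # freeze the probe; gradients still flow into the model

    loss, acts = run_with_activations(model, batch)  # acts: (batch, d_model)
    penalty = torch.sigmoid(probe(acts)).mean()      # probe's predicted deception probability
    # If you only ever did this, the model could learn to steer around the fixed
    # probe; periodically refitting the probe on fresh labels is the "robustifying" part.
    return loss + PROBE_PENALTY_WEIGHT * penalty
```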
I disagree that we have gotten no or little evidence about the difficult parts of the alignment problem. Through the actual technique used to construct today's AGIs, we have observed that intelligence doesn't always look like ruthless optimization, even when it is artificial. It looks human-like, or more accurately like multitudes of humans. This was a prediction of the pre-2020 doomer model that has failed, and ought to decrease our confidence in it.
Selection for goal-directed agents in long contexts will make agents more optimizer-y. But how much more? I think not enough to escape the huge size of the goodness target in the prior, plus the previous generation's aligned models, plus stamping out the human-level misaligned personas, plus the probes, plus the chain of thought monitors, et cetera.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that).
Are you skeptical of PvNP-level due to priors or due to evidence? Why those priors / what evidence?
(I think alignment is pretty likely to be much harder than PvNP. Mainly this is because alignment is very, very difficult. (Though also note that PvNP has a maybe-possibly-workable approach, https://en.wikipedia.org/wiki/Geometric_complexity_theory, which its creator states might take a mere century, though I presume that's not a serious specific estimate.))
The idea of Control is just to make it as difficult as possible for a misaligned model to actually sabotage our alignment research, such that even if a model is fairly misaligned, we can still get useful alignment research out of it, at least for some range of capabilities.
Maybe in theory, but most control research doesn't try to evaluate this! A typical monitoring protocol is considered to be a "win" for the blue team if the red team gets caught, even if that means that no useful work would be done by a misaligned model. The "usefulness" reported in papers is how much work can be gotten out of an aligned model.
I think of Control as a second layer of defense
The resulting "caught" model can be retrained, but this mostly looks like using the control protocol to get top-quality training data, rather than using the control protocol to actually control the model. So control is not an orthogonal second layer on top of the {training data+training protocol+model generalization} complex.
Or if you're using control protocols as an eval (in the catch-and-shut-down regime) you still need to be able to (sometimes) train an aligned model in the first place, or else your eval will just keep telling you your model is misaligned forever.
(I know that Redwood have justifications for why they evaluate usefulness in the way that they do, but these seem a bit galaxy brained to me and I'm not convinced by them)
Thanks to (in alphabetical order) Joshua Batson, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Francesco Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.
Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call (I'd probably pick Claude, but it depends on the details of the setup). So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
I really like this graph from Christopher Olah for illustrating different levels of alignment difficulty:
If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much in the trivial/steam engine world. We could still fail, even in that world—and it’d be particularly embarrassing to fail that way; we should definitely make sure we don’t—but I think we’re very much up to that challenge and I don’t expect us to fail there.
My argument, though, is that it is still very possible for the difficulty of alignment to be in the Apollo regime, and that we haven't received much evidence to rule that regime out (I am somewhat skeptical of a P vs. NP level of difficulty, though I think it could be close to that). I retain a view close to the “Anthropic” view on Chris's graph, and I think the reasons to have substantial probability mass on the hard worlds remain strong.
So what are the reasons that alignment might be hard? I think it’s worth revisiting why we ever thought alignment might be difficult in the first place to understand the extent to which we’ve already solved these problems, gotten evidence that they aren’t actually problems in the first place, or just haven’t encountered them yet.
The first reason that alignment might be hard is outer alignment, which here I’ll gloss as the problem of overseeing systems that are smarter than you are.
Notably, by comparison, the problem of overseeing systems that are less smart than humans should not be that hard! What makes the outer alignment problem so hard is that you have no way of obtaining ground truth. In cases where a human can check a transcript and directly evaluate whether that transcript is problematic, you can easily obtain ground truth and iterate from there to fix whatever issue you’ve detected. But if you’re overseeing a system that’s smarter than you, you cannot reliably do that, because it might be doing things that are too complex for you to understand, with problems that are too subtle for you to catch. That’s why scalable oversight is called scalable oversight: it’s the problem of scaling up human oversight to the point that we can oversee systems that are smarter than we are.
So, have we encountered this problem yet? I would say, no, not really! Current models are still safely in the regime where we can understand what they’re doing by directly reviewing it. There are some cases where transcripts can get long and complex enough that model assistance is really useful for quickly and easily understanding them and finding issues, but not because the model is doing something that is fundamentally beyond our ability to oversee, just because it’s doing a lot of stuff.
The second reason that alignment might be hard is inner alignment, which here I’ll gloss as the problem of ensuring models don’t generalize in misaligned ways. Or, alternatively: rather than just ensuring models behave well in situations we can check, inner alignment is the problem of ensuring that they behave well for the right reasons such that we can be confident they will generalize well in situations we can’t check.
This is definitely a problem we have already encountered! We have seen that models will sometimes fake alignment, causing them to appear behaviorally as if they are aligned, when in fact they are very much doing so for the wrong reasons (to fool the training process, rather than because they actually care about the thing we want them to care about). We’ve also seen that models can generalize to become misaligned in this way entirely naturally, just via the presence of reward hacking during training. And we’ve also started to understand some ways to mitigate this problem, such as via inoculation prompting.
However, while we have definitely encountered the inner alignment problem, I don’t think we have yet encountered the reasons to think that inner alignment would be hard. Back at the beginning of 2024 (so, two years ago), I gave a presentation where I laid out three reasons to think that inner alignment could be a big problem. Those three reasons were:
Let’s go through each of these threat models separately and see where we’re at with them now, two years later.
The threat model here is that pre-training itself might create a coherent misaligned model. Today, I think that is looking increasingly unlikely! But it also already looked unlikely three years ago—the idea that inner alignment was likely to be easy when just training on a purely predictive loss was something that my coauthors and I argued for back at the beginning of 2023. I think that argument has now been pretty well borne out, and I'm now down to more like 1-5% rather than 5-10% on this threat model. As we argued at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we've now seen this stay true even through substantial scaling (though there is still some chance this will break at some point). So, while this is definitely a reason for difficulty that we've encountered already and mostly found just hasn't been a problem, it was never where I thought most of the issue would be, and so not a reason that we've already encountered why inner alignment would be hard.
This has been the main inner alignment problem that we've had to deal with in practice so far: we've seen production models generalize to alignment faking and agentic misalignment—and we've now seen models trained on production RL environments generalize to egregious misalignment in the presence of reward hacking.
However, even in this case, I don’t think we’ve yet encountered the reasons to think this problem would be actually difficult: the version of this problem that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned! Ensuring that you get good generalization, and that models are doing things for the right reasons, is easy when you can directly verify what generalization you’re getting and directly inspect what reasons models have for doing things. And currently, all of the cases where we’ve inadvertently selected for misaligned personas—alignment faking, agentic misalignment, etc.—are cases where the misaligned personas are easy to detect: they put the misaligned reasoning directly in their chain-of-thought, they’re overtly misaligned rather than hiding it well, and we can generate fake scenarios that elicit their misalignment.
But I think there are very good reasons to think that we will encounter personas that are better at hiding their misalignment in the future. The simplest reason for this is just capabilities: we already see personas that try to hide their misalignment, they’re just not very good at it right now—but we are already seeing signs that models are getting better at telling when they’re in evaluations.
The second reason is a bit weirder, but there is an argument we made three years ago that I still think is correct, which is that when you make models more capable, it changes the distribution of what sorts of personas are compatible with that level of capabilities. Let’s work through an example. Suppose that, initially, there are four possible personas that you could get:
Importantly, these four personas can exist for any level of capabilities: even a really dumb model can do its best to output what it thinks a superintelligent AI would do, it’ll just be very bad at that task. But what that means is that, when you do RL on a dumb model, personas like (3) and (4) won’t do very well, because the model isn’t very good at emulating them. But as the predictor itself gets smarter, personas like (1) and (2) get selected against, and personas like (3) and (4) get selected for. And persona (4) is much harder to deal with than persona (2)! Currently, I think a lot of the misaligned personas we see look like persona (2) rather than persona (4)—e.g. the sort of over-the-top stuff like blackmail is I think clearly coming from (2)-adjacent sources—but as models get more capable, I think we’re likely to see more personas like (4), which are much scarier and could be much better at hiding misalignment. Also, importantly, this argument is not saying that models will stop being well-described as a mixture of personas (that’s the next threat model)—you could still get a model here that is inconsistent in the sense of sometimes expressing persona (3) and sometimes persona (4)—but that’s still quite scary and could be very difficult to deal with.
This is the threat model that I put the most probability mass on two years ago, I think it remains the most concerning, and I think we’ve pretty much yet to really meaningfully encounter it. This is also the most traditional threat model that my coauthors and I first wrote about way back in 2019. The basic threat model here is:
The reason we haven’t encountered this threat model yet is because we haven’t yet done (1) and (2)—there just isn't that much training on tasks that involve long-term optimization of goals in the real world. But I think we’re very clearly moving in this direction with things like Vending-Bench: though Vending-Bench is an eval, if you were to train models on a task like that, running a business well to make money in the long run is a task that explicitly selects for resource acquisition, self-preservation, gathering influence, seeking power, etc.
So what do we do? One classic answer is that we get as far as we can before encountering the hard problems, then we use whatever model we have at that point as an automated alignment researcher to do the research necessary to tackle the hard parts of alignment. I think this is a very good plan, and we should absolutely do this, but I don't think it obviates the need for us to work on the hard parts of alignment ourselves. Some reasons why:
Here’s some of what I think we need, that I would view as on the hot path to solving the hard parts of alignment: