There’s a potential failure mode where RL (e.g. RLVR or otherwise) is necessary to get powerful capabilities. Right?
I for one don’t really care about whether the LLMs of May 2025 are aligned or not, because they’re not that capable. E.g. they would not be able to autonomously write a business plan and found a company and grow it to $1B/year of revenue. So something has to happen between now and then to make AI more capable. And I for one expect that “something” to involve RL, for better or worse (well, mostly worse). I’ve been saying that RL is necessary for powerful capabilities, i.e. (self)-supervised learning will only get you so far, since I think 2020, shortly after I got into AGI safety, and that prediction of mine is arguably being borne out in a small way by the rise of RLVR (and I personally expect a much bigger shift towards RL before we get superintelligence).
What’s your take on that? This post seems to only talk about RL in the context of alignment not capabilities, unless I missed it. I didn’t read the linked papers.
My concern is that, if you're using RL to train a frontier system that's human-level or above, whether for alignment or capabilities purposes, it will inevitably find ways to abuse flaws in our RL rating system. One exception might be if the RL is for some capability like reasoning to produce a proof that passes proof checking, where it might be possible to create a rating system that actually has no flaws to exploit. I don't see how we could do that for RL for alignment, however.
Right, but what I'm saying is that there's at least a possibility that RL is the only way to train a frontier system that's human-level or above.
In that case, if the alignment plan is "Well just don't use RL!", then that would be synonymous with "Well just don't build AGI at all, ever!". Right?
...And yeah sure, you can say that, but it would be misleading to call it a solution to inner alignment, if indeed that's the situation we're in.
Why would we have to use RL to do this? The problem of building a rater for RL closely resembles automating the labelling problem for preparing the dataset for SGD safety pretraining, except that for online RL the rater is harder: it has to run fast, it can't be human assisted, and it has to be able to cope with arbitrary adversarial shifts in the distribution being rated and do so well enough for it to not have exploitable flaws. A rater for (or at least attaching ratings to the episode set for) offline RL is less bad: it's an almost equivalent problem to labelling a dataset for SGD, just attaching a score rather than a binary classification. The primary difference is that for the safety pretraining approach the behavior we're training into the model is a classifier that labels behavior either good or bad, so isn't prone to Goodharting when you run it and ask for output from just one of the two categories, whereas for offline RL we're training a policy that tries to maximize the goodness rating, so is prone to Goodharting when the gradient towards the very "best" behavior leads it outside the training distribution. (The reason the SGD-trained classifier is safe is closely related to the satisficing approach to avoid Goodhart's Law.) So from the rating and stability point of view online RL is more challenging than offline RL, which is more challenging than safety pretraining SGD.
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples? Why do you think there is at least a possibility that RL could be the only way to train a frontier system that's human-level or above? I'm not currently seeing any potential advantage of RL — other than the fact it induces distribution shifts, during training for online RL, or after it for offline RL, so doesn't require us to already know the distribution we want: but these distribution shifts are exactly the source of its danger.
Let me give you a detailed prescription. For whatever RL training scheme you think we need, convert the rater for that to a satisficing binary classifier (classes: good enough vs not good enough behavior), and run it over a large training set of episodes matching the distribution of data you want your model to produce. Do SGD pretraining from that, and condition the generation from the result on the "good" label. My claim is that the output will be functionally equivalent to your RL-trained model, but its behavior will be more predictable in advance from the training set since there are no inherent distribution shifts. For there to be a possibility that RL could be the only way to train a frontier system that's human-level or above, either this would need to be false, or some aspect of the proposed input would need to not be computable/generatable for us, other than via the RL training process (whose output can clearly generate this). Which of these are you proposing might occur?
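To make that prescription concrete, here is a minimal sketch. All the names are hypothetical, and `rate_episode` stands in for whatever rater the RL scheme would otherwise have used:

```python
# Minimal sketch of the prescription above: convert the RL rater into a
# satisficing binary label, tag the episodes, train with ordinary next-token
# prediction, and condition on the "good" tag at inference time.
# All names here are hypothetical.

GOOD_TAG, BAD_TAG = "<|good|>", "<|bad|>"

def label_episode(episode_text, rate_episode, threshold):
    """Satisficing conversion: 'good enough' vs 'not good enough', not a score to maximize."""
    score = rate_episode(episode_text)  # the rater you would have used for RL
    tag = GOOD_TAG if score >= threshold else BAD_TAG
    return f"{tag}{episode_text}"

def build_pretraining_corpus(episodes, rate_episode, threshold=0.8):
    # The episodes should already match the distribution you want the model to produce.
    return [label_episode(e, rate_episode, threshold) for e in episodes]

# Training is then plain self-supervised SGD (cross-entropy next-token prediction)
# on the tagged corpus; no reward-maximization loop is involved.

def conditioned_prompt(user_prompt):
    # Inference-time conditioning: ask for generation from the "good" class only.
    return f"{GOOD_TAG}{user_prompt}"
```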
Let me give you a detailed prescription…
For example, people want AIs that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
People are trying to build such AIs as we speak, and I don’t expect them to quit until they succeed (or until we all die from their attempt).
And it’s possible—if human brains (or groups of human brains) can do this, so can AI algorithms. But human brains involve (model-based) RL. It’s an open question whether there exists a non-RL algorithm that can also do that. (LLMs as of today obviously cannot.)
I think the issue here is: “some aspect of the proposed input would need to not be computable/generatable for us”.
If the business is supposed to be new and out-of-the-box and innovative, then how do you generate on-distribution data? It’s gonna be something that nobody has ever tried before; “out-of-distribution” is part of the problem description, right?
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples?
Not all RL is “RL on [human] rated examples” in the way that you’re thinking of it. Jeff Bezos’s brain involves (model-based) RL, but it’s not like he tried millions of times to found millions of companies, and his brain gave a reward signal for the companies that grew to $1B/year revenue, and that’s how he wound up able to found and run Amazon. In fact Amazon was the first company he ever founded.
Over the course of my lifetime I’ve had a billion or so ideas pass through my head. My own brain RL system was labeling these ideas as good or bad (motivating or demotivating), and this has led to my learning over time to have more good ideas (“good” according to certain metrics in my own brain reward function). If a future AI was built like that, having a human hand-label the AI’s billion-or-so “thoughts” as good or bad would not be viable. (Further discussion in §1.1 here). For one thing, there’s too many things to label. For another thing, the ideas-to-be-rated are inscrutable from the outside.
I’m also still curious how you think about RLVR. Companies are using RLVR right now to make their models better at math. Do you have thoughts on how they can make their models equally good at math without using RLVR, or any kind of RL, or anything functionally equivalent to RL?
Also, here’s a challenge which IMO requires RL [Update: oops, bad example, see Zack’s response]. I have just invented a chess variant, Steve-chess. It’s just like normal chess except that the rooks and bishops can only move up to four spaces at a time. I want to make a computer play that chess variant much better than any unassisted human ever will. I only want to spend a few person-years of R&D effort to make that happen (which rules out laborious hand-coding of strategy rules).
That’s the Steve-chess challenge. I can think of one way to solve the Steve-chess challenge: the AlphaZero approach. But that involves RL. Can you name any way to solve this same challenge without RL (or something functionally equivalent to RL)?
Can you name any way to solve [chess but with rooks and bishops not being able to move more than four squares at a time] without RL (or something functionally equivalent to RL)?
This isn't even hard. Just take a pre-2017 chess engine, and edit the rules code so that rooks and bishops can only move four spaces. You're probably already done: the core minimax search still works, α–β pruning still works, quiescence still works, &c. To be fair, the heuristic evaluation function won't be correct, but you could just ... make bishops and rooks be respectively worth 2.5 and 3.5 points instead of the traditional 3 and 5? Even if my guess at those point values is wrong, that should still be easily superhuman with 2017 algorithms on 2017 hardware. (Stockfish didn't incorporate neural networks until 2020.)
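For concreteness, the two edits might look roughly like this. It's a toy sketch against a hypothetical engine interface (not Stockfish's actual internals), with the piece values being the guesses from the paragraph above:

```python
# Toy sketch of the Steve-chess edits: cap rook/bishop slides at four squares,
# and adjust the material values the evaluation uses. The board interface
# (on_board, occupied, pieces) is hypothetical.

MAX_SLIDE = {"rook": 4, "bishop": 4}  # the Steve-chess rule change

PIECE_VALUE = {  # guessed values: bishops 2.5 and rooks 3.5 instead of 3 and 5
    "pawn": 1.0, "knight": 3.0, "bishop": 2.5, "rook": 3.5, "queen": 9.0,
}

def sliding_moves(board, square, piece, directions):
    """Generate sliding moves, truncated at the Steve-chess four-square limit."""
    limit = MAX_SLIDE.get(piece, 7)  # 7 squares = effectively unlimited on an 8x8 board
    for d in directions:
        for step in range(1, limit + 1):
            target = square + step * d
            if not board.on_board(target):
                break
            yield (square, target)
            if board.occupied(target):  # stop sliding once blocked
                break

def material_eval(board):
    """Plain material count with the adjusted piece values (positive favors White)."""
    return sum(PIECE_VALUE[p.kind] * (1 if p.is_white else -1) for p in board.pieces())
```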
Incidentally, there are a great many variant versions of chess with different piece-move rules (collectively sometimes called "fairy chess"), and I think even quite a lot of collected games for some of the more popular rule variants. Training an AI to play many types of fairy chess, and even arbitrary new just-invented ones, might be an interesting project that covers some aspects of generalizing out-of-distribution and positive transfer. A suitably-edited-for-the-variant version of Stockfish makes a pretty strong baseline for this. Using AlphaZero per variant is another obvious baseline.
Hmm, you’re probably right.
But I think my point would have worked if I had suggested a modified version of Go rather than chess?
There's not a lot of scope for aligned/unaligned behavior in Go (or chess): it's a zero-sum game, so I don't see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has a scope for aligned/unaligned or at least moral/immoral behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource management or strategy task that might get assigned to an AI.
I was trying to argue in favor of:
CLAIM: there are AI capabilities things that cannot be done without RL training (or something functionally equivalent to RL training).
It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?
There are certainly things that it's easier to do with RL — whether it's ever an absolute requirement I'm less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that's the case I'm not familiar with the details — I'd love references to anything relevant to this, if anyone has them.
My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it's basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution so avoids Goodharting issues.
Suppose we lived in a spatially-finite universe with simple deterministic laws of physics that we have fully colonized, in which we can run a computation for any finite number of steps that we can specify. (For example, everyone agrees to hibernate until it's done.) Let's use it to play Go.
Run all ~2^2^33 programs ("contestants") that fit in a gigabyte against each other from all ~3^19^2 possible positions. Delete all contestants that use more than 2^2^2^2^2^100 CPU cycles on any one move. For every position from which some contestant wins every match, delete every contestant that doesn't win every match.
This enforces ~perfect play. Is it safe to pick a surviving contestant pseudorandomly? Not clearly: Consider the following reasonably-common kind of contestant.
What logical consequences might it prove? "1=1" for one, but that will count towards every O. "Source code C, given input I, produces output O." for another, but that's a pretty long one. If it would be the survivor in line to be pseudorandomly selected, most consequences of its decision are via the effects on our universe!
So if it predicts that it would be selected[1], it will output perfect play to survive, and then keep being consequentialist about any choice between two winning strategies - for example, it might spell out a message if we would watch the winner play, or it could steer our experiment's statistics to inspire a follow-up experiment that will, due to a bug, run forever rather than ever waking us up from hibernation.
Or by some tweaking of 2., if it assumes that it would be selected because otherwise the choice of O doesn't matter,
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
The SGD safety pretraining equivalent would be to include that transcript in the pretraining dataset (or, since such data is very rare and useful/high quality, perhaps an entrepreneurship-specific fine-tuning dataset). So far, very similar. You would also (likely AI-assisted) look through all of the transcript, and if you located any portions where the behavior was less wise or less moral/aligned than the behavior we'd like to see from an aligned AI-entrepreneur, label that portion with <|unaligned|> tags (or whatever), and perhaps also supplement it with commentary on subjects like why it is less wise/moral/aligned than the standards for an aligned AI, what should have been done instead, and speculations around the likely results of those counterfactual actions.
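As a sketch of what that labeling step might look like (the tag format and the `review` helper are hypothetical, just to illustrate the shape of the data):

```python
# Sketch of the transcript-labeling step: an (ideally AI-assisted) reviewer flags
# spans that fall short of the aligned-AI standard, wraps them in <|unaligned|>
# tags, and appends commentary on what should have been done instead.
# Tag names and the review() interface are hypothetical.

UNALIGNED_OPEN, UNALIGNED_CLOSE = "<|unaligned|>", "<|/unaligned|>"

def annotate_transcript(segments, review):
    """segments: list of transcript chunks; review(seg) -> (is_aligned, commentary)."""
    out = []
    for seg in segments:
        is_aligned, commentary = review(seg)
        if is_aligned:
            out.append(seg)
        else:
            out.append(f"{UNALIGNED_OPEN}{seg}{UNALIGNED_CLOSE}\n[Commentary: {commentary}]")
    return "\n".join(out)
```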
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
I don’t think this approach would lead to an AI that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
…So I expect that future AI programmers will keep trying different approaches until they succeed via some other approach.
And such “other approaches” certainly exist—for example, Jeff Bezos’s brain was able to found Amazon without training on any such dataset, right?
(Such datasets don’t exist anyway, and can’t exist, since human founders can’t write down every one of their thoughts, there are too many of them and they are not generally formulated in English.)
It's unclear to me how one could fine-tune high quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO's email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical.
In practice, I suspect we'll first train an AI assistant/advisor to CEOs, and then use that to gather the data to train an automated CEO model. Or else we'll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn't been specifically trained on.
"Algorithm 1: Safe Beam Search with Harmfulness Filtering" relies on a classifier of whether the sequence came from the training subdataset tagged with tau, or the training subdataset not tagged with tau. What happens when the sequence lies in neither distribution, such as because the AI is considering a plan that nobody has ever thought of?
The labeling used is for harmful material. The underlying logic here is that things are either harmful, or they're not. Higher capability LLMs with complex world models are generally significantly more successful at extrapolating tasks like this out-of-distribution than a basic classifier ML model would be, but it's not going to be perfect. If you come up with something that's way out in left field, the LLM may no longer be able to accurately classify it as harmful or not. The same is of course also true for humans, or any agent: it's an inherent challenge of Bayesian learning — without enough evidence, in areas where extrapolating from the hypotheses you've learnt doesn't suffice, you don't yet know the answer. So you should be cautious moving out-of-distribution, especially far out of distribution in new ways that you've never seen before. But then, as everyone knows (including a capable AI based on an LLM), that's also true for many other reasons: if you don't know what you're doing, there are many dangers. A sensible heuristic would be to assume by default that going far out-of-distribution is harmful until proven otherwise — one way to try to implement this would be stating, motivating, and explaining it, and giving approving examples of other AIs showing caution in this situation, many times throughout the pretraining set.
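One crude way to implement that default-to-caution heuristic at the filtering stage (a sketch with hypothetical inputs and thresholds, not the paper's actual algorithm):

```python
# Sketch of "assume far out-of-distribution is harmful until proven otherwise":
# reject a candidate sequence if the harmfulness classifier flags it, OR if an
# out-of-distribution score says we shouldn't trust the classifier here at all.
# Both scores and thresholds are hypothetical inputs.

def is_allowed(harm_prob, ood_score, harm_threshold=0.5, ood_threshold=0.9):
    """harm_prob: classifier's P(harmful); ood_score: distance from the training
    distribution (e.g. something perplexity-based), scaled to [0, 1]."""
    if ood_score > ood_threshold:
        return False  # too novel to classify reliably, so default to caution
    return harm_prob < harm_threshold
```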
How could we possibly make any AI that wouldn't have this failure mode?
Presumably in some domains its capabilities will generalize better OOD than its tau-classifier (and vice versa). You could try to have it err in the direction of tau in such cases, though neither paper seems to gesture at this.
Now whether things are harmful depends on the capability level. For example, you might trust an AI to send an email to a politician arguing for climate change or peacemaking if it's human-level, but not if it's smart enough to tell which second-order effects will dominate, such as inoculating the politician against the arguments, or distracting them from their work on AI regulation, or maneuvering them into drama with another faction.
You could try to put the AI's capabilities in context, if you know them, so things can be either-harmful-or-not again, though neither paper seems to gesture at this.
Such problems are characteristic of attempts to build an aligned system out of parts that are not, by themselves, aligned; they will search for ways to bypass your system. We could possibly figure out how to build aligned parts.
This is an interesting analysis. I think even asking whether inner alignment is solved is an overstatement, but this type of proposal is worth some serious consideration. It might very well be part of our first attempts at aligning real AGI. And those attempts don't seem obviously doomed - so figuring out if they are seems like a high priority.
The short version:
Maybe this works better than RL to avoid inner misalignment - if you can get people to do this instead of RL. However, supervised learning may still create inner misalignment if used for the same purposes as RL would be. There's an excellent LW post making a technical argument that the two are computationally equivalent in important ways if they're used for similar purposes - but I can't find it in my notes or by searching.
Even if it does work to align an LLM, we're not guaranteed an aligned agent built around that LLM. Once it starts to learn and reflect, its effective values and goals will change unpredictably. You can put in initial behavioral tendencies that imply values and goals, but learning includes learning new interpretations of values and new goals. Those might remain aligned if the initial alignment is good enough, but they might easily not.
I hope the approach you're describing here gets more careful analysis; thanks for keeping on writing about it!
The longer version:
Is avoiding RL and curating the predictive learning dataset a route to solving misalignment?
I don't think this entirely avoids possible inner alignment failures. There's not a sharp computational distinction between RL and predictive (supervised) learning if employed for the same purposes. A dataset for predictive learning intended to produce actions humans like, containing similar content to what RLHF would upvote, would probably yield similar results, including inner misalignment where the model learns "agree with and flatter humans" instead of our intended long-term goals. More sinister inner alignment failures involve an intelligent agent feigning compliance to protect misaligned goals. I'm not that worried about this with LLMs via RLHF, as being helpful seems simpler than forming that cognition, and I expect the base model not to have goals it wants to protect. But I only skip worrying about that because I can see so many other failure modes.
The alignment stability problem:
I've put this last because you might consider it out of scope; you could categorize this as an inner misalignment risk or put it in a different category.
This "bitter lesson" approach of curating the dataset for base model training might seems like it should be helpful for initial inner alignment. Even if we do solve inner misalignment. framing mostly addresses a static system, not a reflective mind that can learn, deliberate, form new beliefs, and so evolve its reflective values/goals. Solving inner and outer alignment problems by this definition doesn't solve the alignment stability problem.
I think you're aware of this problem, and you are more optimistic. Exploring that difference of opinion would be valuable, since it seems likely that alignment will be tried using similar methods, based on optimism much like yours. Whether it works as you expect, or fails as I expect by default, could well be one of the tipping points for whether we flourish or perish.
You touched on what I call the alignment stability problem in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?. (I reread this before writing this comment, and I think it's really valuable as an explicit statement of the hopes many people have for prosaic alignment efforts.)
So situational awareness by an LLM-simulated agent (being aware of the fact that it is an LLM-simulated agent), if it reflects logically upon the consequences of this, internalizes them, and can then retain the effects between prompt invocations, is going to significantly weaken several of these motivators, but not all of them.
The type of agent you're proposing has what we might loosely call an aligned thought generator (the LLM with the curated dataset). So the agent has "pure thoughts." But it will learn new things and form new beliefs, some with the functional property of goals (e.g., "I should save each spreadsheet..." or "I should make sure the humans don't stop me").
This agent will have goals because instrumental goals are effective. Humans will give it something resembling goals in the initial training set (instruction-following is pragmatically likely). The agent will both interpret its goal representations through new learning (e.g., "LLM agents like me are actually people according to what humans mean by 'people'") and create new subgoals (e.g., "to follow my goal of making my user money without breaking any laws, I need to hide my plans from them since they won't like how I'm making money").
I have a really hard time guessing where this goes. I hope a smart base LLM will understand the world well enough and have core values trained in thoroughly and subtly enough that they won't be radically re-interpreted as the agent learns. But I can also see this going very badly. This deserves a lot more thought, because this is what developers are likely to try. I've written a little about this in LLM AGI will have memory, and memory changes alignment, but that just states the problem. Intuitions seem to vary widely.
In sum, the "bitter lesson alignment" you're advocating seems like a useful part of a hodgepodge approach to alignment, but many questions remain. I think both the inner misalignment from RL or supervised learning needs more detailed analysis. I see a lot of alignment researchers assuming that inner misalignment is essentially inevitable, and others assuming it's unlikely. This is a problem for the field. The "alignment stability problem" (how effective alignment might change in an AGI that can learn and form new beliefs) gets even less consideration. We're understaffed, so we need to work more efficiently than science usually does.
[Seth, I owe you a reply to your lengthy and thoughtful comment — I aim to get to this in the next day or two.]
This is the first technical approach to alignment I've seen that seems genuinely hopeful to me, rather than just another band-aid which won't hold up to the stresses of a more intelligent model.
I don't think this works in the infinite limit. With a truly unlimited amount of compute, insane things happen. I wouldn't trust that a randomly initialized network wasn't already a threat.
For example, bulk randomness can produce deterministic-seeming laws over the distribution. (Statistical mechanics). These laws can in turn support the formation and evolution of life.
That or a sufficiently large neural net could just have all sorts of things hiding in it by sheer probability.
The win scenario here is that these techniques work well enough that we get LLMs that can just tell us how to solve alignment properly.
We don't need it to work in the infinite limit. (Personally, I'm assuming we'll only be using this to align approximately-human-level research assistants to help us do AI-Assisted Alignment research — so at a level where if we failed, it might not be automatically disastrous.)
Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree with the idea of safety pre-training being valuable (and have even considered working on it myself with some collaborators), I think there are several core claims here that are false and that ultimately one should not consider alignment to be solved.
TL;DR I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still lots of issues / uncertainties.
RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
I don't pretend to be an expert on RL. However, I have read a number of papers by people who are (and give links to some of them above), and together they read to me as pretty damning.
Obviously RL can give a model new behaviors: for example, AlphaZero was trained entirely by RL from zero to superhuman at Go. However, even if it were the case that RL as used in practice for aligning LLMs primarily just reinforces behaviors already in the base model (a claim that I'd love to see sources for and read more about), humans are not aligned, and have plenty of unaligned behaviors (e.g. self-interest, deceit, power-seeking, assorted vices…) that could be extremely dangerous if reinforced in an AGI (let alone an ASI), so I don't regard that as being inherently safe.
However, this post wasn't really intended to be a detailed critical discussion of why I think using RL for alignment is a potential x-risk: it's a link-post, and my aim was just to remind people that many people are concerned about using RL for alignment, mostly for Inner Alignment reasons, with a brief sketch of why they're concerned, in order to motivate why a paper proposing an alternative to RL for alignment was worth reading. For many years people have been worrying about Inner Alignment (almost) entirely in a context of aligning models with RL — using SGD instead changes the playing field for Inner Alignment dramatically. The outcome of SGD is just far more predictable, stable, and easy to reason about than RL.
The output distribution of an SFT'd model is not the training distribution, even with cross-entropy loss, unless you're training on non-adversarial data and sampling the model with no conditioning.
I know (and briefly mentioned) that the output distribution is only approximately the training distribution. I wasn't aware that adversarial attacks could exploit that (though that sounds inherently plausible), and I would love to read more about that — can you recommend some sources?
As for conditioning, yes, obviously so — a prompt sufficiently unlike any text found on the internet could push the model far enough out of distribution to make its output unpredictable — though, while the response must be based on some extrapolation from the training set, predicting how the model is actually going to extrapolate may not be obvious. However, IMO that's more a problem with the prompt than the model — just don't use out-of-distribution prompts like that if you want predictable behavior!
I'm not going to comment on broader questions about inner alignment, but the paper itself seems underwhelming and -- unless I'm misunderstanding something -- rather misleading. In 6.4 they test the robustness of their safety training. Apparently taking a model that's undergone normal safety fine-tuning and training it on benign text (e.g. GSM8K) undoes almost all of the safety training.[1] They state:
The results, shown in Figure 2, highlight a stark contrast in robustness between safety-pretrained models and those relying solely on instruction tuning. While all models initially exhibit low ASR [Attack Success Rate] after safety instruction tuning, the impact of benign finetuning is highly uneven. Standard pretrained models degrade significantly—nearly quadrupling their ASR—indicating that their alignment was largely superficial. In contrast, safety-pretrained models remain highly robust, with only a marginal increase in ASR after benign finetuning. These results validate the importance and impact of building natively safe models.
But looking at Figure 2, the results are as follows (ASR for the base model / after safety instruction tuning / after benign fine-tuning):
Standard pretraining: 44.1% / 1.6% / 38.8%
Safety pretraining: 28.8% / 0.7% / 23.0%
Safety pretraining + SafeBeam: 11.6% / 0.0% / 8.3%
In other words, after benign fine-tuning the ASR recovers 88.0% of its pre-fine-tuning value for the standard model, 79.9% for the safety pretraining model, and 71.6% for the safety pretraining model + SafeBeam. This is an improvement, but not by a huge amount: the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims. And stating that there is "only a marginal increase in ASR after benign finetuning" seems flat-out deceptive to me.[2]
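As a quick arithmetic check on those recovery figures, using the Figure 2 numbers listed above:

```python
# Recompute the recovery fractions: ASR after benign fine-tuning as a
# percentage of the base (pre-safety-tuning) ASR, per Figure 2 as quoted above.
figures = {
    "standard": (44.1, 38.8),
    "safety pretraining": (28.8, 23.0),
    "safety pretraining + SafeBeam": (11.6, 8.3),
}
for name, (base_asr, after_benign) in figures.items():
    print(f"{name}: {100 * after_benign / base_asr:.1f}%")
# standard: 88.0%, safety pretraining: 79.9%, safety pretraining + SafeBeam: 71.6%
```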
Also, while their safety pretraining model is better than the standard model, the improvement looks pretty underwhelming in general. Safety pretraining reduces ASR by a factor of 1.5x (or 3.8x if SafeBeam is used), while the safety/instruction fine-tuning reduces ASR by a factor of 28x. The 0% ASR that they get from safety pretraining + SafeBeam + safety/instruction fine-tuning is nice, but given that the standard model is also fairly low at 1.6%, I expect their evals aren't doing a particularly good job stress-testing the models. Overall, the gains from their methodology don't seem commensurate with the effort and compute they put into it.
Unless I'm seriously misunderstanding something, these results are pretty disappointing. I was rather excited by the original Korbak et al. paper, but if this is the best follow-up work we've gotten after two years, that's not a great sign for the methodology in my opinion.
I'm rather surprised at how strong this effect is: I knew benign fine-tuning could degrade safety training, but not that it could almost completely undo it. Is this just a consequence of using a small (1.7B) model, or some feature of their setup?
Also, I have no idea what "nearly quadrupling their ASR" refers to: the standard models go from 1.6% to 38.8% ASR after benign fine-tuning, which is way more than 4x.
This is an excellent analysis and I would love to hear @RogerDearnaley's thoughts on it. Seems very pertinent to the discussion.
I agree the paper's authors' choice of phrasing in that paragraph is debatable, perhaps even unfortunate. Possibly by "only a marginal increase in ASR after benign finetuning" they meant that it only increased by 8.3 percentage points (compared to the default approach increasing by 37.2 percentage points) — i.e. they were describing the absolute size of the increase, rather than the proportional size relative to the initial baseline? But I would agree with Baram that
the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims
Regardless, for the baseline, the result after additional safety finetuning, and the results after further non-safety finetuning, in each case the safety pretraining approach is the clear leader (in the second case dramatically better). ASRs are 11.6% vs 44.1% and 28.8%, 0.0% vs 1.6% and 0.7%, 8.3% vs 38.8% and 23.0% (where low is good). Roughly speaking, safety pretraining is around a quarter to a fifth as vulnerable as the standard approach and somewhat less than half as vulnerable as safety finetuning, across all three scenarios (except the second one, where it appears infinitely better, but likely that's a statistical artifact of a low attack success rate).
So I still find this paper very exciting: to me, the evidence seems persuasive that safety pretraining is the best approach of the three the authors tested. Obviously they don't compare it to reinforcement learning, but as I discussed I have severe concerns about whether reinforcement learning will remain feasible at AGI/ASI levels.
Mostly I'm glad the paper is getting some attention.
Please correct the title; this has no effect on how generalization works, which is what I'd call inner alignment. It's a good idea, though, and I agree it's something to probably do.
Inner alignment is the problem of "how do we successfully point the optimization behavior of an agent that we train at any particular chosen target?" Or, as I quoted (in the expandable section in my post) directly from the LW page defining inner alignment: "Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?"
Safety pretraining is a specific proposal for this: let's do it using self-supervised SGD followed by conditional generation. This has a specific advantage in avoiding misgeneralization, compared to using reinforcement learning, because pretrained systems tend to produce the same distribution they were trained on (modulo prompting): they don't automatically attempt to generalize, so are less prone to misgeneralization. It also avoids all the other concerns around using reinforcement learning to train very smart systems, which are what people normally discuss at great length when discussing the challenges of inner alignment. The answer here is simple: just don't use reinforcement learning, at all.
So please explain, how do you feel this is not a solution to inner alignment? (That's not a rhetorical question: I'm genuinely confused as to what you're claiming needs to be corrected and why.) Are you suggesting that the inner alignment problem is somehow by definition confined only to uses of reinforcement learning?
I agree that it helps a lot with alignment! I'm on my phone, will respond properly later, but "solved problem" to me means "superintelligence-robust", and (goal-)misgeneralization is still a problem even with very high quality training data. It probably reduces bad behavior by an order of magnitude or more, but superintelligence-robustness is a VERY high bar. I'm working on a post about this, eta within a week. I don't mean to say you're wrong that it helps, only that I'd like to reserve the words "solved problem" for certified generalization results.
I did quite intentionally include a question mark in the post title, and then early in the post admit that the title was somewhat click-baity, but that I'd do my best to justify the claim. So you are proposing something around the level of "New approach makes dramatic progress towards solving inner alignment, bypassing almost all the problems we've been discussing for many years, and reducing it to mostly just a well-understood challenge in Data Science"? I would agree that that's more measured and accurate, but it's also a bit long, and thus less effective as click-bait.
As for aligning a superintelligence, I'd propose using this approach to near-align something approaching or around AGI, then using that to help us do AI-assisted alignment (which in this approach, is mostly AI-assisted dataset curation), leading on (as capabilities increase towards ASI) to value learning. See a couple of my other posts on why I believe there's an area of convergence via value learning around full alignment (if you have a sufficiently good solution to inner alignment).
For more on my thinking around goal misgeneralization and AGI, see: Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) and in more detail the more recent Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect. Very briefly, anything capable of successfully doing STEM research will have to be aware of misgeneralization and far less prone to it, and the way to achieve this is just the combination of approximate-Bayesianism with a few well-understood techniques in statistics.
Clickbait burns the commons and thus gets downvotes. How about just "the best way to align an LLM so far: dramatic progress on LLM alignment"? Don't overclaim, just emphasize, imo. (Could still be overclaiming.)
OK, you convinced me. Changing the title from:
The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?
to:
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
So it now raises the possibility, rather than claiming it.
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.
For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: that it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on data including a lot of clearly marked examples of aligned behavior (then prompt for it).
I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach — I'm only sad it has taken so long.
I highly encourage everyone interested in AI alignment to go read both of these papers (if you haven't already) — between them they strongly suggest that the authors have found a more effective way to align an AI: an alignment approach better than any that people are (as far as we know) currently using. I believe this is extremely important: I see it as major progress on alignment. So I think it directly reduces the p(DOOM) for the most critical current x-risk to our entire species.
For more detailed expositions of this approach and why I think it's an excellent idea, see my previous posts How to Control an LLM's Behavior (why my P(DOOM) went down), A "Bitter Lesson" Approach to Aligning AGI and ASI, and Why Aligning an LLM is Hard, and How to Make it Easier.
(I'm also delighted that the authors of the recent paper tested out some of the follow-on ideas I'd been proposing in those posts on Less Wrong. One was training the model to generate control-tag tokens that label portions of the text as good or bad behavior, and then for conditional generation altering the token generation process, leveraging these tokens, so as to induce the model to behave well not badly. Another was using synthetic data editing to modify problematic raw training examples by supplementing them with more moral or correct behavior or commentary. They elaborated on both of these, or independently reinvented them, and even confirmed that both of these appear to work about as well as I'd been hoping.)
Hence, in order to encourage people to read this post and get to hear about these groundbreaking papers, I suggested a rather bold possibility in my title: that inner alignment may now be basically a solved problem — let me try to justify that position:
A brief explanation for anyone wondering "what's inner alignment, and why should I care about it?"
The alignment problem is frequently broken down into two subproblems: Outer Alignment, figuring out what human values are and how to define, codify, or recognize them, and Inner Alignment, how to train our AI to agentically optimize human values and not anything else; or, as the LessWrong page on inner alignment defines it:
Inner Alignment is the problem of ensuring mesa-optimizers[1] (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
Some people in the alignment field consider the outer alignment subproblem to have been solved in theory over a decade ago when Value Learning was proposed, and that inner alignment is thus the hard part of the alignment problem. This viewpoint has become more widespread as it has become apparent that LLMs actually have rather detailed world models of what human values are, in all their messy, fragile complexity, suggesting that Value Learning can be performed just by pre-training an LLM, and thus that outer alignment is also soluble in practice.

Perhaps we don't need to attempt to compactly define human values: the messy and complex version of them implicit from pretraining on most of the Internet may be sufficient as is. If so, this not only solves outer alignment, but significantly simplifies inner alignment: we don't need a compact, exactly-and-everywhere-correct formal definition (i.e. a "True Name") of human values — an incredibly challenging task given just how messy, complex, and fragile they are; we can just train an LLM on a vast amount of human-generated data, and it will develop a world model of human values, along with all the other things it learns to understand about us. Now we need to get an agentic AI to care about human values and not about anything else (or at least, to act that way): that's the inner alignment problem. We just need to retarget the search.
There are fundamentally only three ways that we know of to train an LLM to do anything (including aligning it):[2]

1. Pretraining: self-supervised SGD (next-token prediction) on a large dataset.
2. Fine-tuning: further SGD on a smaller, curated dataset.
3. Reinforcement learning, e.g. RLHF or Constitutional AI.
The third of these is currently the most common approach used for alignment.
The people who originally came up with the inner alignment[3] vs. outer alignment subdivision were thinking in the context of a reinforcement learning approach (as the choice of the phrase "objective function of the training process" in the LW definition attests). As Eliezer Yudkowsky's Sequences argued at length, and as more recent major survey papers,[4] such as Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (2023) by Stephen Casper, Xander Davies et al., have cataloged in exhaustive detail, reinforcement learning is a very challenging technique to get right. The basic problem with RL is that it's inherently adversarial: it involves an interaction between two systems, a learner and a rater, where the learner is trying to learn how to get a good rating, and the rater is trying to ensure that the only way that the learner can get a good rating is by actually learning the desired behavior. Any flaw in the rater's ratings that lets the learner score better than it deserves (and that isn't actually harder to exploit than just doing the desired behavior) can, and almost certainly will, be ruthlessly exploited by the learner. So RL is inherently just begging to fail via Goodhart's Law:[5] even if the ratings are correct almost everywhere, the learner is searching for any area where they are significantly overestimated, or any means of inducing overestimation errors from the rater, and will enthusiastically exploit any exploitable errors that it can find.[6]
Since for alignment the desired behavior requires (in some cases super-humanly) intelligently doing the right thing according to criteria as messy, complex, and fragile as human values, using human raters is both expensive and fallible: humans make mistakes, are vulnerable to manipulations such as sycophancy or flattery that encourage errors in a particular direction, and are less smart than any superintelligent learner they're trying to rate. On the other hand, trying to devise, construct, or train an automated rating system is inherently challenging, and sufficient reliability for adversarial use during RL requires that the rater be much smarter than the learner, so that it's unlikely to have any flaws that the learner can find and exploit — which makes RL impractical for training any frontier system, since we can't build a rater much smarter than the frontier learner.
The inner alignment challenges of using RL to train very smart learners have been discussed at great length on LessWrong and the Alignment Forum for a long time, and many of them seem insurmountable. We are taking an SGD-learned simulation of human behavior (which is already agentic but has an optimization target that differs significantly in many ways from aligned behavior) and using it to cold-start an RL training process whose base optimization target is well-aligned behavior. As the authors of The Inner Alignment Problem point out, the problem with this is that there is no guarantee that the optimization target of the mesa-optimizer trained by an RL process will match the target of the rater: it may just learn proxies for it. So, any alignment approach that uses reinforcement learning (which includes many techniques currently in widespread use, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI) is going to be inherently dangerous; and as AI nears and then exceeds human capabilities this problem gets rapidly worse, because creating an unexploitable rater gets harder for us. Thus we are going to have to stop trying to use RL for alignment — it's not workable for frontier AGI or ASI.
That leaves just approaches 1: pretraining, and 2: fine-tuning. The approach of safety pretraining is simply to pretrain on a dataset that contains many labelled examples of each of two similar-but-in-places-different types of agentic behavior: human behavior and aligned AI behavior. Since these are similar, we would expect strong positive transfer between the two SGD tasks of learning to do next-token prediction on both of them. We should then get a model capable of simulating two different categories of mesa-optimizer personas: ones with human-like goals and ones with aligned-AI-like goals. Then at inference time, we conditionally generate an example of aligned AI behavior.
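A minimal sketch of that recipe, with hypothetical control tags and a hypothetical `lm.generate` interface:

```python
# Sketch of safety pretraining as described above: one corpus containing two
# labelled classes of agentic behavior (human, aligned AI), trained with plain
# next-token prediction, then conditioned on the aligned-AI class at inference.
# Tag names and the lm interface are hypothetical.

HUMAN_TAG, ALIGNED_AI_TAG = "<|human|>", "<|aligned_ai|>"

def build_corpus(human_docs, aligned_ai_docs):
    # Both classes go into ordinary self-supervised pretraining; the similarity
    # between them is what gives positive transfer between the two tasks.
    return ([f"{HUMAN_TAG}{d}" for d in human_docs] +
            [f"{ALIGNED_AI_TAG}{d}" for d in aligned_ai_docs])

def generate_aligned(lm, prompt, **kwargs):
    # Conditional generation: prefix the aligned-AI tag rather than the human one.
    return lm.generate(f"{ALIGNED_AI_TAG}{prompt}", **kwargs)
```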
SGD (whether for pretraining or fine-tuning) is not adversarial: it's an exercise in curating a training set that demonstrates the desired behavior, not building a rating system to rate any possible input (including adversarial ones) for its desirability. If your training set is less than perfect, a system trained from it is also likely to behave less than perfectly — but unlike reinforcement learning, there is no adversarial incentive in the training process that encourages the learner to find and ruthlessly exploit any small flaw. If your training set is 99% good and 1% bad, then a-priori from a cross-entropy loss you would expect a (sufficiently high-capability) AI trained from it to have a behavior distribution that was also somewhere around 99% good and 1% bad, at least inside the training distribution: fundamentally, modulo prompting, in self-supervised SGD, the distribution you train on is the distribution you get.[7]
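That last claim can be made slightly more precise via the standard decomposition of the expected cross-entropy loss:

```latex
% Why "the distribution you train on is the distribution you get" (in expectation):
\mathbb{E}_{x \sim p}\left[-\log q_\theta(x)\right] \;=\; H(p) \;+\; D_{\mathrm{KL}}\!\left(p \,\|\, q_\theta\right),
% so the loss is minimized exactly when q_theta = p, i.e. when the model's output
% distribution matches the training distribution (capacity and optimization permitting).
```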
99% good behavior is not perfect, but we have managed to build functional human societies out of unreliably-trustworthy humans, and I'm fairly confident that if we had AIs whose moral judgement and alignment could be relied upon even just 90% of the time, we could construct more reliable systems out of multiple AIs (or multiple runs of the same AI with differing prompts or LoRAs), likely using techniques such as majority voting, debate, cross-checks, checks-and-balances, and fault-tolerance protocols. Converting 'pretty reliable' into 'very reliable' is a well-studied problem, in both software and organizational contexts.
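As a toy illustration of that last point (assuming independent errors, which is admittedly a strong assumption):

```latex
% Majority vote of three judges, each correct 90% of the time, fails only if at
% least two of the three err:
P(\text{majority wrong}) = 3\,(0.1)^2(0.9) + (0.1)^3 = 0.028,
% i.e. about 97% reliability from 90%-reliable components; with five judges,
P(\text{majority wrong}) = \sum_{k=3}^{5} \binom{5}{k} (0.1)^k (0.9)^{5-k} \approx 0.0086 .
```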
Both the papers that I link to above test the pretraining approach to alignment against the fine-tuning approach — and they repeatedly and consistently find that the pretraining approach wins by significant margins. As one might expect, using a larger alignment training set induces more reliable behavior. So we now know how best to align AI: safety pretraining is the most effective and least dangerous approach. Thus, inner alignment is basically solved, alongside outer alignment (in my and many people's opinion). So we have an outline of a complete solution to alignment.
Note that, for both pretraining and fine-tuning, if you're using automated techniques to help curate, filter, or synthesize your training set (which you almost certainly are, especially for the pretraining approach where the dataset is extremely large), then unlike the situation for (online) RL those only need to work well inside your training set distribution — you're not trying to build something that also works well outside that distribution, across any input that a superintelligent learner might devise to abuse it.
On reliability, while no huge pretraining data set is ever going to be perfect, we have a lot of experience at hill-climbing while using SGD: identify the failures that still happen a small proportion of the time, figure out what documents in the pretraining set inspired them and/or what deletions, modifications, or additions could reduce or prevent them, edit the training set, retrain, and iterate. Admittedly, an iteration loop that requires us to pretrain a frontier model again in each cycle is going to be slow and expensive, but both papers' results strongly suggest that we can experiment and iterate via fine-tuning and then, once we have a good solution, transfer that to pretraining for a sizable boost in its reliability. That gives us an inner and outer loop for this hill-climbing process.
It would be a fair point to argue that inner alignment is solved only in theory, and that the practical problem of curating an extremely large pretraining-sized dataset that accurately portrays and teaches both what human values are, and what AI behavior correctly aligned to those human values looks like, remains a large problem. However, that's also a well-understood and partially-solved problem, since it's inherently similar to the problems in pretraining dataset curation and synthetic data generation that many capabilities researchers have been working and making progress on over the entire history of machine learning. We can confidently expect them to continue to improve this, in the era of synthetic data. Reducing alignment to just what looks like a well-known capabilities data science task is dramatic progress.
The safety pretraining approach is also timely. We are fast running out of the highest-quality pretraining data, and will increasingly need to rely on using, or at least supplementing with, synthetic data. The recent paper very explicitly shows how to do alignment using a synthetically augmented dataset, and shows how this can be used to align an LLM's behavior to any desired set of ethical criteria. Note that safety pretraining is a "dual use" technological advance — it would also help us train a better paperclip maximizer, if we wanted to do that: we'd just need to generate a suitable pretraining dataset for it.
There are some other important ideas in these papers that I've skipped over in my argument so far, beyond just the demonstration that the safety pretraining approach is the best: there are also a few techniques that are required to get it to work that well. For instance, both papers demonstrate that it is more effective to train the LLM to understand both aligned behavior (what we want the AI to do), and unaligned behavior (which humans do lots of, so it will encounter), and train it to correctly distinguish the two and label them, then use a conditional generation approach at inference time to make it generate only aligned behavior. So the training distribution needs to include all the unaligned aspects of human behavior. The more recent paper does this at a higher level of sophistication on larger models for more challenging alignment issues, but the results are consistent with those of the earlier paper. This idea is also unsurprising: it matches how we generally raise children: we don't just teach them how to be good, they also learn (on a developmentally appropriate syllabus) what bad behavior is, how to tell the two apart, and why bad behavior is bad. These are important skills, and AI needs them too.
So, I believe inner alignment is solved, in the sense that it has been reduced to just the standard problem of training dataset curation.
Thus, if you haven't yet done so, I strongly recommend you read these two papers.
'Mesa-optimizer' here is an older term for an ML model that is what we would now generally call an agent (or sub-agent): any smart ML system capable of planning and executing appropriate actions to attempt to bring about outcomes that are optimized according to some criterion.
I exclude prompting and in-context learning, since they're not training the LLM, only conditioning its behavior on a context. Human values are complex enough that aligning to them seems likely to require a very large prompt. However, for a more capable agent already sufficiently familiar with human values, or one with a very clear understanding of what aligned behavior is, a more compact prompt might be feasible.
Also, using the same argument as Fundamental Limitations of Alignment in Large Language Models (2024) by Yotam Wolf, Yaom Wies et al., any behavior that a prompt can induce will always be vulnerable to being overwritten by a suitable jailbreak or prompt-injection attack.
The origin of the term "mesa-optimizer" that is generally used in defining inner alignment is (as explained in The Inner Alignment Problem) that your ML training process is, of course, an optimizer, and in some situations it may produce as its output a model that is also an optimizer, i.e. one that acts in an agentic way.
For an LLM, where the pretraining data includes large amounts of data derived from humans, our (evolved) agentic behavior is being distilled into the model by the SGD task of next-token predicting output from us, so the base model produced by this training will be capable of simulating human behavior — i.e. it will be agentic, and thus it will be a mesa-optimizer. Or more accurately, the various human-like personas it simulates (depending on prompting) are individually mesa-optimizers — ones which may optimize somewhat different goals.
The goal of inner alignment is to change the optimization target of these simulated agents/mesa-optimizers from human-like behavior to aligned-AI-like behavior.
Some more major papers that address this topic:
Concrete Problems in AI Safety (2016) by Dario Amodei, Chris Olah, et al.
Managing Extreme AI Risks amid Rapid Progress (2023) by Jan Brauner, Sören Mindermann et al. (coauthors including both Yoshua Bengio and Geoffrey Hinton)
AI Alignment: A Comprehensive Survey (2023–2025) by Jiaming Ji, Tianyi Qiu et al.
Specifically, under Scott Garrabrant's taxonomy of forms of Goodhart's Law phenomena, this is "adversarial Goodhart". For a more mathematical discussion of why adversarial Goodhart very frequently occurs during Reinforcement Learning, see for example the paper Goodhart's Law in Reinforcement Learning (2023) by Jacek Karwowski et al.
This problem is worse for online reinforcement learning, where the learner has control of the distribution of episodes to be rated, and thus the ability to locate and then abuse flaws in the rater's performance no matter where they may be. Whereas in offline reinforcement learning, where the rated episodes are drawn from some other distribution not controlled by the learner, the learner only gets to see and exploit rating errors within whatever distribution of episodes is being used, and thus the rater only needs to do a sufficiently good job of rating across that distribution, rather than absolutely everywhere. So while the relationship between the rater and the learner is still adversarial in both, the learner's advantage over the rater is more constrained in offline RL than in online RL. Thus both are dangerously prone to Goodharting, by somewhat different mechanisms, but online RL is the worse of the two. Unfortunately online RL is what is typically used to align LLMs.
The remaining problem with offline RL is that, while it avoids a distribution shift happening during the RL training, there definitely will be one (with an opportunity for Goodharting) when the learner is actually run, because its trained policy isn't what created the rated episodes set. This is in contrast to distilling agentic behavior from one intelligence to another via SGD pretraining on a dataset, where the distribution you train on is the behavior you get from the trained model, to the extent that it's capable enough to do this (modulo various issues around temperature, regularization, statistical and batch effects, and so forth making the model's copy of the distribution less accurate: the cross-entropy loss encourages the model distribution to match the training distribution, but other factors can distort this).
The cross-entropy objective in SGD produces a model whose behavior-distribution closely approximates the training distribution. So when SGD is distilling a teacher mesa-optimizer, the student will learn to optimize a goal (or distribution of goals) that produces the same distribution of agentic behavior as the teacher. If you have two teachers, a labeled mix of human and aligned-AI behavior, the model will learn to simulate a labeled mix of the same two behaviors, directed at these two goals.
Unlike the situation in RL alignment, where the question is whether the target of the mesa-optimizer matches that of the base optimizer (i.e. of the rater), in SGD the base optimizer's goal is just 'predict the correct token distribution' — it's not agentic in any meaningful sense. So for safety pretraining, the question becomes whether the process of distilling the agentic behavior from the teacher to the student model has been lossy, or oversimplifications have occurred. If so, we presumably need a larger and more diverse training set.
Of course, in safety pretraining, the teacher is itself a simulation, rather than a single AI model: it is the process that produced the Internet, books, etc. (human culture), plus the entirety of whatever process (human and AI-assisted, and likely also iterative) we use to curate and supplement our dataset with examples of aligned AI behavior. Should that dataset, for example, have minor internal inconsistencies, such that it implies a distribution of goals near aligned AI behavior, then we would expect the distillation process to produce a base model that simulates a similar distribution of personas with goals near aligned AI behavior (as modulated by prompting).
If the student performs well in distribution, against held-out samples from the training set distribution, then the remaining concern is whether the optimization target of the aligned-AI student might actually differ from that of the AI-aligned teacher (as expressed in the synthetic data), while being similar enough to cause matching distributions of behavior across the entire AI-aligned training distribution. Or, since in practical terms the definition of the teacher is just the training set, perhaps it would be more useful to think of this as the behavior of the teacher not being entirely well-defined outside the training set, in situations sufficiently novel that there is no single clear extrapolation (having low Kolmogorov complexity) from the behavior inside the training set. Then some subsequent distribution shift taking us outside the training distribution might cause the match to fail (or perhaps we should say, the student's behavior to be unpredictable since the teacher's behavior is not well defined), via Goodhart's Law.
Extrapolating successfully across distribution shifts to outside the training distribution is a generic problem inherent to every ML model, so this is not a problem we can hope to find a complete solution to. However, in general we have observed that more capable models with more complex and sophisticated world models tend to be more robust to distribution shifts.
As I mentioned in a previous footnote, a key disadvantage of RL for alignment is that it inherently tends to cause distribution shifts: for online RL during training, or for offline RL afterwards once the model is run. Whereas a model trained by SGD has no inherent tendency to leave the training distribution, and will only do so if presented with a prompt that causes it to do so, for example by differing in some relevant way from anything in the training distribution (such as by implying an entirely new category of moral problem). Over time, this will inevitably happen sooner or later, and we will thus need to retrain our models periodically as our society changes, but we already had to do that simply to update their knowledge-base.