There’s a potential failure mode where RL (e.g. RLVR or otherwise) is necessary to get powerful capabilities. Right?
I for one don’t really care about whether the LLMs of May 2025 are aligned or not, because they’re not that capable. E.g. they would not be able to autonomously write a business plan and found a company and grow it to $1B/year of revenue. So something has to happen between now and then to make AI more capable. And I for one expect that “something” to involve RL, for better or worse (well, mostly worse). I’ve been saying that RL is necessary for powerful capabilities, i.e. (self)-supervised learning will only get you so far, since I think 2020, shortly after I got into AGI safety, and that prediction of mine is arguably being borne out in a small way by the rise of RLVR (and I personally expect a much bigger shift towards RL before we get superintelligence).
What’s your take on that? This post seems to only talk about RL in the context of alignment not capabilities, unless I missed it. I didn’t read the linked papers.
My concern is that, if you're using RL to train a frontier system that's human-level or above, whether for alignment or capabilities purposes, it will inevitably find ways to abuse flaws in our RL rating system. One exception might be if the RL is for some capability like reasoning to produce a proof that passes proof checking, where it might be possible to create a rating system that actually has no flaws to exploit. I don't see how we could do that for RL for alignment, however.
Right, but what I'm saying is that there's at least a possibility that RL is the only way to train a frontier system that's human-level or above.
In that case, if the alignment plan is "Well just don't use RL!", then that would be synonymous with "Well just don't build AGI at all, ever!". Right?
...And yeah sure, you can say that, but it would be misleading to call it a solution to inner alignment, if indeed that's the situation we're in.
Why would we have to use RL to do this? The problem of building a rater for RL closely resembles automating the labelling problem for preparing the dataset for SGD safety pretraining, except that for online RL the rater is harder: it has to run fast, it can't be human assisted, and it has to be able to cope with arbitrary adversarial shifts in the distribution being rated and do so well enough for it to not have exploitable flaws. A rater for (or at least attaching ratings to the episode set for) offline RL is less bad: it's an almost equivalent problem to labelling a dataset for SGD, just attaching a score rather than a binary classification. The primary difference is that for the safety pretraining approach the behavior we're training into the model is a classifier that labels behavior either good or bad, so isn't prone to Goodharting when you run it and ask for output from just one of the two categories, whereas for offline RL we're training a policy that tries to maximize the goodness rating, so is prone to Goodharting when the gradient towards the very "best" behavior leads it outside the training distribution. (The reason the SGD-trained classifier is safe is closely related to the satisficing approach to avoid Goodhart's Law.) So from the rating and stability point of view online RL is more challenging than offline RL, which is more challenging than safety pretraining SGD.
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples? Why do you think there is at least a possibility that RL could be the only way to train a frontier system that's human-level or above? I'm not currently seeing any potential advantage of RL — other than the fact it induces distribution shifts, during training for online RL, or after it for offline RL, so doesn't require us to already know the distribution we want: but these distribution shifts are exactly the source of its danger.
Let me give you a detailed prescription. For whatever RL training scheme you think we need, convert the rater for that to a satisficing binary classifier (classes: good enough vs not good enough behavior), and run it over a large training set of episodes matching the distribution of data you want your model to produce. Do SGD pretraining from that, and condition the generation from the result on the "good" label. My claim is that the output will be functionally equivalent to your RL-trained model, but its behavior will be more predictable in advance from the training set since there are no inherent distribution shifts. For there to be a possibility that RL could be the only way to train a frontier system that's human-level or above, either this would need to be false, or some aspect of the proposed input would need to not be computable/generatable for us, other than via the RL training process (whose output can clearly generate this). Which of these are you proposing might occur?
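To make that prescription concrete, here is a minimal sketch. All the names are hypothetical, and `rate_episode` stands in for whatever rater the RL scheme would otherwise have used:

```python
# Minimal sketch of the prescription above: convert the RL rater into a
# satisficing binary label, tag the episodes, train with ordinary next-token
# prediction, and condition on the "good" tag at inference time.
# All names here are hypothetical.

GOOD_TAG, BAD_TAG = "<|good|>", "<|bad|>"

def label_episode(episode_text, rate_episode, threshold):
    """Satisficing conversion: 'good enough' vs 'not good enough', not a score to maximize."""
    score = rate_episode(episode_text)  # the rater you would have used for RL
    tag = GOOD_TAG if score >= threshold else BAD_TAG
    return f"{tag}{episode_text}"

def build_pretraining_corpus(episodes, rate_episode, threshold=0.8):
    # The episodes should already match the distribution you want the model to produce.
    return [label_episode(e, rate_episode, threshold) for e in episodes]

# Training is then plain self-supervised SGD (cross-entropy next-token prediction)
# on the tagged corpus; no reward-maximization loop is involved.

def conditioned_prompt(user_prompt):
    # Inference-time conditioning: ask for generation from the "good" class only.
    return f"{GOOD_TAG}{user_prompt}"
```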
Let me give you a detailed prescription…
For example, people want AIs that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
People are trying to build such AIs as we speak, and I don’t expect them to quit until they succeed (or until we all die from their attempt).
And it’s possible—if human brains (or groups of human brains) can do this, so can AI algorithms. But human brains involve (model-based) RL. It’s an open question whether there exists a non-RL algorithm that can also do that. (LLMs as of today obviously cannot.)
I think the issue here is: “some aspect of the proposed input would need to not be computable/generatable for us”.
If the business is supposed to be new and out-of-the-box and innovative, then how do you generate on-distribution data? It’s gonna be something that nobody has ever tried before; “out-of-distribution” is part of the problem description, right?
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples?
Not all RL is “RL on [human] rated examples” in the way that you’re thinking of it. Jeff Bezos’s brain involves (model-based) RL, but it’s not like he tried millions of times to found millions of companies, and his brain gave a reward signal for the companies that grew to $1B/year revenue, and that’s how he wound up able to found and run Amazon. In fact Amazon was the first company he ever founded.
Over the course of my lifetime I’ve had a billion or so ideas pass through my head. My own brain RL system was labeling these ideas as good or bad (motivating or demotivating), and this has led to my learning over time to have more good ideas (“good” according to certain metrics in my own brain reward function). If a future AI was built like that, having a human hand-label the AI’s billion-or-so “thoughts” as good or bad would not be viable. (Further discussion in §1.1 here). For one thing, there’s too many things to label. For another thing, the ideas-to-be-rated are inscrutable from the outside.
I’m also still curious how you think about RLVR. Companies are using RLVR right now to make their models better at math. Do you have thoughts on how they can make their models equally good at math without using RLVR, or any kind of RL, or anything functionally equivalent to RL?
Also, here’s a challenge which IMO requires RL [Update: oops, bad example, see Zack’s response]. I have just invented a chess variant, Steve-chess. It’s just like normal chess except that the rooks and bishops can only move up to four spaces at a time. I want to make a computer play that chess variant much better than any unassisted human ever will. I only want to spend a few person-years of R&D effort to make that happen (which rules out laborious hand-coding of strategy rules).
That’s the Steve-chess challenge. I can think of one way to solve the Steve-chess challenge: the AlphaZero approach. But that involves RL. Can you name any way to solve this same challenge without RL (or something functionally equivalent to RL)?
Can you name any way to solve [chess but with rooks and bishops not being able to move more than four squares at a time] without RL (or something functionally equivalent to RL)?
This isn't even hard. Just take a pre-2017 chess engine, and edit the rules code so that rooks and bishops can only move four spaces. You're probably already done: the core minimax search still works, α–β pruning still works, quiescence still works, &c. To be fair, the heuristic evaluation function won't be correct, but you could just ... make bishops and rooks be respectively worth 2.5 and 3.5 points instead of the traditional 3 and 5? Even if my guess at those point values is wrong, that should still be easily superhuman with 2017 algorithms on 2017 hardware. (Stockfish didn't incorporate neural networks until 2020.)
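For concreteness, the two edits might look roughly like this. It's a toy sketch against a hypothetical engine interface (not Stockfish's actual internals), with the piece values being the guesses from the paragraph above:

```python
# Toy sketch of the Steve-chess edits: cap rook/bishop slides at four squares,
# and adjust the material values the evaluation uses. The board interface
# (on_board, occupied, pieces) is hypothetical.

MAX_SLIDE = {"rook": 4, "bishop": 4}  # the Steve-chess rule change

PIECE_VALUE = {  # guessed values: bishops 2.5 and rooks 3.5 instead of 3 and 5
    "pawn": 1.0, "knight": 3.0, "bishop": 2.5, "rook": 3.5, "queen": 9.0,
}

def sliding_moves(board, square, piece, directions):
    """Generate sliding moves, truncated at the Steve-chess four-square limit."""
    limit = MAX_SLIDE.get(piece, 7)  # 7 squares = effectively unlimited on an 8x8 board
    for d in directions:
        for step in range(1, limit + 1):
            target = square + step * d
            if not board.on_board(target):
                break
            yield (square, target)
            if board.occupied(target):  # stop sliding once blocked
                break

def material_eval(board):
    """Plain material count with the adjusted piece values (positive favors White)."""
    return sum(PIECE_VALUE[p.kind] * (1 if p.is_white else -1) for p in board.pieces())
```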
Incidentally, there are a great many variant versions of chess with different piece-move rules (collectively sometimes called "fairy chess"), and I think even quite a lot of collected games for some of the more popular rule variants. Training an AI to play many types of fairy chess, and even arbitrary new just-invented ones, might be an interesting project that covers some aspects of generalizing out-of-distribution and positive transfer. A suitably-edited-for-the-variant version of Stockfish makes a pretty strong baseline for this. Using AlphaZero per variant is another obvious baseline.
Hmm, you’re probably right.
But I think my point would have worked if I had suggested a modified version of Go rather than chess?
There's not a lot of scope for aligned/unaligned behavior in Go (or chess): it's a zero-sum game, so I don't see how any Go plays could be labeled as aligned or unaligned. How about some complex tactical or simulation game that actually has a scope for aligned/unaligned or at least moral/immoral behavior? Ideally one where you are roleplaying as an AI, so aligned behavior is appropriate, or at least doing some sort of resource management or strategy task that might get assigned to an AI.
I was trying to argue in favor of:
CLAIM: there are AI capabilities things that cannot be done without RL training (or something functionally equivalent to RL training).
It seems to me that, whether this claim is true or false, it has nothing to do with alignment, right?
There are certainly things that it's easier to do with RL — whether it's ever an absolute requirement I'm less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that's the case I'm not familiar with the details — I'd love references to anything relevant to this, if anyone has them.
My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it's basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution so avoids Goodharting issues.
Suppose we lived in a spatially-finite universe with simple deterministic laws of physics that we have fully colonized, in which we can run a computation for any finite number of steps that we can specify. (For example, everyone agrees to hibernate until it's done.) Let's use it to play Go.
Run all ~2^2^33 programs ("contestants") that fit in a gigabyte against each other from all ~3^19^2 possible positions. Delete all contestants that use more than 2^2^2^2^2^100 CPU cycles on any one move. For every position from which some contestant wins every match, delete every contestant that doesn't win every match.
This enforces ~perfect play. Is it safe to pick a surviving contestant pseudorandomly? Not clearly: Consider the following reasonably-common kind of contestant.
What logical consequences might it prove? "1=1" for one, but that will count towards every O. "Source code C, given input I, produces output O." for another, but that's a pretty long one. If it would be the survivor in line to be pseudorandomly selected, most consequences of its decision are via the effects on our universe!
So if it predicts that it would be selected[1], it will output perfect play to survive, and then keep being consequentialist about any choice between two winning strategies - for example, it might spell out a message if we would watch the winner play, or it could steer our experiment's statistics to inspire a follow-up experiment that will, due to a bug, run forever rather than ever waking us up from hibernation.
Or by some tweaking of 2., if it assumes that it would be selected because otherwise the choice of O doesn't matter,
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
The SGD safety pretraining equivalent would be to include that transcript in the pretraining dataset (or, since such data is very rare and useful/high quality, perhaps an entrepreneurship-specific fine-tuning dataset). So far, very similar. You would also (likely AI-assisted) look through all of the transcript, and if you located any portions where the behavior was less wise or less moral/aligned than the behavior we'd like to see from an aligned AI-entrepreneur, label that portion with <|unaligned|> tags (or whatever), and perhaps also supplement it with commentary on subjects like why it is less wise/moral/aligned than the standards for an aligned AI, what should have been done instead, and speculations around the likely results of those counterfactual actions.
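As a sketch of what that labeling step might look like (the tag format and the `review` helper are hypothetical, just to illustrate the shape of the data):

```python
# Sketch of the transcript-labeling step: an (ideally AI-assisted) reviewer flags
# spans that fall short of the aligned-AI standard, wraps them in <|unaligned|>
# tags, and appends commentary on what should have been done instead.
# Tag names and the review() interface are hypothetical.

UNALIGNED_OPEN, UNALIGNED_CLOSE = "<|unaligned|>", "<|/unaligned|>"

def annotate_transcript(segments, review):
    """segments: list of transcript chunks; review(seg) -> (is_aligned, commentary)."""
    out = []
    for seg in segments:
        is_aligned, commentary = review(seg)
        if is_aligned:
            out.append(seg)
        else:
            out.append(f"{UNALIGNED_OPEN}{seg}{UNALIGNED_CLOSE}\n[Commentary: {commentary}]")
    return "\n".join(out)
```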
Suppose we had a CoT-style transcript of every thought, word, email and action by the founder of a successful startup over the course of several years of its founding, and used this for RL: then we'd get a reward signal every time they landed a funding round, sales went up significantly, a hire they made or contract they signed clearly worked out well, and so forth — not enough training data by itself for RL, but perhaps a useful contribution.
I don’t think this approach would lead to an AI that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
…So I expect that future AI programmers will keep trying different approaches until they succeed via some other approach.
And such “other approaches” certainly exist—for example, Jeff Bezos’s brain was able to found Amazon without training on any such dataset, right?
(Such datasets don’t exist anyway, and can’t exist, since human founders can’t write down every one of their thoughts, there are too many of them and they are not generally formulated in English.)
It's unclear to me how one could fine-tune high quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO's email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical.
In practice, I suspect we'll first train an AI assistant/advisor to CEOs, and then use that to gather the data to train an automated CEO model. Or else we'll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn't been specifically trained on.
"Algorithm 1: Safe Beam Search with Harmfulness Filtering" relies on a classifier of whether the sequence came from the training subdataset tagged with tau, or the training subdataset not tagged with tau. What happens when the sequence lies in neither distribution, such as because the AI is considering a plan that nobody has ever thought of?
The labeling used is for harmful material. The underlying logic here is that things are either harmful, or they're not. Higher capability LLMs with complex world models are generally significantly more successful at extrapolating tasks like this out-of-distribution than a basic classifier ML model would be, but it's not going to be perfect. If you come up with something that's way out in left field, the LLM may no longer be able to accurately classify it as harmful or not. The same is of course also true for humans, or any agent: it's an inherent challenge of Bayesian learning — without enough evidence, in areas where extrapolating from the hypotheses you've learnt doesn't suffice, you don't yet know the answer. So you should be cautious moving out-of-distribution, especially far out of distribution in new ways that you've never seen before. But then, as everyone knows (including a capable AI based on an LLM), that's also true for many other reasons: if you don't know what you're doing, there are many dangers. A sensible heuristic would be to assume by default that going far out-of-distribution is harmful until proven otherwise — one way to try to implement this would be stating, motivating, and explaining it, and giving approving examples of other AIs showing caution in this situation, many times throughout the pretraining set.
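One crude way to implement that default-to-caution heuristic at the filtering stage (a sketch with hypothetical inputs and thresholds, not the paper's actual algorithm):

```python
# Sketch of "assume far out-of-distribution is harmful until proven otherwise":
# reject a candidate sequence if the harmfulness classifier flags it, OR if an
# out-of-distribution score says we shouldn't trust the classifier here at all.
# Both scores and thresholds are hypothetical inputs.

def is_allowed(harm_prob, ood_score, harm_threshold=0.5, ood_threshold=0.9):
    """harm_prob: classifier's P(harmful); ood_score: distance from the training
    distribution (e.g. something perplexity-based), scaled to [0, 1]."""
    if ood_score > ood_threshold:
        return False  # too novel to classify reliably, so default to caution
    return harm_prob < harm_threshold
```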
How could we possibly make any AI that wouldn't have this failure mode?
Presumably in some domains its capabilities will generalize better OOD than its tau-classifier (and vice versa). You could try to have it err in the direction of tau in such cases, though neither paper seems to gesture at this.
Now whether things are harmful depends on the capability level. For example, you might trust an AI to send an email to a politician arguing for climate change or peacemaking if it's human-level, but not if it's smart enough to tell which second-order effects will dominate, such as inoculating the politician against the arguments, or distracting them from their work on AI regulation, or maneuvering them into drama with another faction.
You could try to put the AI's capabilities in context, if you know them, so things can be either-harmful-or-not again, though neither paper seems to gesture at this.
Such problems are characteristic of attempts to build an aligned system out of parts that are not, by themselves, aligned; they will search for ways to bypass your system. We could possibly figure out how to build aligned parts.
This is an interesting analysis. I think even asking whether inner alignment is solved is an overstatement, but this type of proposal is worth some serious consideration. It might very well be part of our first attempts at aligning real AGI. And those attempts don't seem obviously doomed - so figuring out if they are seems like a high priority.
The short version:
Maybe this works better than RL to avoid inner misalignment - if you can get people to do this instead of RL. However, supervised learning may still create inner misalignment if used for the same purposes as RL would be. There's an excellent LW post making a technical argument that the two are computationally equivalent in important ways if they're used for similar purposes - but I can't find it in my notes or by searching.
Even if it does work to align an LLM, we're not guaranteed an aligned agent built around that LLM. Once it starts to learn and reflect, its effective values and goals will change unpredictably. You can put in initial behavioral tendencies that imply values and goals, but learning includes learning new interpretations of values and new goals. Those might remain aligned if the initial alignment is good enough, but they might easily not.
I hope the approach you're describing here gets more careful analysis; thanks for keeping on writing about it!
The longer version:
Is avoiding RL and curating the predictive learning dataset a route to solving misalignment?
I don't think this entirely avoids possible inner alignment failures. There's not a sharp computational distinction between RL and predictive (supervised) learning if employed for the same purposes. A dataset for predictive learning intended to produce actions humans like, containing similar content to what RLHF would upvote, would probably yield similar results, including inner misalignment where the model learns "agree with and flatter humans" instead of our intended long-term goals. More sinister inner alignment failures involve an intelligent agent feigning compliance to protect misaligned goals. I'm not that worried about this with LLMs via RLHF, as being helpful seems simpler than forming that cognition, and I expect the base model not to have goals it wants to protect. But I only skip worrying about that because I can see so many other failure modes.
The alignment stability problem:
I've put this last because you might consider it out of scope; you could categorize this as an inner misalignment risk or put it in a different category.
This "bitter lesson" approach of curating the dataset for base model training might seems like it should be helpful for initial inner alignment. Even if we do solve inner misalignment. framing mostly addresses a static system, not a reflective mind that can learn, deliberate, form new beliefs, and so evolve its reflective values/goals. Solving inner and outer alignment problems by this definition doesn't solve the alignment stability problem.
I think you're aware of this problem, and you are more optimistic. Exploring that difference of opinion would be valuable, since it seems likely that alignment will be tried using similar methods, based on optimism much like yours. Whether it works as you expect, or fails as I expect by default, could well be one of the tipping points for whether we flourish or perish.
You touched on what I call the alignment stability problem in Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?. (I reread this before writing this comment, and I think it's really valuable as an explicit statement of the hopes many people have for prosaic alignment efforts.)
So situational awareness by an LLM-simulated agent (being aware of the fact that it is an LLM-simulated agent), if it reflects logically upon the consequences of this, internalizes them, and can then retain the effects between prompt invocations, is going to significantly weaken several of these motivators, but not all of them.
The type of agent you're proposing has what we might loosely call an aligned thought generator (the LLM with the curated dataset). So the agent has "pure thoughts." But it will learn new things and form new beliefs, some with the functional property of goals (e.g., "I should save each spreadsheet..." or "I should make sure the humans don't stop me").
This agent will have goals because instrumental goals are effective. Humans will give it something resembling goals in the initial training set (instruction-following is pragmatically likely). The agent will both interpret its goal representations through new learning (e.g., "LLM agents like me are actually people according to what humans mean by 'people'") and create new subgoals (e.g., "to follow my goal of making my user money without breaking any laws, I need to hide my plans from them since they won't like how I'm making money").
I have a really hard time guessing where this goes. I hope a smart base LLM will understand the world well enough and have core values trained in thoroughly and subtly enough that they won't be radically re-interpreted as the agent learns. But I can also see this going very badly. This deserves a lot more thought, because this is what developers are likely to try. I've written a little about this in LLM AGI will have memory, and memory changes alignment, but that just states the problem. Intuitions seem to vary widely.
In sum, the "bitter lesson alignment" you're advocating seems like a useful part of a hodgepodge approach to alignment, but many questions remain. I think both the inner misalignment from RL or supervised learning needs more detailed analysis. I see a lot of alignment researchers assuming that inner misalignment is essentially inevitable, and others assuming it's unlikely. This is a problem for the field. The "alignment stability problem" (how effective alignment might change in an AGI that can learn and form new beliefs) gets even less consideration. We're understaffed, so we need to work more efficiently than science usually does.
[Seth, I owe you a reply to your lengthy and thoughtful comment — I aim to get to this in the next day or two.]
This is the first technical approach to alignment I've seen that seems genuinely hopeful to me, rather than just another band-aid which won't hold up to the stresses of a more intelligent model.
I don't think this works in the infinite limit. With a truly unlimited amount of compute, insane things happen. I wouldn't trust that a randomly initialized network wasn't already a threat.
For example, bulk randomness can produce deterministic-seeming laws over the distribution. (Statistical mechanics). These laws can in turn support the formation and evolution of life.
That or a sufficiently large neural net could just have all sorts of things hiding in it by sheer probability.
The win scenario here is that these techniques work well enough that we get LLMs that can just tell us how to solve alignment properly.
We don't need it to work in the infinite limit. (Personally, I'm assuming we'll only be using this to align approximately-human-level research assistants to help us do AI-Assisted Alignment research — so at a level where if we failed, it might not be automatically disastrous.)
Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree with the idea of safety pre-training being valuable (and have even considered working on it myself with some collaborators), I think there are several core claims here that are false and that ultimately one should not consider alignment to be solved.
TL;DR I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still lots of issues / uncertainties.
RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
I don't pretend to be an expert on RL. However, I have read a number of papers by people who are (and give links to some of them above), and together they read to me as pretty damning.
Obviously RL can give a model new behaviors: for example, AlphaZero was trained entirely by RL from zero to superhuman at Go. However, even if it were the case that RL as used in practice for aligning LLMs primarily just reinforces behaviors already in the base model (a claim that I'd love to see sources for and read more about), humans are not aligned, and have plenty of unaligned behaviors (e.g. self-interest, deceit, power-seeking, assorted vices…) that could be extremely dangerous if reinforced in an AGI (let alone an ASI), so I don't regard that as being inherently safe.
However, this post wasn't really intended to be a detailed critical discussion of why I think using RL for alignment is a potential x-risk: it's a link-post, and my aim was just to remind people that many people are concerned about using RL for alignment, mostly for Inner Alignment reasons, with a brief sketch of why they're concerned, in order to motivate why a paper proposing an alternative to RL for alignment was worth reading. For many years people have been worrying about Inner Alignment (almost) entirely in a context of aligning models with RL — using SGD instead changes the playing field for Inner Alignment dramatically. The outcome of SGD is just far more predictable, stable, and easy to reason about than RL.
The output distribution of an SFT'd model is not the training distribution, even with cross-entropy loss, unless you're training on non-adversarial data and sampling the model with no conditioning.
I know (and briefly mentioned) that the output distribution is only approximately the training distribution. I wasn't aware that adversarial attacks could exploit that (though that sounds inherently plausible), and I would love to read more about that — can you recommend some sources?
As for conditioning, yes, obviously so — a prompt sufficiently unlike any text found on the internet could push the model far enough out of distribution to make its output unpredictable — though, while the response must be based on some extrapolation from the training set, predicting how the model is actually going to extrapolate may not be obvious. However, IMO that's more a problem with the prompt than the model — just don't use out-of-distribution prompts like that if you want predictable behavior!
I'm not going to comment on broader questions about inner alignment, but the paper itself seems underwhelming and -- unless I'm misunderstanding something -- rather misleading. In 6.4 they test the robustness of their safety training. Apparently taking a model that's undergone normal safety fine-tuning and training it on benign text (e.g. GSM8K) undoes almost all of the safety training.[1] They state:
The results, shown in Figure 2, highlight a stark contrast in robustness between safety-pretrained models and those relying solely on instruction tuning. While all models initially exhibit low ASR [Attack Success Rate] after safety instruction tuning, the impact of benign finetuning is highly uneven. Standard pretrained models degrade significantly—nearly quadrupling their ASR—indicating that their alignment was largely superficial. In contrast, safety-pretrained models remain highly robust, with only a marginal increase in ASR after benign finetuning. These results validate the importance and impact of building natively safe models.
But looking at Figure 2, the results are as follows (ASR for the base model / after safety instruction tuning / after benign fine-tuning):
Standard pretraining: 44.1% / 1.6% / 38.8%
Safety pretraining: 28.8% / 0.7% / 23.0%
Safety pretraining + SafeBeam: 11.6% / 0.0% / 8.3%
In other words, after benign fine-tuning the ASR recovers 88.0% of its pre-fine-tuning value for the standard model, 79.9% for the safety pretraining model, and 71.6% for the safety pretraining model + SafeBeam. This is an improvement, but not by a huge amount: the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims. And stating that there is "only a marginal increase in ASR after benign finetuning" seems flat-out deceptive to me.[2]
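As a quick arithmetic check on those recovery figures, using the Figure 2 numbers listed above:

```python
# Recompute the recovery fractions: ASR after benign fine-tuning as a
# percentage of the base (pre-safety-tuning) ASR, per Figure 2 as quoted above.
figures = {
    "standard": (44.1, 38.8),
    "safety pretraining": (28.8, 23.0),
    "safety pretraining + SafeBeam": (11.6, 8.3),
}
for name, (base_asr, after_benign) in figures.items():
    print(f"{name}: {100 * after_benign / base_asr:.1f}%")
# standard: 88.0%, safety pretraining: 79.9%, safety pretraining + SafeBeam: 71.6%
```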
Also, while their safety pretraining model is better than the standard model, the improvement looks pretty underwhelming in general. Safety pretraining reduces ASR by a factor of 1.5x (or 3.8x if SafeBeam is used), while the safety/instruction fine-tuning reduces ASR by a factor of 28x. The 0% ASR that they get from safety pretraining + SafeBeam + safety/instruction fine-tuning is nice, but given that the standard model is also fairly low at 1.6%, I expect their evals aren't doing a particularly good job stress-testing the models. Overall, the gains from their methodology don't seem commensurate with the effort and compute they put into it.
Unless I'm seriously misunderstanding something, these results are pretty disappointing. I was rather excited by the original Korbak et al. paper, but if this is the best follow-up work we've gotten after two years, that's not a great sign for the methodology in my opinion.
I'm rather surprised at how strong this effect is: I knew benign fine-tuning could degrade safety training, but not that it could almost completely undo it. Is this just a consequence of using a small (1.7B) model, or some feature of their setup?
Also, I have no idea what "nearly quadrupling their ASR" refers to: the standard models go from 1.6% to 38.8% ASR after benign fine-tuning, which is way more than 4x.
This is an excellent analysis and I would love to hear @RogerDearnaley's thoughts on it. Seems very pertinent to the discussion.
I agree the paper's authors' choice of phrasing in that paragraph is debatable, perhaps even unfortunate. Possibly by "only a marginal increase in ASR after benign finetuning" they meant that it only increased by 8.3 percentage points (compared to the default approach increasing by 37.2 percentage points) — i.e. they were describing the absolute size of the increase, rather than the proportional size relative to the initial baseline? But I would agree with Baram that
the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims
Regardless, for the baseline, the result after additional safety finetuning, and the results after further non-safety finetuning, in each case the safety pretraining approach is the clear leader (in the second case dramatically better). ASRs are 11.6% vs 44.1% and 28.8%, 0.0% vs 1.6% and 0.7%, 8.3% vs 38.8% and 23.0% (where low is good). Roughly speaking, safety pretraining is around a quarter to a fifth as vulnerable as the standard approach and somewhat less than half as vulnerable as safety finetuning, across all three scenarios (except the second one, where it appears infinitely better, but likely that's a statistical artifact of a low attack success rate).
So I still find this paper very exciting: to me, the evidence seems persuasive that safety pretraining is the best approach of the three the authors tested. Obviously they don't compare it to reinforcement learning, but as I discussed I have severe concerns about whether reinforcement learning will remain feasible at AGI/ASI levels.
Mostly I'm glad the paper is getting some attention.
Please correct the title; this has no effect on how generalization works, which is what I'd call inner alignment. It's a good idea, though, and I agree it's something to probably do.
Inner alignment is the problem of "how do we successfully point the optimization behavior of an agent that we train at any particular chosen target?" Or, as I quoted (in the expandable section in my post) directly from the LW page defining inner alignment: "Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?"
Safety pretraining is a specific proposal for this: let's do it using self-supervised SGD followed by conditional generation. This has a specific advantage in avoiding misgeneralization, compared to using reinforcement learning, because pretrained systems tend to produce the same distribution they were trained on (modulo prompting): they don't automatically attempt to generalize, so are less prone to misgeneralization. It also avoids all the other concerns around using reinforcement learning to train very smart systems, which are what people normally discuss at great length when discussing the challenges of inner alignment. The answer here is simple: just don't use reinforcement learning, at all.
So please explain, how do you feel this is not a solution to inner alignment? (That's not a rhetorical question: I'm genuinely confused as to what you're claiming needs to be corrected and why.) Are you suggesting that the inner alignment problem is somehow by definition confined only to uses of reinforcement learning?
I agree that it helps a lot with alignment! I'm on my phone, will respond properly later, but "solved problem" to me means "superintelligence-robust", and (goal-)misgeneralization is still a problem even with very high quality training data. It probably reduces bad behavior by an order of magnitude or more, but superintelligence-robustness is a VERY high bar. I'm working on a post about this, eta within a week. I don't mean to say you're wrong that it helps, only that I'd like to reserve the words "solved problem" for certified generalization results.
I did quite intentionally include a question mark in the post title, and then early in the post admit that the title was somewhat click-baity, but that I'd do my best to justify the claim. So you are proposing something around the level of "New approach makes dramatic progress towards solving inner alignment, bypassing almost all the problems we've been discussing for many years, and reducing it to mostly just a well-understood challenge in Data Science"? I would agree that that's more measured and accurate, but it's also a bit long, and thus less effective as click-bait.
As for aligning a superintelligence, I'd propose using this approach to near-align something approaching or around AGI, then using that to help us do AI-assisted alignment (which in this approach, is mostly AI-assisted dataset curation), leading on (as capabilities increase towards ASI) to value learning. See a couple of my other posts on why I believe there's an area of convergence via value learning around full alignment (if you have a sufficiently good solution to inner alignment).
For more on my thinking around goal misgeneralization and AGI, see: Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) and in more detail the more recent Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect. Very briefly, anything capable of successfully doing STEM research will have to be aware of misgeneralization and far less prone to it, and the way to achieve this is just the combination of approximate-Bayesianism with a few well-understood techniques in statistics.
Clickbait burns the commons and thus gets downvotes. How about just "the best way to align an LLM so far: dramatic progress on LLM alignment"? Don't overclaim, just emphasize, imo. (Could still be overclaiming.)
OK, you convinced me. Changing the title from:
The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?
to:
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
So it now raises the possibility, rather than claiming it.
This is a link-post for a new paper I read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.
For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: that it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on data including a lot of clearly marked examples of aligned behavior (then prompt for it).
I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach — I'm only sad it has taken so long.
I highly encourage everyone interested in AI alignment to go read both of these papers (if you haven't already) — between them they strongly suggest that the authors have found a more effective way to align an AI: an alignment approach better than any that people are (as far as we know) currently using. I believe this is extremely important: I see it as major progress on alignment. So I think it directly reduces the p(DOOM) for the most critical current x-risk to our entire species.
For more detailed expositions of this approach and why I think it's an excellent idea, see my previous posts How to Control an LLM's Behavior (why my P(DOOM) went down), A "Bitter Lesson" Approach to Aligning AGI and ASI, and Why Aligning an LLM is Hard, and How to Make it Easier.
(I'm also delighted that the authors of the recent paper tested out some of the follow-on ideas I'd been proposing in those posts on Less Wrong. One was training the model to generate control-tag tokens that label portions of the text as good or bad behavior, and then for conditional generation altering the token generation process, leveraging these tokens, so as to induce the model to behave well not badly. Another was using synthetic data editing to modify problematic raw training examples by supplementing them with more moral or correct behavior or commentary. They elaborated on both of these, or independently reinvented them, and even confirmed that both of these appear to work about as well as I'd been hoping.)
Hence, in order to encourage people to read this post and get to hear about these groundbreaking papers, I suggested a rather bold possibility in my title: that inner alignment may now be basically a solved problem — let me try to justify that position:
A brief explanation for anyone wondering "what's inner alignment, and why should I care about it?"
The alignment problem is frequently broken down into two subproblems: Outer Alignment, figuring out what human values are and how to define, codify, or recognize them, and Inner Alignment, how to train our AI to agentically optimize human values and not anything else; or, as the LessWrong page on inner alignment defines it:
Inner Alignment is the problem of ensuring mesa-optimizers[1] (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.
Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?
Some people in the alignment field consider the outer alignment subproblem to have been solved in theory over a decade ago when Value Learning was proposed, and that inner alignment is thus the hard part of the alignment problem. This viewpoint has become more widespread as it has become apparent that LLMs actually have rather detailed world models of what human values are, in all their messy, fragile complexity, suggesting that Value Learning can be performed just by pre-training an LLM, and thus that outer alignment is also soluble in practice.

Perhaps we don't need to attempt to compactly define human values: the messy and complex version of them implicit from pretraining on most of the Internet may be sufficient as is. If so, this not only solves outer alignment, but significantly simplifies inner alignment: we don't need a compact, exactly-and-everywhere-correct formal definition (i.e. a "True Name") of human values — an incredibly challenging task given just how messy, complex, and fragile they are; we can just train an LLM on a vast amount of human-generated data, and it will develop a world model of human values, along with all the other things it learns to understand about us. Now we need to get an agentic AI to care about human values and not about anything else (or at least, to act that way): that's the inner alignment problem. We just need to retarget the search.
There are fundamentally only three ways that we know of to train an LLM to do anything (including aligning it):[2]

1. Pretraining: self-supervised SGD (next-token prediction) on a large dataset.
2. Fine-tuning: further SGD on a smaller, curated dataset.
3. Reinforcement learning, e.g. RLHF or Constitutional AI.
The third of these is currently the most common approach used for alignment.
The people who originally came up with the inner alignment[3] vs. outer alignment subdivision were thinking in the context of a reinforcement learning approach (as the choice of the phrase "objective function of the training process" in the LW definition attests). As Eliezer Yudkowsky's Sequences argued at length, and as more recent major survey papers,[4] such as Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (2023) by Stephen Casper, Xander Davies et al., have cataloged in exhaustive detail, reinforcement learning is a very challenging technique to get right. The basic problem with RL is that it's inherently adversarial: it involves an interaction between two systems, a learner and a rater, where the learner is trying to learn how to get a good rating, and the rater is trying to ensure that the only way that the learner can get a good rating is by actually learning the desired behavior. Any flaw in the rater's ratings that lets the learner score better than it deserves (and that isn't actually harder to exploit than just doing the desired behavior) can, and almost certainly will, be ruthlessly exploited by the learner. So RL is inherently just begging to fail via Goodhart's Law:[5] even if the ratings are correct almost everywhere, the learner is searching for any area where they are significantly overestimated, or any means of inducing overestimation errors from the rater, and will enthusiastically exploit any exploitable errors that it can find.[6]
Since for alignment the desired behavior requires (in some cases super-humanly) intelligently doing the right thing according to criteria as messy, complex, and fragile as human values, using human raters is both expensive and fallible: humans make mistakes, are vulnerable to manipulations such as sycophancy or flattery that encourage errors in a particular direction, and are less smart than any superintelligent learner they're trying to rate. On the other hand, trying to devise, construct, or train an automated rating system is inherently challenging, and sufficient reliability for adversarial use during RL requires that the rater be much smarter than the learner, so that it's unlikely to have any flaws that the learner can find and exploit — which makes RL impractical for training any frontier system, since we can't build a rater much smarter than the frontier learner.
The inner alignment challenges of using RL to train very smart learners have been discussed at great length on LessWrong and the Alignment Forum for a long time, and many of them seem insurmountable. We are taking an SGD-learned simulation of human behavior (which is already agentic but has an optimization target that differs significantly in many ways from aligned behavior) and using it to cold-start an RL training process whose base optimization target is well-aligned behavior. As the authors of The Inner Alignment Problem point out, the problem with this is that there is no guarantee that the optimization target of the mesa-optimizer trained by an RL process will match the target of the rater: it may just learn proxies for it. So, any alignment approach that uses reinforcement learning (which includes many techniques currently in widespread use, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI) is going to be inherently dangerous; and as AI nears and then exceeds human capabilities this problem gets rapidly worse, because creating an unexploitable rater gets harder for us. Thus we are going to have to stop trying to use RL for alignment — it's not workable for frontier AGI or ASI.
That leaves just approaches 1: pretraining, and 2: fine-tuning. The approach of safety pretraining is simply to pretrain on a dataset that contains many labelled examples of each of two similar-but-in-places-different types of agentic behavior: human behavior and aligned AI behavior. Since these are similar, we would expect strong positive transfer between the two SGD tasks of learning to do next-token prediction on both of them. We should then get a model capable of simulating two different categories of mesa-optimizer personas: ones with human-like goals and ones with aligned-AI-like goals. Then at inference time, we conditionally generate an example of aligned AI behavior.
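A minimal sketch of that recipe, with hypothetical control tags and a hypothetical `lm.generate` interface:

```python
# Sketch of safety pretraining as described above: one corpus containing two
# labelled classes of agentic behavior (human, aligned AI), trained with plain
# next-token prediction, then conditioned on the aligned-AI class at inference.
# Tag names and the lm interface are hypothetical.

HUMAN_TAG, ALIGNED_AI_TAG = "<|human|>", "<|aligned_ai|>"

def build_corpus(human_docs, aligned_ai_docs):
    # Both classes go into ordinary self-supervised pretraining; the similarity
    # between them is what gives positive transfer between the two tasks.
    return ([f"{HUMAN_TAG}{d}" for d in human_docs] +
            [f"{ALIGNED_AI_TAG}{d}" for d in aligned_ai_docs])

def generate_aligned(lm, prompt, **kwargs):
    # Conditional generation: prefix the aligned-AI tag rather than the human one.
    return lm.generate(f"{ALIGNED_AI_TAG}{prompt}", **kwargs)
```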
SGD (whether for pretraining or fine-tuning) is not adversarial: it's an exercise in curating a training set that demonstrates the desired behavior, not building a rating system to rate any possible input (including adversarial ones) for its desirability. If your training set is less than perfect, a system trained from it is also likely to behave less than perfectly — but unlike reinforcement learning, there is no adversarial incentive in the training process that encourages the learner to find and ruthlessly exploit any small flaw. If your training set is 99% good and 1% bad, then a-priori from a cross-entropy loss you would expect a (sufficiently high-capability) AI trained from it to have a behavior distribution that was also somewhere around 99% good and 1% bad, at least inside the training distribution: fundamentally, modulo prompting, in self-supervised SGD, the distribution you train on is the distribution you get.[7]
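That last claim can be made slightly more precise via the standard decomposition of the expected cross-entropy loss:

```latex
% Why "the distribution you train on is the distribution you get" (in expectation):
\mathbb{E}_{x \sim p}\left[-\log q_\theta(x)\right] \;=\; H(p) \;+\; D_{\mathrm{KL}}\!\left(p \,\|\, q_\theta\right),
% so the loss is minimized exactly when q_theta = p, i.e. when the model's output
% distribution matches the training distribution (capacity and optimization permitting).
```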
99% good behavior is not perfect, but we have managed to build functional human societies out of unreliably-trustworthy humans, and I'm fairly confident that if we had AIs whose moral judgement and alignment could be relied upon even just 90% of the time, we could construct more reliable systems out of multiple AIs (or multiple runs of the same AI with differing prompts or LoRAs), likely using techniques such as majority voting, debate, cross-checks, checks-and-balances, and fault-tolerance protocols. Converting 'pretty reliable' into 'very reliable' is a well-studied problem, in both software and organizational contexts.
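As a toy illustration of that last point (assuming independent errors, which is admittedly a strong assumption):

```latex
% Majority vote of three judges, each correct 90% of the time, fails only if at
% least two of the three err:
P(\text{majority wrong}) = 3\,(0.1)^2(0.9) + (0.1)^3 = 0.028,
% i.e. about 97% reliability from 90%-reliable components; with five judges,
P(\text{majority wrong}) = \sum_{k=3}^{5} \binom{5}{k} (0.1)^k (0.9)^{5-k} \approx 0.0086 .
```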
Both the papers that I link to above test the pretraining approach to alignment against the fine-tuning approach — and they repeatedly and consistently find that the pretraining approach wins by significant margins. As one might expect, using a larger alignment training set induces more reliable behavior. So we now know how best to align AI: safety pretraining is the most effective and least dangerous approach. Thus, inner alignment is basically solved, alongside outer alignment (in my and many people's opinion). So we have an outline of a complete solution to alignment.
Note that, for both pretraining and fine-tuning, if you're using automated techniques to help curate, filter, or synthesize your training set (which you almost certainly are, especially for the pretraining approach where the dataset is extremely large), then unlike the situation for (online) RL those only need to work well inside your training set distribution — you're not trying to build something that also works well outside that distribution, across any input that a superintelligent learner might devise to abuse it.
On reliability, while no huge pretraining data set is ever going to be perfect, we have a lot of experience at hill-climbing while using SGD: identify the failures that still happen a small proportion of the time, figure out what documents in the pretraining set inspired them and/or what deletions, modifications, or additions could reduce or prevent them, edit the training set, retrain, and iterate. Admittedly, an iteration loop that requires us to pretrain a frontier model again in each cycle is going to be slow and expensive, but both papers' results strongly suggest that we can experiment and iterate via fine-tuning and then, once we have a good solution, transfer that to pretraining for a sizable boost in its reliability. That gives us an inner and outer loop for this hill-climbing process.
It would be a fair point to argue that inner alignment is solved only in theory, and that the practical problem of curating an extremely large pretraining-sized dataset that accurately portrays and teaches both what human values are, and what AI behavior correctly aligned to those human values looks like, remains a large problem. However, that's also a well-understood and partially-solved problem, since it's inherently similar to the problems in pretraining dataset curation and synthetic data generation that many capabilities researchers have been working and making progress on over the entire history of machine learning. We can confidently expect them to continue to improve this, in the era of synthetic data. Reducing alignment to just what looks like a well-known capabilities data science task is dramatic progress.
The safety pretraining approach is also timely. We are fast running out of the highest-quality pretraining data, and will increasingly need to rely on using, or at least supplementing with, synthetic data. The recent paper very explicitly shows how to do alignment using a synthetically augmented dataset, and shows how this can be used to align an LLM's behavior to any desired set of ethical criteria. Note that safety pretraining is a "dual use" technological advance — it would also help us train a better paperclip maximizer, if we wanted to do that: we'd just need to generate a suitable pretraining dataset for it.
There are some other important ideas in these papers that I've skipped over in my argument so far, beyond just the demonstration that the safety pretraining approach is the best: there are also a few techniques that are required to get it to work that well. For instance, both papers demonstrate that it is more effective to train the LLM to understand both aligned behavior (what we want the AI to do), and unaligned behavior (which humans do lots of, so it will encounter), and train it to correctly distinguish the two and label them, then use a conditional generation approach at inference time to make it generate only aligned behavior. So the training distribution needs to include all the unaligned aspects of human behavior. The more recent paper does this at a higher level of sophistication on larger models for more challenging alignment issues, but the results are consistent with those of the earlier paper. This idea is also unsurprising: it matches how we generally raise children: we don't just teach them how to be good, they also learn (on a developmentally appropriate syllabus) what bad behavior is, how to tell the two apart, and why bad behavior is bad. These are important skills, and AI needs them too.
So, I believe inner alignment is solved, in the sense that it has been reduced to just the standard problem of training dataset curation.
Thus, if you haven't yet done so, I strongly recommend you read these two papers.
'Mesa-optimizer' here is an older term for an ML model that is what we would now generally call an agent (or sub-agent): any smart ML system capable of planning and executing appropriate actions to attempt to bring about outcomes that are optimized according to some criterion.
I exclude prompting and in-context learning, since they're not training the LLM, only conditioning its behavior on a context. Human values are complex enough that aligning to them seems likely to require a very large prompt. However, for a more capable agent already sufficiently familiar with human values, or one with a very clear understanding of what aligned behavior is, a more compact prompt might be feasible.
Also, using the same argument as Fundamental Limitations of Alignment in Large Language Models (2024) by Yotam Wolf, Yaom Wies et al., any behavior that a prompt can induce will always be vulnerable to being overwritten by a suitable jailbreak or prompt-injection attack.
The origin of the term "mesa-optimizer" that is generally used in defining inner alignment is (as explained in The Inner Alignment Problem) that your ML training process is, of course, an optimizer, and in some situations it may produce as its output a model that is also an optimizer, i.e. one that acts in an agentic way.
For an LLM, where the pretraining data includes large amounts of data derived from humans, our (evolved) agentic behavior is being distilled into the model by the SGD task of next-token predicting output from us, so the base model produced by this training will be capable of simulating human behavior — i.e. it will be agentic, and thus it will be a mesa-optimizer. Or more accurately, the various human-like personas it simulates (depending on prompting) are individually mesa-optimizers — ones which may optimize somewhat different goals.
The goal of inner alignment is to change the optimization target of these simulated agents/mesa-optimizers from human-like behavior to aligned-AI-like behavior.
Some more major papers that address this topic:
Concrete Problems in AI Safety (2016) by Dario Amodei, Chris Olah, et al.
Managing Extreme AI Risks amid Rapid Progress (2023) by Jan Brauner, Sören Mindermann et al. (coauthors including both Yoshua Bengio and Geoffrey Hinton)
AI Alignment: A Comprehensive Survey (2023–2025) by Jiaming Ji, Tianyi Qiu et al.
Specifically, under Scott Garrabrant's taxonomy of forms of Goodhart's Law phenomena, this is "adversarial Goodhart". For a more mathematical discussion of why adversarial Goodhart very frequently occurs during Reinforcement Learning, see for example the paper Goodhart's Law in Reinforcement Learning (2023) by Jacek Karwowski et al.
This problem is worse for online reinforcement learning, where the learner has control of the distribution of episodes to be rated, and thus the ability to locate and then abuse flaws in the rater's performance no matter where they may be. Whereas in offline reinforcement learning, where the rated episodes are drawn from some other distribution not controlled by the learner, the learner only gets to see and exploit rating errors within whatever distribution of episodes is being used, and thus the rater only needs to do a sufficiently good job of rating across that distribution, rather than absolutely everywhere. So while the relationship between the rater and the learner is still adversarial in both, the learner's advantage over the rater is more constrained in offline RL than in online RL. Thus both are dangerously prone to Goodharting, by somewhat different mechanisms, but online RL is the worse of the two. Unfortunately online RL is what is typically used to align LLMs.
The remaining problem with offline RL is that, while it avoids a distribution shift happening during the RL training, there definitely will be one (with an opportunity for Goodharting) when the learner is actually run, because its trained policy isn't what created the rated episodes set. This is in contrast to distilling agentic behavior from one intelligence to another via SGD pretraining on a dataset, where the distribution you train on is the behavior you get from the trained model, to the extent that it's capable enough to do this (modulo various issues around temperature, regularization, statistical and batch effects, and so forth making the model's copy of the distribution less accurate: the cross-entropy loss encourages the model distribution to match the training distribution, but other factors can distort this).
The cross-entropy objective in SGD produces a model whose behavior-distribution closely approximates the training distribution. So when SGD is distilling a teacher mesa-optimizer, the student will learn to optimize a goal (or distribution of goals) that produces the same distribution of agentic behavior as the teacher. If you have two teachers, a labeled mix of human and aligned-AI behavior, the model will learn to simulate a labeled mix of the same two behaviors, directed at these two goals.
Unlike the situation in RL alignment, where the question is whether the target of the mesa-optimizer matches that of the base optimizer (i.e. of the rater), in SGD the base optimizer's goal is just 'predict the correct token distribution' — it's not agentic in any meaningful sense. So for safety pretraining, the question becomes whether the process of distilling the agentic behavior from the teacher to the student model has been lossy, or oversimplifications have occurred. If so, we presumably need a larger and more diverse training set.
Of course, in safety pretraining, the teacher is itself a simulation, rather than a single AI model: it is the process that produced the Internet, books, etc. (human culture), plus the entirety of whatever process (human and AI-assisted, and likely also iterative) we use to curate and supplement our dataset with examples of aligned AI behavior. Should that dataset, for example, have minor internal inconsistencies, such that it implies a distribution of goals near aligned AI behavior, then we would expect the distillation process to produce a base model that simulates a similar distribution of personas with goals near aligned AI behavior (as modulated by prompting).
If the student performs well in distribution, against held-out samples from the training set distribution, then the remaining concern is whether the optimization target of the aligned-AI student might actually differ from that of the AI-aligned teacher (as expressed in the synthetic data), while being similar enough to cause matching distributions of behavior across the entire AI-aligned training distribution. Or, since in practical terms the definition of the teacher is just the training set, perhaps it would be more useful to think of this as the behavior of the teacher not being entirely well-defined outside the training set, in situations sufficiently novel that there is no single clear extrapolation (having low Kolmogorov complexity) from the behavior inside the training set. Then some subsequent distribution shift taking us outside the training distribution might cause the match to fail (or perhaps we should say, the student's behavior to be unpredictable since the teacher's behavior is not well defined), via Goodhart's Law.
Extrapolating successfully across distribution shifts to outside the training distribution is a generic problem inherent to every ML model, so this is not a problem we can hope to find a complete solution to. However, in general we have observed that more capable models with more complex and sophisticated world models tend to be more robust to distribution shifts.
As I mentioned in a previous footnote, a key disadvantage of RL for alignment is that it inherently tends to cause distribution shifts: for online RL during training, or for offline RL afterwards once the model is run. Whereas a model trained by SGD has no inherent tendency to leave the training distribution, and will only do so if presented with a prompt that causes it to do so, for example by differing in some relevant way from anything in the training distribution (such as by implying an entirely new category of moral problem). Over time, this will inevitably happen sooner or later, and we will thus need to retrain our models periodically as our society changes, but we already had to do that simply to update their knowledge-base.