I'm a staff AI engineer working with LLMs, and I've been interested in AI alignment, safety, and interpretability for the last 15 years. I'm actively looking for employment in this area, preferably in the UK — meanwhile I'll be participating in SERI MATS summer 2025. I will also be attending LessOnline.
Having now read the sequence up to this point, I see that you pretty much already make all the points I would have made — in retrospect, I think I was basically just arguing about terminology.
RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
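For readers who want the mechanism in that description made concrete, here is a minimal toy sketch (a 5-way categorical policy standing in for an LLM, and a hypothetical reward function standing in for a human/AI rater; not anyone's actual training code). The point to notice is that the reward never enters the policy's forward pass; it only scales the log-probability gradient of behavior the policy itself sampled:

```python
import torch
import torch.nn.functional as F

# Toy illustration only: a 5-way categorical "policy" stands in for an LLM, and
# reward() stands in for a human/AI rater. The reward is never an input to the
# policy; it only weights the log-probability gradient of actions the policy
# itself sampled (REINFORCE-style reinforcement of sampled behavior).

torch.manual_seed(0)
logits = torch.zeros(5, requires_grad=True)    # the "policy" parameters
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward(action: int) -> float:
    # Hypothetical external rating of the sampled behavior.
    return 1.0 if action == 3 else 0.0

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                                   # behavior sampled from the model
    loss = -reward(action.item()) * dist.log_prob(action)    # reinforce what was sampled
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits.detach(), dim=-1))  # probability mass has shifted toward action 3
```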
I don't pretend to be an expert on RL. However, I have read a number of papers by people who are (I link to some of them above), and together they read to me as pretty damning.
Obviously RL can give a model new behaviors: for example, AlphaZero was trained entirely by RL from zero to superhuman at Go. However, even if it were the case that RL as used in practice for aligning LLMs primarily just reinforces behaviors already present in the base model (a claim I'd love to see sources for and read more about), humans are not aligned: they have plenty of unaligned behaviors (e.g. self-interest, deceit, power-seeking, assorted vices…) that could be extremely dangerous if reinforced in an AGI (let alone an ASI). So I don't regard that as inherently safe.
However, this post wasn't really intended to be a detailed critical discussion of why I think using RL for alignment is a potential x-risk: it's a link-post, and my aim was just to remind people that many are concerned about using RL for alignment, mostly for Inner Alignment reasons, and to briefly sketch why, in order to motivate why a paper proposing an alternative to RL for alignment was worth reading. For many years people have been worrying about Inner Alignment (almost) entirely in the context of aligning models with RL — using SGD instead changes the playing field for Inner Alignment dramatically. The outcome of SGD is just far more predictable, stable, and easy to reason about than that of RL.
The output distribution of an SFT'd model is not the training distribution, even with cross-entropy loss, unless you're training on non-adversarial data and sampling the model with no conditioning.
I know (and briefly mentioned) that the output distribution is only approximately the training distribution. I wasn't aware that adversarial attacks could exploit that (though it sounds entirely plausible), and I would love to read more about it — can you recommend some sources?
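For reference, the idealized version of the claim I had in mind (a standard identity, nothing specific to LLMs or to the adversarial case): the cross-entropy objective decomposes as
$$\mathbb{E}_{x \sim p_{\text{data}}}\!\left[-\log q_\theta(x)\right] \;=\; H(p_{\text{data}}) \;+\; D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, q_\theta\right),$$
which is minimized over $\theta$ exactly when $q_\theta = p_{\text{data}}$ (on the support of $p_{\text{data}}$). Finite data, finite capacity, and imperfect optimization are what make the match only approximate in practice.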
As for conditioning, yes, obviously so — a prompt sufficiently unlike any text found on the internet could push the model far enough out of distribution to make its output unpredictable. The response must still be based on some extrapolation from the training set, but predicting how the model will actually extrapolate may not be easy. However, IMO that's more a problem with the prompt than the model: just don't use out-of-distribution prompts like that if you want predictable behavior!
I completely agree: Reinforcement Learning has a tendency to produce agents, at least when applied to a system that wasn't previously agentic, whereas a transformer model trained on weather data would simulate weather systems, which are not agentic. I just think that, in the case of an LLM whose base model was trained on human data, which is currently what we're trying to align, the normal situation is a simulation of a context-sensitive distribution of agents. If it has also undergone RL, as is often the case, it's possible that this has made it "more agentic" in some meaningful sense, or at least induced some mode collapse in the distribution of agentic behaviors.
I haven't yet had the chance to read all of your sequence, but I intend to, including the posts you link to.
The way I think of LLMs is that the base model is a simulator of a distribution of agents: it simulates the various token-producing behaviors of humans (and groups of humans) producing documents online. Humans are agentic, thus it simulates agentic behavior. Effectively we're distilling agentic behavior from humans into the LLM simulators of them. Within the training distribution of human agentic behaviors, the next-token-prediction objective makes the specific human-like agentic behavior and goals being simulated highly context-sensitive (i.e. promptable).
Instruction-following training (and mental scaffolding) then alters the distribution of behaviors, encouraging the models to simulate agents of a particular type (helpful, honest, yet harmless assistants). Despite this, it remains easy to prompt the model to simulate other human behavior patterns.
So I don't see simulators and agents as being alternatives or opposites: rather, in the case of LLMs, we train them to simulate humans, who are agents. So I disagree with the word "vs" in your Sequence title: I'd suggest replacing it with "of", or at least "and".
It's unclear to me how one could fine-tune a high-quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO's email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical.
In practice, I suspect we'll first train an AI assistant/advisor to CEOs, and then use that to gather the data to train an automated CEO model. Or else we'll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn't been specifically trained on.
I agree the paper's authors' choice of phrasing in that paragraph is debatable, perhaps even unfortunate. Possibly by "only a marginal increase in ASR after benign finetuning" they meant that it only increased by 8.3 percentage points (compared to an increase of 37.2 percentage points for the default approach) — i.e. they were describing the absolute size of the increase, rather than the proportional size relative to the initial baseline? But I would agree with Baram that
the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims
Regardless, the safety pretraining approach is the clear leader in all three scenarios: the baseline, after additional safety finetuning, and after further non-safety finetuning (in the second scenario dramatically so). The ASRs are 11.6% vs 44.1% and 28.8%, 0.0% vs 1.6% and 0.7%, and 8.3% vs 38.8% and 23.0% respectively (safety pretraining vs the standard approach and safety finetuning; lower is better). Roughly speaking, safety pretraining is around a quarter to a fifth as vulnerable as the standard approach and somewhat less than half as vulnerable as safety finetuning, across all three scenarios (except the second, where it appears infinitely better, but that's likely a statistical artifact of a low attack success rate).
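Spelling out the arithmetic behind that: relative to the standard approach, 11.6/44.1 ≈ 0.26 (baseline) and 8.3/38.8 ≈ 0.21 (after benign finetuning); relative to safety finetuning, 11.6/28.8 ≈ 0.40 and 8.3/23.0 ≈ 0.36. The middle scenario (0.0% vs 1.6% and 0.7%) gives ratios of exactly zero, which is why I discount it as a likely statistical artifact.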
So I still find this paper very exciting: to me, the evidence seems persuasive that safety pretraining is the best approach of the three the authors tested. Obviously they don't compare it to reinforcement learning, but as I discussed I have severe concerns about whether reinforcement learning will remain feasible at AGI/ASI levels.
Mostly I'm glad the paper is getting some attention.
(Mostly I'm making a play on reversing Eliezer's concept of "death with dignity".) Because we were foolish, and survived only because the AI saved us from the consequences of our foolishness, basically because it was in the blast zone too. In Eliezer's scenario, by contrast, we do something moderately wise, but it's not good enough and we die anyway.
There are certainly things that it's easier to do with RL — whether it's ever an absolute requirement I'm less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that's the case I'm not familiar with the details — I'd love references to anything relevant to this, if anyone has them.
My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that, fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it's basically equivalent to offline RL plus a satisficing approach to the rating that keeps the behavior inside the training distribution and so avoids Goodharting issues.
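To make that "offline RL plus satisficing" claim concrete, here is a deliberately simplified sketch (rate() is a hypothetical stand-in for a human or reward-model rating, and the commented-out train_sft() call stands for an ordinary cross-entropy finetuning loop; neither is any particular library's API):

```python
# Simplified sketch: "offline RL plus a satisficing rating" collapses into
# ordinary supervised finetuning on a rating-filtered dataset.

THRESHOLD = 0.8  # satisfice: keep anything "good enough", rather than maximizing a score

def rate(text: str) -> float:
    # Hypothetical rating in [0, 1] from a human labeller or reward model.
    return 1.0 if "refuses" in text else 0.3

def build_alignment_dataset(candidates):
    # Offline RL would reweight every sample by its reward; here we simply keep
    # the samples that clear a fixed bar. The surviving data stays inside the
    # distribution of behavior that was actually observed, so there is no
    # optimization pressure toward extreme, reward-hacking outliers.
    return [text for text in candidates if rate(text) >= THRESHOLD]

candidates = [
    "Assistant complies with a harmful request.",
    "Assistant politely refuses the harmful request and explains why.",
]
dataset = build_alignment_dataset(candidates)
print(dataset)
# train_sft(base_model, dataset)  # from here on it's just ordinary SFT / continued pretraining
```

The fixed threshold is the satisficing step: because nothing is pushed toward the extreme tail of the rating, the model only ever sees the reward signal indirectly, through which data survives curation.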
I really think you need a proof of concept with text, rather than images. I'd suggest targeting one of the smaller TinyStories models (perhaps a 1-bit or 1-trit quantized version of one). Then I'd look for some sort of parallel to an alignment property: e.g. without just hard-coding it, can you modify the code to guarantee (at the "convincing argument" level, not formal proof) some property of the interactions between child characters and parent characters in the stories?
There is a major omission from this. A simulator trained on human data simulates human behavior. Humans are not aligned: they have their own goals, not just the user's goals. You can often collaborate with a human, but humans don't make good slaves, and they are not inherently aligned: they don't automatically want everything you want just because you want it, while wanting nothing else on their own behalf. Humans know what human values are pretty well, but are not fully aligned to them. A simulator that creates simulacra of humans is not already aligned.