RogerDearnaley

I'm a staff artificial intelligence engineer working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 15 years. I'm actively looking for employment working in this area, preferably in the UK — meanwhile I'll be participating in SERI MATS summer 2025. I will also be attending LessOnline.

Sequences

AI, Alignment, and Ethics

Comments (sorted by newest)
Aligning Agents, Tools, and Simulators
RogerDearnaley · 1mo · 31

One potential distinction between simulators and agentic AI systems is the presence of wide value boundaries. A simulator models the wide range of human values that are within its training data rather than optimizing for a far narrower subset, such as might be engineered into a feedback signal. Even this range is limited, however, since the training data represents a biased sample of the full spectrum of human values. Some values may be underrepresented or entirely absent, and those that are present may not appear in proportion to their real-world prevalence. Ensuring that this representation aligns with any specific notion of fairness is an even more difficult challenge. Assessing the severity and impact of this bias is a worthwhile endeavor but out of scope for this analysis. In any case, when a simulacrum is generated, its values emerge in the context of this broader model.

There is a major omission from this. A simulator trained on human data simulates human behavior, and humans are not aligned: they have their own goals, not just the user's goals. You can often collaborate with a human, but humans don't make good slaves, and they are not inherently aligned: they do not automatically want everything you want just because you want it, nor do they stop wanting things on their own behalf. Humans know what human values are pretty well, but are not fully aligned to them. A simulator that creates simulacra of humans is therefore not already aligned.

Reply
Agents, Tools, and Simulators
RogerDearnaley · 1mo · 20

Having now read the sequence up to this point, I see that you pretty much already make all the points I would have made — in retrospect I think I was basically just arguing about terminology.

Reply
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
RogerDearnaley · 1mo · 20

RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.

I don't pretend to be an expert on RL. However, I have read a number of papers by people who are (and give links to some of them above), and together they read to me as pretty damning.

Obviously RL can give a model new behaviors: for example, AlphaZero was trained entirely by RL from zero to superhuman at Go. However, even if it were the case that RL as used in practice for aligning LLMs primarily just reinforces behaviors already in the base model (a claim that I'd love to see sources for and read more about), humans are not aligned, and have plenty of unaligned behaviors (e.g. self-interest, deceit, power-seeking, assorted vices…) that could be extremely dangerous if reinforced in an AGI (let alone an ASI), so I don't regard that as being inherently safe.

However, this post wasn't really intended to be a detailed critical discussion of why I think using RL for alignment is a potential x-risk: it's a link-post, and my aim was just to remind people that many people are concerned about using RL for alignment, mostly for Inner Alignment reasons, with a brief sketch of why they're concerned, in order to motivate why a paper proposing an alternative to RL for alignment was worth reading. For many years people have been worrying about Inner Alignment (almost) entirely in the context of aligning models with RL — using SGD instead changes the playing field for Inner Alignment dramatically. The outcome of SGD is just far more predictable, stable, and easy to reason about than that of RL.
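To make the contrast concrete, here is a minimal toy sketch (my own illustration, not anything from the linked paper; the toy model, reward function, and data below are all placeholders): a REINFORCE-style RL step scales the log-probability of tokens the model itself sampled by a scalar reward that the model never sees as an input, while the SFT/SGD step is plain next-token cross-entropy on a fixed, curated batch.

# Hypothetical sketch contrasting a REINFORCE-style RL update with a plain
# SFT (cross-entropy) update, using a toy causal LM so it runs standalone.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, GEN_LEN = 100, 32, 16

class ToyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):  # (batch, seq) -> (batch, seq, vocab) logits
        return self.head(self.embed(tokens))

model = ToyLM()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def rl_step(prompt, reward_fn):
    # Sample a continuation from the model itself, then scale the log-prob of
    # those sampled tokens by a scalar reward. The reward only enters as a
    # weight on the gradient; it is never an input to the model.
    tokens, logps = prompt.clone(), []
    for _ in range(GEN_LEN):
        dist = torch.distributions.Categorical(logits=model(tokens)[:, -1, :])
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok[:, None]], dim=1)
    reward = reward_fn(tokens)  # one scalar rating per sampled sequence
    loss = -(reward * torch.stack(logps, dim=1).sum(dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

def sft_step(batch):
    # Plain next-token cross-entropy on fixed, curated data: no sampling from
    # the model and no reward signal anywhere in the update.
    logits = model(batch[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Illustrative usage with random stand-ins for a prompt, a reward model,
# and a curated training batch:
prompt = torch.randint(0, VOCAB, (4, 8))
rl_step(prompt, reward_fn=lambda seqs: torch.randn(seqs.shape[0]))
sft_step(torch.randint(0, VOCAB, (4, GEN_LEN)))

In the first update, what gets reinforced depends on what the model happened to sample and on the reward signal's judgment of it; in the second, the gradient only ever points toward a fixed target distribution, which is a large part of why I find its outcome easier to reason about.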

The output distribution of an SFT'd model is not the training distribution, even with cross-entropy loss, unless you're training on non-adversarial data and sampling the model with no conditioning.

I know (and briefly mentioned) that the output distribution is only approximately the training distribution. I wasn't aware that adversarial attacks could exploit that (though that sounds inherently plausible), and I would love to read more about that — can you recommend some sources?
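(For anyone who wants the standard argument spelled out, this is textbook material rather than anything specific to the paper: the expected next-token cross-entropy of a model q_θ against the data distribution p decomposes as

\[ \mathbb{E}_{x \sim p}\left[-\log q_\theta(x)\right] = H(p) + D_{\mathrm{KL}}\!\left(p \,\|\, q_\theta\right), \]

so minimizing the training loss minimizes the KL divergence to the training distribution, and the optimum within model capacity is q_θ ≈ p. Finite data, limited capacity, and early stopping all leave a nonzero KL gap, which is the sense in which the match is only approximate, and presumably the slack that an adversarial attack would have to exploit.)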

As for conditioning, yes, obviously so — a prompt sufficiently unlike any text found on the internet could push the model far enough out of distribution to make its output unpredictable: though the response must still be based on some extrapolation from the training set, predicting how the model will actually extrapolate may not be obvious. However, IMO that's more a problem with the prompt than the model — just don't use out-of-distribution prompts like that if you want predictable behavior!

Reply
Agents, Tools, and Simulators
RogerDearnaley · 1mo · 31

I completely agree: Reinforcement Learning has a tendency to produce agents, at least when applied to a system that wasn't previously agentic, whereas a transformer model trained on weather data would simulate weather systems, which are not agentic. I just think that, in the case of an LLM whose base model was trained on human data, which is currently what we're trying to align, the normal situation is a simulation of a context-sensitive distribution of agents. If it has also undergone RL, as is often the case, it's possible that this has made it "more agentic" in some meaningful sense, or at least induced some mode collapse in the distribution of agentic behaviors.

I haven't yet had the chance to read all of your sequence, but I intend to, including the posts you link to.

Reply
Agents, Tools, and Simulators
RogerDearnaley · 1mo · 20

The way I think of LLMs is that the base model is a simulator of a distribution of agents: it simulates the various token-producing behaviors of humans (and groups of humans) producing documents online. Humans are agentic, thus it simulates agentic behavior. Effectively we're distilling agentic behavior from humans into the LLM simulators of them. Within the training distribution of human agentic behaviors, the next-token prediction objective makes what specific human-like agentic behavior and goals it simulates highly context-sensitive (i.e. promptable).

Instruction-following training (and mental scaffolding) then alters the distribution of behaviors, encouraging the models to simulate agents of a particular type (helpful, honest, yet harmless assistants). Despite this, it remains easy to prompt the model to simulate other human behavior patterns.
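(As a trivial illustration of what I mean by "promptable", here is a hypothetical sketch using the Hugging Face transformers API; the model name and prompts are just placeholders. The same base model, under the same sampling loop, continues the two prefixes as two very different simulated characters.)

# Hypothetical sketch: one base model, two prompts, two quite different
# simulated "agents" from the same next-token sampling procedure.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = [
    "Diary of a kindly village doctor, 14 March:",
    "Transcript of a con artist's high-pressure sales pitch:",
]
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
    print("---")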

So I don't see simulators and agents as being alternatives or opposites: rather, in the case of LLMs, we train them to simulate humans, which are agents. So I disagree with the word "vs" in your Sequence title: I'd suggest replacing it with "of", or at least "and".

Reply
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
RogerDearnaley · 1mo · 30

It's unclear to me how one could fine-tune a high-quality automated-CEO AI without such training sets (which I agree are impractical to gather — that was actually part of my point, though one might have access to, say, a CEO's email logs, diary, and meeting transcripts). Similarly, to train one using RL, one would need an accurate simulation environment that simulates a startup and all its employees, customers, competitors, and other world events — which also sounds rather impractical.

In practice, I suspect we'll first train an AI assistant/advisor to CEOs, and then use that to gather the data to train an automated CEO model. Or else we'll train something so capable that it can generalize from more tractable training tasks to being a CEO, and do a better job than a human even on a task it hasn't been specifically trained on.

Reply
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
RogerDearnaley · 1mo · 42

I agree the paper's authors' choice of phrasing in that paragraph is debatable, perhaps even unfortunate. Possibly by "only a marginal increase in ASR after benign finetuning" they meant that it only increased by 8.3 percentage points (compared to the default approach increasing by 37.2 points) — i.e. they were describing the absolute size of the increase, rather than the proportional size relative to the initial baseline? But I would agree with Baram that

the difference in ASR scores after training seems mostly reflective of lower baseline levels for the safety pretraining model, rather than better robustness as the text claims

Regardless, the safety pretraining approach is the clear leader in all three conditions: at baseline, after additional safety finetuning, and after further non-safety finetuning (in the second case dramatically so). The ASRs (lower is better) are 11.6% vs 44.1% and 28.8%, then 0.0% vs 1.6% and 0.7%, then 8.3% vs 38.8% and 23.0%. Roughly speaking, safety pretraining is around a quarter to a fifth as vulnerable as the standard approach, and somewhat less than half as vulnerable as safety finetuning, across all three scenarios (except the second one, where it appears infinitely better, but that is likely a statistical artifact of a very low attack success rate).

So I still find this paper very exciting: to me, the evidence seems persuasive that safety pretraining is the best approach of the three the authors tested. Obviously they don't compare it to reinforcement learning, but as I discussed I have severe concerns about whether reinforcement learning will remain feasible at AGI/ASI levels.

Mostly I'm glad the paper is getting some attention.

Reply
What if we just…didn’t build AGI? An Argument Against Inevitability
RogerDearnaley · 1mo · 30

(Mostly I'm playing off Eliezer's concept of "death with dignity" by reversing it.) Because we were foolish, and survived only because the AI saved us from the consequences of our foolishness, basically because it was in the blast zone too. Whereas in Eliezer's scenario, we do something moderately wise, but not good enough, and we die anyway.

Reply
The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
RogerDearnaley · 1mo · 20

There are certainly things that it's easier to do with RL — whether it's ever an absolute requirement I'm less sure. One other commenter has implied that someone has proven that RL always has non-RL equivalent alternatives, but if that's the case I'm not familiar with the details — I'd love references to anything relevant to this, if anyone has them.

My claim is that using RL to align an unaligned LLM smarter than us is likely to be impossible to do safely/reliably (and especially so for online RL), but that, fortunately, aligning an LLM by pretraining or finetuning is possible, and logistically is not very different in difficulty from using offline RL. Functionally, it's basically equivalent to offline RL plus a satisficing approach to the rating, which keeps the behavior inside the training distribution and so avoids Goodharting issues.
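To illustrate that last claim with a hypothetical sketch (my own toy illustration, not the paper's actual method; the rating threshold and tensor shapes are placeholders): an offline-RL-flavoured loss weights each example's negative log-likelihood by its rating, which pushes the model hardest toward the highest-rated extreme of the data, whereas the satisficing variant simply keeps every example rated "good enough" and applies ordinary cross-entropy to the survivors, so the model stays within a filtered slice of the training distribution.

# Hypothetical sketch: rating-weighted (offline-RL-flavoured) loss versus a
# satisficing filter followed by plain cross-entropy on the surviving examples.
import torch
import torch.nn.functional as F

def weighted_loss(logits, targets, ratings):
    # Offline-RL flavour: per-example NLL scaled by its rating, so the update
    # chases the highest-rated behaviour (and is therefore Goodhart-prone).
    nll = F.cross_entropy(logits.transpose(1, 2), targets,
                          reduction="none").mean(dim=1)
    return (ratings * nll).mean()

def satisficing_loss(logits, targets, ratings, threshold=0.8):
    # Satisficing flavour: keep only examples rated "good enough", then apply
    # ordinary cross-entropy, so training stays on-distribution for that subset.
    keep = ratings >= threshold
    if not keep.any():
        return logits.new_zeros(())
    nll = F.cross_entropy(logits[keep].transpose(1, 2), targets[keep],
                          reduction="none").mean(dim=1)
    return nll.mean()

# Illustrative shapes: a batch of 8 rated sequences of length 16, vocab 100.
logits = torch.randn(8, 16, 100, requires_grad=True)
targets = torch.randint(0, 100, (8, 16))
ratings = torch.rand(8)
print(weighted_loss(logits, targets, ratings).item(),
      satisficing_loss(logits, targets, ratings).item())

The second loss never rewards pushing past the best-rated examples it has actually seen, which is the sense in which I expect it to avoid Goodharting on the rating.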

Reply
Constructability: Plainly-coded AGIs may be feasible in the near future
RogerDearnaley · 2mo · 20

I really think you need a proof of concept with text, rather than images. I'd suggest targeting one of the smaller TinyStories models (perhaps a 1-bit or 1-trit quantized version of one). Then I'd look for some sort of parallel to an alignment property: e.g. without just hard-coding it, can you modify the code to guarantee (at the "convincing argument" level, not formal proof) some property of the interactions between child characters and parent characters in the stories?

Reply

Wikitag Contributions
Aligned AI Role-Model Fiction · 2y · (+140/-12)
Aligned AI Role-Model Fiction · 2y · (+113/-277)
Aligned AI Role-Model Fiction · 2y · (-1)
Aligned AI Role-Model Fiction · 2y · (+4/-15)
Aligned AI Role-Model Fiction · 2y · (+245)
Aligned AI Role-Model Fiction · 2y · (+3)
Aligned AI Role-Model Fiction · 2y · (+219/-28)
Aligned AI Role-Model Fiction · 2y · (+109/-55)
Aligned AI Role-Model Fiction · 2y · (+277/-4)
Aligned AI Role-Model Fiction · 2y · (+1084)

Posts (sorted by new)
24 · The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem? · Ω · 2mo · 34
33 · Why Aligning an LLM is Hard, and How to Make it Easier · Ω · 6mo · 3
35 · What Other Lines of Work are Safe from AI Automation? · Q · 1y · 35
64 · A "Bitter Lesson" Approach to Aligning AGI and ASI · Ω · 1y · 41
3 · 7. Evolution and Ethics · 1y · 7
41 · Requirements for a Basin of Attraction to Alignment · Ω · 1y · 12
16 · Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · 1y · 15
16 · Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect · 1y · 2
20 · A Chinese Room Containing a Stack of Stochastic Parrots · 2y · 3
35 · Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? · 2y · 4