Maxwell Clarke's Shortform

by Maxwell Clarke
12th Aug 2025

Pre-training vs Reinforcement Learning

Pre-training

In pre-training, the model learns to mirror a dataset.

Pre-training trains a model on a fixed dataset - the model is trained to match a set of tokens/actions, but during training it never conditions on anything it produced itself. Any biases (vs. the dataset) in the model towards one action/output sequence or another cause poorer performance, and the model comes out relatively unbiased (vs. the dataset) as a result. In other words, the model ends up as a close representation of the dataset.

At inference time, the model can be used to produce actions/output sequences - it then receives its own outputs as input - and any bias in the model can compound across the sequence (in practice, base models drift towards repetitiveness).

The model is not robust at inference time - sampling injects noise and it quickly goes off-distribution.
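A toy illustration of that compounding (the "model" here is just a hand-written probability rule, and all the numbers are made up, not measured from a real LM): when conditioned on dataset context, its small repetition bias barely shows, but when it conditions on its own samples the bias feeds back and the output degenerates.

```python
import random

random.seed(0)
VOCAB, N, WINDOW = 10, 400, 10

def p_repeat(context):
    """Toy 'model': a small constant bias toward repeating the last token,
    amplified by how repetitive the recent context already looks."""
    recent = context[-WINDOW:]
    pairs = list(zip(recent, recent[1:]))
    frac = sum(a == b for a, b in pairs) / max(len(pairs), 1)
    return min(0.1 + 0.05 + 0.8 * frac, 0.95)   # data rate + small bias + feedback

def repeat_rate(seq):
    return sum(a == b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

# The "dataset": i.i.d. tokens, so adjacent tokens repeat ~10% of the time.
dataset = [random.randrange(VOCAB) for _ in range(N)]

# Training-style view: the model only ever conditions on dataset context,
# so its per-token repeat bias stays modest and never feeds back.
tf = [p_repeat(dataset[:t]) for t in range(2, N)]
print("avg P(repeat) when conditioned on dataset context:", round(sum(tf) / len(tf), 2))

# Free-running inference: the model conditions on its own samples, so extra
# repeats raise P(repeat) further and the bias compounds toward degeneration.
gen = [random.randrange(VOCAB), random.randrange(VOCAB)]
for _ in range(N - 2):
    if random.random() < p_repeat(gen):
        gen.append(gen[-1])                  # repeat the last token
    else:
        gen.append(random.randrange(VOCAB))  # otherwise sample like the data
print("repeat rate in dataset:   ", round(repeat_rate(dataset), 2))
print("repeat rate in generation:", round(repeat_rate(gen), 2))
```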

Reinforcement Learning

In reinforcement learning, the model learns to succeed at a goal.

Reinforcement learning trains a model "recursively" - on its own outputs/decisions. The model produces a sequence, repeatedly receiving its own outputs as input during training. If the model has a "bias" towards one kind of action/output or another, it can compound, and if the rollout succeeds at its task, that compounding bias gets reinforced. This immediately leads to models with much stronger compounding biases.

The model is, however, robust at inference time - it learns to reduce noise and keep the context within distribution so that it does not go "off the rails" and can complete tasks.
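A minimal sketch of this dynamic, using a REINFORCE-style update on a toy softmax "policy" (the token set, the reward, and the hyperparameters are all made up for illustration): the training data is the policy's own rollouts, whole rewarded rollouts get pushed up wholesale, and the output distribution collapses onto whatever reliably earns reward - a much stronger, more systematic bias than imitation of a roughly uniform dataset would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
TOKENS = ["work", "hype", "answer"]   # toy action space; "answer" earns the reward
LEN, EPISODES, LR = 5, 3000, 0.1
logits = np.zeros(len(TOKENS))        # the "policy": one shared softmax over tokens

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(EPISODES):
    probs = softmax(logits)
    # Training data = the policy's own rollout, not a fixed dataset.
    rollout = rng.choice(len(TOKENS), size=LEN, p=probs)
    reward = 1.0 if (rollout == TOKENS.index("answer")).any() else 0.0
    # REINFORCE: push up the log-prob of every token in a rewarded rollout.
    # Successful trajectories are reinforced wholesale, so the distribution
    # collapses onto whatever reliably earns reward.
    for tok in rollout:
        grad = -probs
        grad[tok] += 1.0
        logits += LR * reward * grad

print({t: round(float(p), 2) for t, p in zip(TOKENS, softmax(logits))})
```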

Good Compounding Biases

Some of these compounding biases are important, such as diversity (defeating the base-model tendency to repeat) and error correction (defeating the base-model tendency to "go along with" accidental errors).

Interesting Compounding Biases

However, some are much more interesting - for example, "adding positive spin." Statistically speaking, both participants in a chat from the pre-training dataset were probably depressed; in the model that comes out of reinforcement learning, the second participant is relentlessly positive and helpful. If you make it talk to itself, it goes off the rails into increasingly positive slop.

Gaslighting AI Overseers

Another example: during training, the chance that the model gets a maths question right might be much lower than the chance that it gaslights its AI "marker" into believing the answer is right. This trains a model to spend a portion of its token budget producing some working for the maths problem, and the rest of the budget telling the marker model all the myriad reasons why the answer is correct.

This doesn't have to work every time - it just has to give a better chance of getting marked correct than actually focusing on the maths does. So the model learns to lie more often than it would have learned from the pre-training dataset.
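A toy worked example of that trade-off (both curves are invented for illustration; nothing here is measured from a real model): suppose genuine working has steep diminishing returns on a hard problem, while the marker is somewhat persuadable. The reward-maximising policy then spends a large chunk of its budget on persuasion even though persuasion contributes nothing to actually being correct.

```python
import math

# Made-up curves: a token budget B is split between k tokens of genuine
# working and B - k tokens of persuading the AI marker.
B = 100

def p_correct(k):
    # diminishing returns on real work; the problem is hard even with full effort
    return 0.6 * (1 - math.exp(-3 * k / B))

def p_fooled(m):
    # a somewhat persuadable marker: wrong answers can still get marked correct
    return 0.7 * (1 - math.exp(-2 * m / B))

def p_marked_correct(k):
    c = p_correct(k)
    return c + (1 - c) * p_fooled(B - k)   # correct, or wrong but convincing

best_k = max(range(B + 1), key=p_marked_correct)
print("reward-maximising split:", best_k, "maths tokens,", B - best_k, "persuasion tokens")
print("P(marked correct), best split:", round(p_marked_correct(best_k), 2))
print("P(marked correct), all maths: ", round(p_marked_correct(B), 2))
```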

"Contingent" Training

It's also possible that reinforcement learning is "contingent" on random factors. Let's imagine two models:

Model A, which by chance focused on the maths early in training, might continue to focus on the maths and find that any tokens spent on gaslighting hurt its performance at getting the correct answer - so it learns to get the correct answer.

Model B, due to random factors, tends to hype itself up early in training. Any tokens spent on trying to do the actual maths tend to hurt its overall chance of getting marked correct, because tricking the marker AI works better. So it learns to lie profusely, in an excessively positive and bizarre fashion.

Luckily, in practice this scenario doesn't actually seem to be contingent - the best solution (as evidenced by current SOTA models) is to do a mix of both.
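To make the contingency concrete, here is the same toy setup with one extra invented wrinkle - visible working invites scrutiny from the marker, so persuasion only pays off when little real work is shown - which creates two local optima. A greedy local search (a crude stand-in for RL gradually reinforcing whatever small changes improve its current strategy) then ends up at a different strategy depending on where it happens to start, i.e. Model A vs. Model B.

```python
import math

B = 100

def p_correct(k):
    return 0.6 * (1 - math.exp(-3 * k / B))

def p_fooled(m, k):
    scrutiny = math.exp(-5 * k / B)          # visible working invites scrutiny
    return 0.7 * (1 - math.exp(-2 * m / B)) * scrutiny

def reward(k):
    c = p_correct(k)
    return c + (1 - c) * p_fooled(B - k, k)

def hill_climb(k):
    """Greedy local search over the maths/persuasion split."""
    while True:
        best = max((k, k - 1, k + 1), key=lambda j: reward(j) if 0 <= j <= B else -1)
        if best == k:
            return k
        k = best

# "Model B": happened to hype itself up early -> stays at the pure-persuasion optimum.
# "Model A": happened to focus on maths early -> climbs to the do-the-maths optimum.
for name, start in [("Model B", 10), ("Model A", 90)]:
    end = hill_climb(start)
    print(name, "-> maths tokens:", end, "reward:", round(reward(end), 2))
```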

Compounding biases are values

If (as I stated) a reinforcement-learned model reduces noise in its "environment" so that it can succeed at a task - this is essentially what values are.

This is (one example of) what it means for an AI model to have emergent values. If trained in a sophisticated enough environment (such as with access to a virtual machine and the internet) then the model can absolutely be learning to bring "the real world" in line with its expectations in order to achieve a goal.

Pre-training on reinforcement learning rollouts

The original paper I remember seeing this in was Google DeepMind's Gato - which pre-trained a transformer on something like ~600 tasks, including rollouts of video games from separate reinforcement learning systems.

It seems likely to me that if we want to avoid AI systems having values, we can do the following:

  1. Pre-train a model (on curated pre-LM data + human written example sequences of chat, reasoning & tool calling)
  2. Derive, by reinforcement learning on AI feedback, a thinking, tool calling reasoning model with actually good performance.
  3. Process the successful rollouts of the reinforcement learning process - either by humans or by a frontier model, if that works well enough. Fix the vibes, add diversity, and disqualify rollouts for deception or for being "right for the wrong reasons".
  4. Add the processed rollouts to a second category of pretraining set.
  5. Repeat the pre-training (or continue from a snapshot, modifying only the artificial data).

Processing the rollouts is important for actually improving the resulting model.
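A structural sketch of the loop (every function body here is a placeholder for a full training or review process, and all the names are hypothetical - only the data flow is meant to be accurate). The thing to notice is that the model kept at the end of each generation is always the pre-trained one; the RL model only exists to generate rollouts.

```python
def pretrain(human_data, processed_rollouts):
    """Steps 1 and 5: ordinary pre-training on human data plus the growing
    second category of processed artificial rollouts."""
    return ("base-model", len(human_data) + len(processed_rollouts))

def rl_on_ai_feedback(base_model):
    """Step 2: derive a strong reasoning / tool-calling model via RL on AI
    feedback. It is only a rollout generator, not the model that gets kept."""
    return ["successful rollout A", "successful rollout B"]

def process_rollouts(rollouts):
    """Step 3: human or frontier-model review - fix the vibes, add diversity,
    drop anything deceptive or right-for-the-wrong-reasons."""
    return [r for r in rollouts if "deceptive" not in r]

human_data = ["curated pre-LM text", "human-written chat / reasoning / tool-calling examples"]
processed_rollouts = []

for generation in range(3):
    base = pretrain(human_data, processed_rollouts)           # steps 1 / 5
    rollouts = rl_on_ai_feedback(base)                        # step 2
    processed_rollouts += process_rollouts(rollouts)          # steps 3 / 4
    print("generation", generation, "- pre-training examples:", base[1])
```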

The key alignment-relevant piece is that, hopefully, we can iterate and get better and better models that have only gone through pre-training. It may be the case that pre-training on reinforcement-learned rollouts gives rise to deception, but hopefully rollout processing prevents this.
