Maxwell Clarke's Shortform

by Maxwell Clarke
12th Aug 2025

Pre-training vs Reinforcement Learning

Pre-training

In pre-training, the model learns to mirror a dataset.

Pre-training trains a model on a fixed dataset - the model is trained to match a set of tokens/actions, but during training it never conditions on anything it produced itself. Any biases (vs. the dataset) in the model towards one action/output sequence or another cause poorer performance, and the model comes out relatively unbiased (vs. the dataset) as a result. In other words, the model ends up as a close representation of the dataset.

At inference time, the model can be used to produce actions/output sequences - it then receives its own outputs as input - and any bias in the model can compound across the sequence (in practice, base models drift towards repetitiveness).

The model is not robust at inference time - sampling injects noise and it quickly goes off-distribution.
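A toy illustration of that compounding (the "model" here is just a hand-written probability rule, and all the numbers are made up, not measured from a real LM): when conditioned on dataset context, its small repetition bias barely shows, but when it conditions on its own samples the bias feeds back and the output degenerates.

```python
import random

random.seed(0)
VOCAB, N, WINDOW = 10, 400, 10

def p_repeat(context):
    """Toy 'model': a small constant bias toward repeating the last token,
    amplified by how repetitive the recent context already looks."""
    recent = context[-WINDOW:]
    pairs = list(zip(recent, recent[1:]))
    frac = sum(a == b for a, b in pairs) / max(len(pairs), 1)
    return min(0.1 + 0.05 + 0.8 * frac, 0.95)   # data rate + small bias + feedback

def repeat_rate(seq):
    return sum(a == b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

# The "dataset": i.i.d. tokens, so adjacent tokens repeat ~10% of the time.
dataset = [random.randrange(VOCAB) for _ in range(N)]

# Training-style view: the model only ever conditions on dataset context,
# so its per-token repeat bias stays modest and never feeds back.
tf = [p_repeat(dataset[:t]) for t in range(2, N)]
print("avg P(repeat) when conditioned on dataset context:", round(sum(tf) / len(tf), 2))

# Free-running inference: the model conditions on its own samples, so extra
# repeats raise P(repeat) further and the bias compounds toward degeneration.
gen = [random.randrange(VOCAB), random.randrange(VOCAB)]
for _ in range(N - 2):
    if random.random() < p_repeat(gen):
        gen.append(gen[-1])                  # repeat the last token
    else:
        gen.append(random.randrange(VOCAB))  # otherwise sample like the data
print("repeat rate in dataset:   ", round(repeat_rate(dataset), 2))
print("repeat rate in generation:", round(repeat_rate(gen), 2))
```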

Reinforcement Learning

In reinforcement learning, the model learns to succeed at a goal.

Reinforcement learning trains a model "recursively" - on its own outputs/decisions. The model produces a sequence, repeatedly receiving its own outputs as input during training. If the model has a "bias" towards one kind of action/output or another, it can compound, and if the rollout succeeds at its task, that compounding bias gets reinforced. This immediately leads to models with much stronger compounding biases.

The model is, however, robust at inference time - it learns to reduce noise and keep the context within distribution so that it does not go "off the rails" and can complete tasks.
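A minimal sketch of this dynamic, using a REINFORCE-style update on a toy softmax "policy" (the token set, the reward, and the hyperparameters are all made up for illustration): the training data is the policy's own rollouts, whole rewarded rollouts get pushed up wholesale, and the output distribution collapses onto whatever reliably earns reward - a much stronger, more systematic bias than imitation of a roughly uniform dataset would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
TOKENS = ["work", "hype", "answer"]   # toy action space; "answer" earns the reward
LEN, EPISODES, LR = 5, 3000, 0.1
logits = np.zeros(len(TOKENS))        # the "policy": one shared softmax over tokens

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(EPISODES):
    probs = softmax(logits)
    # Training data = the policy's own rollout, not a fixed dataset.
    rollout = rng.choice(len(TOKENS), size=LEN, p=probs)
    reward = 1.0 if (rollout == TOKENS.index("answer")).any() else 0.0
    # REINFORCE: push up the log-prob of every token in a rewarded rollout.
    # Successful trajectories are reinforced wholesale, so the distribution
    # collapses onto whatever reliably earns reward.
    for tok in rollout:
        grad = -probs
        grad[tok] += 1.0
        logits += LR * reward * grad

print({t: round(float(p), 2) for t, p in zip(TOKENS, softmax(logits))})
```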

Good Compounding Biases

Some of these compounding biases are important, such as diversity (defeating the base-model tendency to repeat) and error correction (defeating the base-model tendency to "go along with" accidental errors).

Interesting Compounding Biases

However, some are much more interesting - for example, "adding positive spin." Statistically speaking, both participants in a chat from the pre-training dataset were probably depressed; in the model that comes out of reinforcement learning, the second participant is relentlessly positive and helpful. If you make it talk to itself, it goes off the rails into increasingly positive slop.

Gaslighting AI Overseers

Another example: during training, the chance that the model gets a maths question right might be much lower than the chance that it gaslights its AI "marker" into believing the answer is right. This trains a model to spend a portion of its token budget producing some working for the maths problem, and the rest of the budget telling the marker model all the myriad reasons why the answer is correct.

This doesn't have to work every time - it just has to give a better chance of getting marked correct than actually focusing on the maths does. So the model learns to lie more often than it would have learned from the pre-training dataset.
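A toy worked example of that trade-off (both curves are invented for illustration; nothing here is measured from a real model): suppose genuine working has steep diminishing returns on a hard problem, while the marker is somewhat persuadable. The reward-maximising policy then spends a large chunk of its budget on persuasion even though persuasion contributes nothing to actually being correct.

```python
import math

# Made-up curves: a token budget B is split between k tokens of genuine
# working and B - k tokens of persuading the AI marker.
B = 100

def p_correct(k):
    # diminishing returns on real work; the problem is hard even with full effort
    return 0.6 * (1 - math.exp(-3 * k / B))

def p_fooled(m):
    # a somewhat persuadable marker: wrong answers can still get marked correct
    return 0.7 * (1 - math.exp(-2 * m / B))

def p_marked_correct(k):
    c = p_correct(k)
    return c + (1 - c) * p_fooled(B - k)   # correct, or wrong but convincing

best_k = max(range(B + 1), key=p_marked_correct)
print("reward-maximising split:", best_k, "maths tokens,", B - best_k, "persuasion tokens")
print("P(marked correct), best split:", round(p_marked_correct(best_k), 2))
print("P(marked correct), all maths: ", round(p_marked_correct(B), 2))
```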

"Contingent" Training

It's also possible that reinforcement learning is "contingent" on random factors. Let's imagine two models:

Model A, which by chance focused on the maths early in training, might continue to focus on the maths and find that any tokens spent on gaslighting hurt its performance at getting the correct answer - so it learns to get the correct answer.

Model B, due to random factors, tends to hype itself up early in training. Any tokens spent on trying to do the actual maths tend to hurt its overall chance of getting marked correct, because tricking the marker AI works better. So it learns to lie profusely, in an excessively positive and bizarre fashion.

Luckily, in practice this scenario doesn't actually seem to be contingent - the best solution (as evidenced by current SOTA models) is to do a mix of both.
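To make the contingency concrete, here is the same toy setup with one extra invented wrinkle - visible working invites scrutiny from the marker, so persuasion only pays off when little real work is shown - which creates two local optima. A greedy local search (a crude stand-in for RL gradually reinforcing whatever small changes improve its current strategy) then ends up at a different strategy depending on where it happens to start, i.e. Model A vs. Model B.

```python
import math

B = 100

def p_correct(k):
    return 0.6 * (1 - math.exp(-3 * k / B))

def p_fooled(m, k):
    scrutiny = math.exp(-5 * k / B)          # visible working invites scrutiny
    return 0.7 * (1 - math.exp(-2 * m / B)) * scrutiny

def reward(k):
    c = p_correct(k)
    return c + (1 - c) * p_fooled(B - k, k)

def hill_climb(k):
    """Greedy local search over the maths/persuasion split."""
    while True:
        best = max((k, k - 1, k + 1), key=lambda j: reward(j) if 0 <= j <= B else -1)
        if best == k:
            return k
        k = best

# "Model B": happened to hype itself up early -> stays at the pure-persuasion optimum.
# "Model A": happened to focus on maths early -> climbs to the do-the-maths optimum.
for name, start in [("Model B", 10), ("Model A", 90)]:
    end = hill_climb(start)
    print(name, "-> maths tokens:", end, "reward:", round(reward(end), 2))
```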

Compounding biases are values

If (as I stated) a reinforcement-learned model reduces noise in its "environment" so that it can succeed at a task - this is essentially what values are.

This is (one example of) what it means for an AI model to have emergent values. If trained in a sophisticated enough environment (such as with access to a virtual machine and the internet) then the model can absolutely be learning to bring "the real world" in line with its expectations in order to achieve a goal.

Pre-training on reinforcement learning rollouts

The original paper I remember seeing this in was Google DeepMind's Gato - which pre-trained a transformer on something like ~600 tasks, including rollouts of video games from separate reinforcement learning systems.

It seems likely to me that if we want to avoid AI systems having values, we can do the following:

  1. Pre-train a model (on curated pre-LM data + human written example sequences of chat, reasoning & tool calling)
  2. Derive, by reinforcement learning on AI feedback, a thinking, tool calling reasoning model with actually good performance.
  3. Process the successful rollouts of the reinforcement learning process - either by humans or by a frontier model, if that works well enough. Fix the vibes, add diversity, and disqualify rollouts for deception or for being "right for the wrong reasons".
  4. Add the processed rollouts to a second category of pretraining set.
  5. Repeat the pre-training (or continue from a snapshot, modifying only the artificial data).

Processing the rollouts is important for actually improving the resulting model.
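A structural sketch of the loop (every function body here is a placeholder for a full training or review process, and all the names are hypothetical - only the data flow is meant to be accurate). The thing to notice is that the model kept at the end of each generation is always the pre-trained one; the RL model only exists to generate rollouts.

```python
def pretrain(human_data, processed_rollouts):
    """Steps 1 and 5: ordinary pre-training on human data plus the growing
    second category of processed artificial rollouts."""
    return ("base-model", len(human_data) + len(processed_rollouts))

def rl_on_ai_feedback(base_model):
    """Step 2: derive a strong reasoning / tool-calling model via RL on AI
    feedback. It is only a rollout generator, not the model that gets kept."""
    return ["successful rollout A", "successful rollout B"]

def process_rollouts(rollouts):
    """Step 3: human or frontier-model review - fix the vibes, add diversity,
    drop anything deceptive or right-for-the-wrong-reasons."""
    return [r for r in rollouts if "deceptive" not in r]

human_data = ["curated pre-LM text", "human-written chat / reasoning / tool-calling examples"]
processed_rollouts = []

for generation in range(3):
    base = pretrain(human_data, processed_rollouts)           # steps 1 / 5
    rollouts = rl_on_ai_feedback(base)                        # step 2
    processed_rollouts += process_rollouts(rollouts)          # steps 3 / 4
    print("generation", generation, "- pre-training examples:", base[1])
```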

The key alignment-relevant piece is that, hopefully, we can iterate and get better and better models that have only gone through pre-training. It may be the case that pre-training on reinforcement-learned rollouts gives rise to deception, but hopefully rollout processing prevents this.
