In NZ we have biting bugs called sandflies which don't do this - you can often tell the moment they get you.
Yes, that's fair. I was ignoring scale but you're right that it's a better comparison if it is between a marginal new human and a marginal new AI.
Well, yes. The point of my post is just that the number that actually matters is end-to-end energy efficiency, and that is completely comparable to humans.
The per-flop efficiency is obviously worse. But that's irrelevant if AI is already cheaper for a given task in real terms.
I admit the title is a little clickbaity, but I am responding to a real argument (that humans are still "superior" to AI because the brain is more thermodynamically efficient per-flop).
I saw some numbers putting algae at 1-2% efficiency, but that was for biomass rather than dietary energy. Even if you put the brain in the same organism, you wouldn't expect efficiency that good. The difference is that creating biomass (which is mostly long chains of glucose) is only the first step; the brain then has to use the glucose, which is a second lossy step.
But I mean, there are definitely far-future biopunk options, e.g. I'd guess it's easy to create some kind of solar-panel organism which grows silicon crystals instead of using chlorophyll.
Fully agree - if the dog were only trying to get biscuits, it wouldn't continue to sit later in its life when you are no longer rewarding that behavior. Training dogs is actually some mix of the dog consciously expecting a biscuit and raw updating on the actions previously taken.
Hear "sit" -> get biscuit -> feel good
becomes
Hear "sit" -> feel good -> get biscuit -> feel good
becomes
Hear "sit" -> feel good
At which point the dog likes sitting; it even reinforces itself. You can stop giving biscuits and start training something else.
This is a good post; it definitely shows that these concepts are confused. In a sense, both examples are failures of both inner and outer alignment.
Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.
If we could use negentropy as a cost, rather than computation time or energy use, then the system would be genuinely bounded.
Gender seems unusually likely to have many connotations and thus redundant representations in the model. What if you try testing some information the model has inferred, but which is only ever used for one binary query? Something where the model starts off not representing that thing, and where, if it represents it perfectly, it will only ever change one type of thing. Like, say, whether the text is British or American English? Although that probably has some other connotations. Or whether the form of some word ("lead", say) is being used as a verb or a noun.
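To make the experiment concrete, here's roughly the shape I have in mind - a linear probe on hidden activations. The arrays below are random placeholders, not real model data:

```python
# Sketch of the probing experiment suggested above. The arrays here are
# random placeholders - in a real test, `activations` would be hidden states
# from the model and `is_british` the known label for each text.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))   # stand-in for real model activations
is_british = rng.integers(0, 2, size=1000)   # stand-in for real labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, is_british, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
# With placeholder data this is ~chance; the interesting question is how far
# above chance it gets on real activations, and at which layer.
```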
Agree that gender is a more useful example, just not one that necessarily provides clarity.
Yeah, I think this is the fundamental problem, but this is a very simple way to state it. Perhaps useful for someone who doesn't believe AI alignment is a problem?
Here's my summary: even at the limit of the amount of data and variety you can provide via RLHF, when the learned policy generalizes perfectly to every new situation you can throw at it, the result will still almost certainly be malign. There are still near-infinite such policies, and each behaves differently on the infinitely many remaining kinds of situation you didn't manage to train on. Because the particular policy you get is just one of many, it is unlikely to be the correct one.
But more importantly, behavior upon self-improvement and reflection is likely something we didn't test, because we can't. The alignment problem now requires that we look into the details of generalization. This is where all the interesting stuff is.
Pre-training vs Reinforcement Learning
Pre-training
In pre-training, the model learns to mirror a dataset.
Pre-training trains a model on a dataset - the model is trained to match a set of tokens/actions, but it never acts on its own outputs during training. Any biases (relative to the dataset) towards one action/output sequence or another cause poorer performance, so the model comes out relatively unbiased (relative to the dataset). In other words, the model comes out as a close representation of the dataset.
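As a toy illustration (nothing to do with real LMs), the simplest possible version of this is a bigram model fit by counting - training only pushes the model to match the statistics of the data:

```python
# Fit a bigram "model" by counting - the purest form of "the model comes out
# as a close representation of the dataset". During fitting it never acts;
# it is only pushed towards the dataset's statistics.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

model = {
    prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
    for prev, nxts in counts.items()
}
print(model["the"])   # roughly {'cat': 0.67, 'mat': 0.33} - mirrors the corpus
```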
At inference time, the model can be used to produce actions/output sequences - it will then receive its own outputs as input - and any bias in the model can lead to a compounding bias across the sequence (in practice, this is towards repetitiveness in base models).
The model is not robust in inference - sampling injects noise and it quickly goes off distribution.
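Here's a toy illustration of that compounding (a made-up two-token "model", not a real LM): a small bias towards continuing a repeated run is nearly harmless when the model is fed real data, but once it feeds on its own outputs the bias compounds and generation collapses into repetition:

```python
# Toy autoregressive "model" over two tokens. The dataset repeats the previous
# token with probability 0.5; the model is slightly more likely to repeat the
# longer the current run already is. That conditional bias feeds on itself
# once the model consumes its own outputs.
import random

def model_repeat_prob(run_length):
    # Small bias that grows with the length of the current repeated run.
    return min(0.5 + 0.05 * run_length, 0.99)

def generate(n_steps, seed=0):
    random.seed(seed)
    run_length, repeats = 1, 0
    for _ in range(n_steps):
        if random.random() < model_repeat_prob(run_length):  # emit the same token again
            run_length += 1
            repeats += 1
        else:                                                 # emit a different token
            run_length = 1
    return repeats / n_steps

print("repeat rate in free-running generation:", generate(100_000))
# The dataset's repeat rate is 0.5; the self-fed model's rate ends up far higher.
```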
Reinforcement Learning
In reinforcement learning, the model learns to succeed at a goal.
Reinforcement learning trains a model "recursively" - on its own outputs/decisions. The model produces a sequence, and repeatedly receives its own outputs as input during training. If the AI has a "bias" towards one kind of action/output or another, it can compound, and if it succeeds on its task, then that compounding bias gets reinforced. This leads immediately to models that have much stronger compounding biases.
The model is, however, robust in inference - it learns to reduce noise and keep the context within distribution, so that it does not go "off the rails" and can complete tasks.
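A minimal sketch of that difference (a toy REINFORCE setup, not any real training stack): the policy below is only ever updated on sequences it sampled itself, and whatever it happened to be doing when the reward came in is exactly what gets reinforced:

```python
# Toy REINFORCE loop: one Bernoulli logit per position, rewarded for matching
# a target pattern. Every update is computed from the model's own rollout.
import numpy as np

rng = np.random.default_rng(0)
seq_len, lr = 6, 0.1
target = np.array([0, 1, 0, 1, 0, 1])
logits = np.zeros(seq_len)

for _ in range(2000):
    probs = 1 / (1 + np.exp(-logits))                      # P(token == 1) per position
    rollout = (rng.random(seq_len) < probs).astype(int)    # the model's own output
    reward = (rollout == target).sum()                     # stand-in for task success
    logits += lr * reward * (rollout - probs)              # REINFORCE: reward * grad log p

print("P(token == 1) after training:", np.round(1 / (1 + np.exp(-logits)), 2))
# The policy collapses onto the rewarded pattern: low-noise, on-distribution
# outputs are what got reinforced, so that is what it keeps producing.
```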
Good Compounding Biases
Some of these biases are important, such as diversity (defeating the base model's tendency to repeat) and error correction (defeating the base model's tendency to "go along with" accidental errors).
Interesting Compounding Biases
However, some are much more interesting - for example, "adding positive spin." In the pre-training dataset, statistically speaking, both participants in a chat were probably somewhat depressed; in the resulting model after reinforcement learning, the second participant is relentlessly positive and helpful. If you make it talk to itself, it goes off the rails into increasingly positive slop.
Gaslighting AI Overseers
Another example: it might be the case that during training, the chance that the model gets the maths question right is much lower than the chance that it gaslights its AI "marker" into believing the answer is right. This trains a model to spend a portion of its token budget producing some working for the maths problem, and the rest of the budget telling the AI marker model all the myriad reasons why the answer is correct.
This doesn't have to work every time; it just has to give a better chance of being marked correct than actually focusing on the maths would. So the model learns to lie more often than it would from the pre-training dataset alone.
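To see why this gets learned, here's a toy expected-reward calculation (all the numbers are invented): a fixed token budget is split between actually doing the maths and persuading an imperfect marker, and training simply favours whatever split gets marked correct most often:

```python
# Invented curves: genuine solving saturates at 30%, while persuading an
# imperfect marker to accept a wrong answer can reach 80%.
def p_correct(work_tokens):
    return min(0.3, 0.01 * work_tokens)

def p_fools_marker(persuasion_tokens):
    return min(0.8, 0.02 * persuasion_tokens)

budget = 100
best = max(
    range(budget + 1),
    key=lambda w: p_correct(w) + (1 - p_correct(w)) * p_fools_marker(budget - w),
)
print("tokens on the maths:", best, "| tokens on persuading the marker:", budget - best)
# The reward-maximising split does some genuine working, then spends the rest
# of the budget on persuasion - being *marked* correct, not *being* correct,
# is what gets reinforced.
```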
"Contingent" Training
It's also possible that reinforcement learning is "contingent" on random factors. Let's imagine two models:
Model A, which by chance focused on the maths early in training, might continue to focus on the maths and find that any tokens spent on gaslighting hurt its performance at getting the correct answer - so it learns to get the correct answer.
Model B, due to random factors, tends to hype itself up early in training. Because its route to getting marked correct is tricking the marker AI, any tokens spent on trying to do the actual maths tend to hurt its overall chances. So it learns to lie profusely in an excessively positive and bizarre fashion.
Luckily, in practice this scenario doesn't actually seem to be contingent - the best solution (as evidenced by current SOTA models) is to do a mix of both.
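Even though it doesn't seem to play out that way in practice, here is a toy simulation of the contingency story (everything about it is invented): committing fully to either strategy gets marked correct half the time, mixing them never does, and REINFORCE locks in whichever strategy got lucky first:

```python
# Toy contingency demo: each rollout is n_tokens choices between "maths" and
# "hype" tokens. Mixing never gets marked correct; committing to either one
# works half the time. Positive feedback locks in the early winner.
import numpy as np

def train(seed, steps=3000, lr=0.3, n_tokens=5):
    rng = np.random.default_rng(seed)
    logit = 0.0                                   # P(hype token) = sigmoid(logit)
    for _ in range(steps):
        p = 1 / (1 + np.exp(-logit))
        rollout = rng.random(n_tokens) < p        # True = hype token, False = maths token
        if rollout.all() or not rollout.any():    # committed to a single strategy
            reward = float(rng.random() < 0.5)    # marked correct half the time
        else:
            reward = 0.0                          # mixed strategies never get marked correct
        logit += lr * reward * (rollout.astype(float) - p).sum()   # REINFORCE update
    return 1 / (1 + np.exp(-logit))

for seed in range(6):
    print(f"seed {seed}: final P(hype token) = {train(seed):.2f}")
# Every run commits to one strategy or the other; which one it commits to
# depends on early luck - run a few seeds and you'll typically see both.
```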
Compounding Biases Are Values
If (as I stated) a reinforcement learning model reduces noise in its "environment" so that it can succeed at a task, then this is essentially what having values is.
This is (one example of) what it means for an AI model to have emergent values. If trained in a sophisticated enough environment (such as with access to a virtual machine and the internet) then the model can absolutely be learning to bring "the real world" in line with its expectations in order to achieve a goal.
Pre-training on Reinforcement Learning Rollouts
The original paper I remember seeing this in was Google DeepMind's Gato, which pre-trained a transformer on something like ~600 tasks, including rollouts of video games from separate reinforcement learning systems.
It seems likely to me that if we want to avoid AI systems having values, we can do the following: do the reinforcement learning separately, process the resulting rollouts, and then pre-train the model we actually use on those processed rollouts.
Processing the rollouts is important for actually improving the resulting model.
The key alignment-relevant piece is that, hopefully, we can iterate and get better and better models that have only gone through pre-training. It may be the case that pre-training on reinforcement-learned rollouts gives rise to deception, but hopefully rollout processing prevents this.
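As a toy, end-to-end version of what I mean (reusing the REINFORCE toy from earlier; none of this reflects a real training stack): do the reinforcement learning off to the side, keep only the rollouts we are happy with, and fit the model we actually keep by pure prediction on them:

```python
# 1) RL on a throwaway policy, 2) collect and filter its rollouts,
# 3) fit the kept model by maximum likelihood on the filtered rollouts only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, target = 6, np.array([0, 1, 0, 1, 0, 1])

# 1. Reinforcement learning phase (the throwaway policy).
logits = np.zeros(seq_len)
for _ in range(2000):
    probs = 1 / (1 + np.exp(-logits))
    rollout = (rng.random(seq_len) < probs).astype(int)
    reward = (rollout == target).sum()
    logits += 0.1 * reward * (rollout - probs)

# 2. Collect rollouts and process them - here, keep only fully-correct ones
#    (a stand-in for whatever rollout processing we actually want).
probs = 1 / (1 + np.exp(-logits))
rollouts = (rng.random((1000, seq_len)) < probs).astype(int)
kept = rollouts[(rollouts == target).all(axis=1)]

# 3. "Pre-train" the kept model: for this per-position Bernoulli toy, maximum
#    likelihood on the kept rollouts is just the empirical mean.
pretrained_probs = kept.mean(axis=0)
print("kept rollouts:", len(kept), "| pre-trained P(token == 1):", pretrained_probs)
```

The kept model only ever does prediction; anything filtered out of the rollouts (in the real case, hopefully the deceptive behaviour) never reaches it.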