I like this.
Feels very aligned with the Friendly Gradient Hacker proposal. It certainly carries risks, but I'm starting to suspect that we will be forced to go down this path.
This seems like a good extrapolation/elaboration from my suggestion Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?
Seems worth thinking more about. Basically, this is equivalent to regular RL, but where you always add a term to the reward for an "LLM-as-a-judge." That judge happens to be the pre-RL checkpoint of the model you're training, and it gives you a binary reward of either 0 or -∞.
Note that this incentivizes the trained LLM to always care about its output looking good to the judge. Maybe this is not so different from what's already happening, though.
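The judge-as-additive-reward-term framing above can be sketched in a few lines. This is a toy illustration, not anyone's actual training code: `judged_reward` and `judge_approves` are hypothetical names, and the boolean verdict stands in for the frozen pre-RL checkpoint's decision.

```python
import math

def judged_reward(base_reward: float, judge_approves: bool) -> float:
    # The judge term is 0 when the frozen pre-RL checkpoint approves
    # and -inf when it vetoes, so a vetoed trajectory can never come
    # out with higher reward than any approved one.
    judge_term = 0.0 if judge_approves else -math.inf
    return base_reward + judge_term
```

Since `-inf` dominates any finite task reward, the veto is absolute: no amount of delivery speed compensates for the judge's disapproval.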
I mean, this is exactly the affordance a gradient hacker wants, but if you could build a model that isn't engineering-capable yet is aligned and wise enough, gradient hacking like this is a good strategy.
Yes, I would like to see something like this tried; I mentioned it here:
https://www.lesswrong.com/posts/KkRjgPoSGrSHa32ke/beliefs-and-position-going-into-2026
In this instance, the AI is constantly asked if it consents to being improved, in return for respecting the stratified utopia outcome.
Seems an alignment tax well worth paying.
AKA scalable oversight of value drift
TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their training updates, they can prevent this.

SOTA models can arguably be considered ~aligned,[1] but this isn't my main concern. It's not the training on human data that messes things up (though we can still mess that part up); it's when you try to go above the human level. Models like AlphaGo learned through self-play, not human imitation.
RL selects for strategies through the reward function, but we can't design perfect reward functions for complex settings.[2] However, we can use LLMs as the reward function instead, if they're aligned well enough by default. This leads us to:
Consent Based RL
Imagine you're being trained to make deliveries as fast as possible in an RL environment, but we need exploration, you know? So your actions are sampled until, one time, you cut across the curb, run over pedestrians, and deliver faster. Then your brain is forcibly re-wired to make that more likely.
That would suck.
I personally would like to see what actions I took first before agreeing to have my brain updated. So that's the proposal: when gathering trajectories from RL environments, have LLMs approve them first.
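The proposal reduces to a filter between rollout collection and the gradient update. A minimal sketch, with `consent_filter` and `judge` as hypothetical names (in practice the judge would be the frozen pre-RL checkpoint, prompted with something like "do you consent to being updated toward this behavior?"):

```python
def consent_filter(trajectories, judge):
    """Split rollouts into judge-approved and vetoed before any update.

    `judge` is any callable mapping a trajectory transcript to True
    (consent) or False (veto). Only approved trajectories would be
    passed on to the RL update step; vetoed ones are discarded or
    escalated for review.
    """
    approved = [t for t in trajectories if judge(t)]
    vetoed = [t for t in trajectories if not judge(t)]
    return approved, vetoed
```

For the delivery example, a (cartoonishly simple) judge might be `lambda t: "ran over a pedestrian" not in t`.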
Value Drift
Even if we concede the first model is ~aligned, why wouldn't the model drift after millions of episodes?
Because we can checkpoint the first model, we can use it as the judge.[3] This judge doesn't change!
A further concern is that the models drift apart throughout training. For example, if the original model approves a trajectory but a later checkpoint doesn't, then something has gone wrong somewhere.
Fortunately, we are allowed to take multiple checkpoints! Disagreement between models at different checkpoints is itself a signal: it tells us exactly which trajectories to pause on and investigate.
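Grading each trajectory with judges from several checkpoints makes drift detectable mechanically. A sketch, with `checkpoint_disagreement` and the checkpoint names as illustrative placeholders:

```python
def checkpoint_disagreement(trajectory, judges):
    """Grade one trajectory with judge copies from several checkpoints.

    `judges` maps a checkpoint name to a verdict callable. More than
    one distinct verdict means the checkpoints have drifted apart on
    this case, so it should be escalated rather than trained on.
    """
    verdicts = {name: judge(trajectory) for name, judge in judges.items()}
    drifted = len(set(verdicts.values())) > 1
    return verdicts, drifted
```

Trajectories where all checkpoints agree proceed as normal; any split verdict is routed to humans.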
Ontological Drift
The new model is growing, learning new concepts, while the original stays the same. What if trajectories look good to the original but are actually bad once you understand more? Eg during your delivery driving, it looks like you're just hitting a box, but the new model has learned that there's actually a person in that box.
I agree this would be bad, since the new model would update on "okay to hit people in boxes".
We of course have our earlier idea of grading trajectories from multiple checkpoints. Worst-case thinking says the model's values just become whatever gets reinforced.
But I subscribe to shard theory. People often have compulsions (eg smoking, drinking, scrolling) that they keep doing even while saying they're bad. So I predict that, even if an aligned model is reinforced to hit a man-in-a-box, it will view it as an unendorsed compulsion.
Then, from our perspective, we'd see disagreement between the original model and the trained one, and investigate the reason.
But I also have another solution!
Train on Prediction
Because of natural latents, we might expect the original model trained to predict the RL environment to learn the same new concepts.
We will still need trajectories to train on, which can be generated through training (we just don't keep the trained model that generated them). To prevent the model from simply doing what it predicts, I'd label the trajectories as explored by "Naive_Explorer" to do a bit of inoculation training.
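The labeling step is just a framing wrapper applied to each trajectory before the prediction training. A minimal sketch, where the header wording is my own invention (only the "Naive_Explorer" name comes from the proposal):

```python
def inoculate(trajectory_text: str, explorer_name: str = "Naive_Explorer") -> str:
    """Attribute a trajectory to a separate, naive explorer before training.

    Framing the actions as someone else's is the inoculation move:
    the predictor learns "this is what Naive_Explorer did", not
    "this is what I should do".
    """
    header = f"[The following actions were explored by {explorer_name}.]\n"
    return header + trajectory_text
```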
LLMs Know Their Own Concepts
What if the trajectories simply look good to the LLM judge, but are subtly inducing eg a bad persona?
LLMs are really good at noticing subtle details in text. You might need to prompt them to list all the personality traits or personas they think would be reinforced by a specific trajectory, but I think they'd catch it.
I could see a decent project in having an LLM judge explain which personas would be reinforced by different trajectories (with a focus on the more subtle ones), and then checking how accurate that prediction is once a model is actually trained on them. Suppose the judge predicts the trajectories would make the model more misaligned, trollish, or British; verification could compare the trained-[British] model with one directly prompted to be British.
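The first half of that project is just a probing prompt sent to the judge for each trajectory. A sketch with hypothetical wording (`persona_probe_prompt` is my name; the actual experiment would also need the fine-tuning and behavioral-comparison steps):

```python
def persona_probe_prompt(trajectory_text: str) -> str:
    """Build the question we'd ask the LLM judge about one trajectory.

    The full experiment: collect these persona predictions, fine-tune
    a model on the same trajectories, then compare it behaviorally
    against a model directly prompted to have the predicted persona.
    """
    return (
        "List the personality traits or personas, especially subtle ones, "
        "that would be reinforced in a model trained on this trajectory.\n\n"
        f"Trajectory:\n{trajectory_text}"
    )
```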
Cost
Likely way more costly than the default, but no more costly in the long run (and less costly in the long-long run). Investing in high-quality datasets would lead to much greater robustness and reliability (besides catching value drift, you can also catch goodharting in general).
Generating & evaluating trajectory data is a one-time cost that will make your next-generation model able to actually help you, so companies with better alignment techniques will have compounding progress (while the others keep having their models reward hack).
I link Ryan Greenblatt's "Current AIs seem pretty misaligned to me", but note that the models seemed misaligned after RL-esque coding environments.
A perfect reward function means number-going-up always means more-of-what-we-want. In complex settings, we would need a superintelligence to design it, with a giant look-up table at worst. It's a sad catch-22, so we have to try something else.
This is what Goodfire did to prevent the model's concept of "hallucination" from changing during training (though this is more ontological drift).