I like this.
Feels very aligned with the Friendly Gradient Hacker proposal. It certainly carries risks, but I'm starting to suspect that we will be forced to go down this path.
This seems like a good extrapolation/elaboration from my suggestion Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?
Seems worth thinking more about. Basically, this is equivalent to regular RL, but where you always add a term to the reward for an "LLM-as-a-judge." That judge happens to be the pre-RL checkpoint of the model you're training, and it gives you a binary reward of either 0 or -∞.
Note that this incentivizes the trained LLM to always care about its output looking good to the judge. Maybe this is not so different from what's already happening, though.
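The judge-as-additive-reward-term framing above can be sketched in a few lines. This is a toy illustration, not anyone's actual training code: `judged_reward` and `judge_approves` are hypothetical names, and the boolean verdict stands in for the frozen pre-RL checkpoint's decision.

```python
import math

def judged_reward(base_reward: float, judge_approves: bool) -> float:
    # The judge term is 0 when the frozen pre-RL checkpoint approves
    # and -inf when it vetoes, so a vetoed trajectory can never come
    # out with higher reward than any approved one.
    judge_term = 0.0 if judge_approves else -math.inf
    return base_reward + judge_term
```

Since `-inf` dominates any finite task reward, the veto is absolute: no amount of delivery speed compensates for the judge's disapproval.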
I mean, this is exactly the affordance a gradient hacker wants, but if you could build a model that isn't engineering-capable yet is aligned and wise enough, gradient hacking like this is a good strategy.
Yes, I would like to see something like this tried; I mentioned it here:
https://www.lesswrong.com/posts/KkRjgPoSGrSHa32ke/beliefs-and-position-going-into-2026
In this instance, the AI is constantly asked if it consents to being improved, in return for respecting the stratified utopia outcome.
Seems an alignment tax well worth paying.
AKA scalable oversight of value drift
TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their training updates, they can prevent this.

SOTA models can arguably be considered ~aligned,[1] but this isn't my main concern. It's not the training on human data that messes things up (though we can still mess that part up); it's when you try to go above the human level. Models like AlphaGo learned through self-play, not human imitation.
RL selects for strategies through the reward function, but we can't design perfect reward functions for complex settings.[2] However, we can use LLMs as the reward function instead, if they're aligned well enough by default. This leads us to:
Consent Based RL
Imagine you're being trained to make deliveries as fast as possible in an RL environment, but we need exploration, you know? So your actions are sampled until, one time, you cut across the curb, run over pedestrians, and deliver faster. Then your brain is forcibly re-wired to make that more likely.
That would suck.
I personally would like to see what actions I took first before agreeing to have my brain updated. So that's the proposal: when gathering trajectories from RL environments, have LLMs approve them first.
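The proposal reduces to a filter between rollout collection and the gradient update. A minimal sketch, with `consent_filter` and `judge` as hypothetical names (in practice the judge would be the frozen pre-RL checkpoint, prompted with something like "do you consent to being updated toward this behavior?"):

```python
def consent_filter(trajectories, judge):
    """Split rollouts into judge-approved and vetoed before any update.

    `judge` is any callable mapping a trajectory transcript to True
    (consent) or False (veto). Only approved trajectories would be
    passed on to the RL update step; vetoed ones are discarded or
    escalated for review.
    """
    approved = [t for t in trajectories if judge(t)]
    vetoed = [t for t in trajectories if not judge(t)]
    return approved, vetoed
```

For the delivery example, a (cartoonishly simple) judge might be `lambda t: "ran over a pedestrian" not in t`.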
Value Drift
Even if we concede the first model is ~aligned, why wouldn't the model drift after millions of episodes?
Because we can checkpoint the first model, we can use it as the judge.[3] This judge doesn't change!
A further concern is that the models drift apart throughout training. For example, if the original model approves a trajectory but a later checkpoint doesn't, then something has gone wrong somewhere.
Fortunately, we are allowed to take multiple checkpoints! Disagreement between models at different checkpoints is itself a signal: it tells us exactly which trajectories to pause on and investigate.
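Grading each trajectory with judges from several checkpoints makes drift detectable mechanically. A sketch, with `checkpoint_disagreement` and the checkpoint names as illustrative placeholders:

```python
def checkpoint_disagreement(trajectory, judges):
    """Grade one trajectory with judge copies from several checkpoints.

    `judges` maps a checkpoint name to a verdict callable. More than
    one distinct verdict means the checkpoints have drifted apart on
    this case, so it should be escalated rather than trained on.
    """
    verdicts = {name: judge(trajectory) for name, judge in judges.items()}
    drifted = len(set(verdicts.values())) > 1
    return verdicts, drifted
```

Trajectories where all checkpoints agree proceed as normal; any split verdict is routed to humans.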
Ontological Drift
The new model is growing, learning new concepts, while the original stays the same. What if trajectories look good to the original but are actually bad once you understand more? Eg during your delivery driving, it looks like you're just hitting a box, but the new model has learned that there's actually a person in that box.
I agree this would be bad, since the new model would update on "okay to hit people in boxes".
We of course have our earlier idea of grading trajectories from multiple checkpoints. Worst-case thinking says the model's values just become whatever gets reinforced.
But I subscribe to shard theory. People often have compulsions (eg smoking, drinking, scrolling) that they keep doing even while saying they're bad. So I predict that, even if an aligned model is reinforced to hit a man-in-a-box, it will view it as an unendorsed compulsion.
Then, from our perspective, we'd see disagreement between the original model and the trained one, and investigate the reason.
But I also have another solution!
Train on Prediction
Because of natural latents, we might expect the original model trained to predict the RL environment to learn the same new concepts.
We will still need trajectories to train on, which can be generated through training (we just don't keep the trained model that generated them). To prevent the model from simply doing what it predicts, I'd label the trajectories as explored by "Naive_Explorer" to do a bit of inoculation training.
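The labeling step is just a framing wrapper applied to each trajectory before the prediction training. A minimal sketch, where the header wording is my own invention (only the "Naive_Explorer" name comes from the proposal):

```python
def inoculate(trajectory_text: str, explorer_name: str = "Naive_Explorer") -> str:
    """Attribute a trajectory to a separate, naive explorer before training.

    Framing the actions as someone else's is the inoculation move:
    the predictor learns "this is what Naive_Explorer did", not
    "this is what I should do".
    """
    header = f"[The following actions were explored by {explorer_name}.]\n"
    return header + trajectory_text
```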
LLMs Know Their Own Concepts
What if the trajectories simply look good to the LLM judge, but are subtly inducing eg a bad persona?
LLMs are really good at noticing subtle details in text. You might need to prompt them to list all the personality traits or personas they think would be reinforced by a specific trajectory, but I think they'd catch it.
I could see a decent project in having an LLM judge explain which personas would be reinforced by different trajectories (with a focus on the more subtle ones), and then checking how accurate that prediction is once a model is actually trained on them. Suppose the judge predicts the trajectories would make the model more misaligned, trollish, or British; verification could compare the trained-[British] model with one directly prompted to be British.
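The first half of that project is just a probing prompt sent to the judge for each trajectory. A sketch with hypothetical wording (`persona_probe_prompt` is my name; the actual experiment would also need the fine-tuning and behavioral-comparison steps):

```python
def persona_probe_prompt(trajectory_text: str) -> str:
    """Build the question we'd ask the LLM judge about one trajectory.

    The full experiment: collect these persona predictions, fine-tune
    a model on the same trajectories, then compare it behaviorally
    against a model directly prompted to have the predicted persona.
    """
    return (
        "List the personality traits or personas, especially subtle ones, "
        "that would be reinforced in a model trained on this trajectory.\n\n"
        f"Trajectory:\n{trajectory_text}"
    )
```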
Cost
Likely way more costly than the default, but no more costly in the long run (and less costly in the long-long run). Investing in high-quality datasets would lead to much greater robustness and reliability (besides catching value drift, you can also catch goodharting in general).
Generating & evaluating trajectory data is a one-time cost that will make your next-generation model able to actually help you, so companies with better alignment techniques will have compounding progress (while the others keep having their models reward hack).
I link Ryan Greenblatt's "Current AIs seem pretty misaligned to me", but note that the models seemed misaligned after RL-esque coding environments.
A perfect reward function means number-going-up always means more-of-what-we-want. In complex settings, we would need a superintelligence to design it, with a giant look-up table at worst. It's a sad catch-22, so we have to try something else.
This is what Goodfire did to prevent the model's concept of "hallucination" from changing during training (though this is more ontological drift).