Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes. This take owes a lot to the Simulators discussion group.

Fine-tuning a large sequence model with RLHF creates an agent that tries to steer the sequence in rewarding directions. Simultaneously, it breaks some nice properties that the model had before fine-tuning. You should have a gut feeling that we can do better.

When you start with a fresh sequence model, it's not acting like an agent; instead, it's just trying to mimic the training distribution. It may contain agents, but at every step it's just going to output a probability distribution that's been optimized to be well-calibrated. This is a really handy property - well-calibrated conditional inference is about as good as being able to see the future, both for prediction and for generation.

The design philosophy behind RLHF is to train an agent that operates in a world where we want to steer towards good trajectories. In this framing, there's good text and bad text, and we want the fine-tuned AI to always output good text rather than bad text. This isn't necessarily a bad goal - sometimes you do want an agent that will just give you the good text. The issue is, you're sacrificing the ability to do accurate conditional inference about the training distribution. When you do RLHF fine-tuning, you're taking a world model and then, in-place, trying to cannibalize its parts to make an optimizer.
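A toy calculation makes the trade-off concrete. This is only a sketch under made-up assumptions: the three continuations, their training-distribution probabilities, the reward, and the "tuned" probabilities are all hypothetical numbers chosen for illustration. It compares how well each model predicts the training distribution (expected log loss) with how much reward each one collects.

```python
import math

def cross_entropy(true_dist, model_dist):
    """Expected log loss (in nats) of model_dist on samples drawn from true_dist."""
    return -sum(p * math.log(model_dist[x]) for x, p in true_dist.items() if p > 0)

def expected_reward(model_dist, reward):
    """Average reward collected when sampling continuations from model_dist."""
    return sum(p * reward[x] for x, p in model_dist.items())

# Hypothetical distribution over continuations in the training data.
training_dist = {"yes": 0.6, "no": 0.3, "maybe": 0.1}

# Idealized base model: perfectly calibrated to the training distribution.
base_model = dict(training_dist)

# Hypothetical fine-tuned model: mass pushed onto the rewarded continuation.
tuned_model = {"yes": 0.98, "no": 0.015, "maybe": 0.005}

# Hypothetical reward that only likes "yes".
reward = {"yes": 1.0, "no": 0.0, "maybe": 0.0}

base_loss = cross_entropy(training_dist, base_model)
tuned_loss = cross_entropy(training_dist, tuned_model)
base_reward = expected_reward(base_model, reward)
tuned_reward = expected_reward(tuned_model, reward)

print(f"log loss on training distribution: base={base_loss:.3f}, tuned={tuned_loss:.3f}")
print(f"expected reward:                   base={base_reward:.3f}, tuned={tuned_reward:.3f}")
```

In this toy setup the tuned model earns more reward but is a strictly worse predictor of the training distribution, which is the sacrifice described above.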

This might sound like hyperbole if you remember that RL with KL penalties is Bayesian inference. And okay, RLHF weights each datapoint much more heavily than the Bayesian inference step does, but there's probably some perspective in which you can see the fine-tuned model as just having weird over-updated beliefs about how the world is. But just like perceptual control theory says, there's no bright line between prediction and action. Ultimately it's about what perspective is more useful, and to me it's much more useful to think of RLHF on a language model as producing an agent that acts in the world of text, trying to steer the text onto its favored trajectories.
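For readers who want the Bayesian reading spelled out, here is a minimal sketch on a toy problem. The three candidate texts, their prior probabilities, and their rewards are invented for illustration. The only thing taken as given is the standard result that maximizing expected reward minus beta times KL(policy || prior) yields a policy proportional to prior(x) * exp(reward(x) / beta), which has the form of a Bayesian update with exp(reward / beta) playing the role of the likelihood.

```python
import math

def kl_regularized_optimum(prior, reward, beta):
    """Return the policy proportional to prior(x) * exp(reward(x) / beta)."""
    weights = {x: p * math.exp(reward[x] / beta) for x, p in prior.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def objective(policy, prior, reward, beta):
    """Expected reward minus beta * KL(policy || prior)."""
    expected_reward = sum(policy[x] * reward[x] for x in policy)
    kl = sum(p * math.log(p / prior[x]) for x, p in policy.items() if p > 0)
    return expected_reward - beta * kl

# Hypothetical prior from the pre-trained model and hypothetical rewards.
prior = {"polite reply": 0.5, "rude reply": 0.3, "wedding invitation": 0.2}
reward = {"polite reply": 1.0, "rude reply": -1.0, "wedding invitation": 2.0}

# Large beta: a gentle update that stays close to the prior.
gentle = kl_regularized_optimum(prior, reward, beta=100.0)
# Small beta: each unit of reward counts for a lot, so the policy collapses
# onto the single highest-reward text.
sharp = kl_regularized_optimum(prior, reward, beta=0.1)

for name, policy in [("gentle", gentle), ("sharp", sharp)]:
    print(name, {x: round(p, 4) for x, p in policy.items()})
```

With a large beta the result still looks like a mildly updated belief; with a small beta it looks much more like an agent that always picks its favorite trajectory, even though both come from the same formula.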

As an agent, it has some alignment problems, even if it lives totally in the world of text and doesn't get information leakage from the real world. It's trying to get to better trajectories by any means necessary, even if it means suddenly delivering an invitation to a wedding party. The real-world solution to this problem seems to have been a combination of early stopping and ad-hoc patches, neither of which inspires massive confidence. The wedding party attractor isn't an existential threat, but it's a bad sign for attempts to align more high-stakes AI, and it's an indicator that we're probably failing at the "Do What I Mean" instruction in other more subtle ways as well.

More seems possible. More capabilities, more interpretability, and more progress on alignment. We start with a perfectly good sequence model, and it seems like we should be able to leverage it as a model, rather than as fodder for a model-free process. Although to any readers who feel similarly optimistic, I would like to remind you that the "more capabilities" part is no joke, and it's very easy for it to memetically out-compete the "more alignment" part.

RLHF is still built out of useful parts - modeling the human and then doing what they want is core to lots of alignment schemes. But ultimately I want us to build something more self-reflective, and that may favor a more model-based approach both because it exposes more interpretable structure (both to human designers and to a self-reflective AI), and because it preserves the niceness of calibrated prediction. I'll have a bit more on research directions tomorrow.

Thanks for the link to porby's post on modularity and goal agnosticism; that's an overlooked goldmine.

So IIUC, would you expect RLHF to, for instance, destroy not just the model's ability to say racist slurs, but its ability to model that anybody may say racist slurs?

Do you think OpenAI's "As a language model trained by OpenAI" is trying to avoid this by making the model condition proper behavior on its assigned role?

"So IIUC, would you expect RLHF to, for instance, destroy not just the model's ability to say racist slurs, but its ability to model that anybody may say racist slurs?"

I usually don't think of it on the level of modeling humans who emit text. I mostly just think of it on the level of modeling a universe of pure text, which follows its own "semiotic physics" (reference post forthcoming from Jan Kirchner). That's the universe in which it's steering trajectories to avoid racist slurs.

I think OpenAI's "as a language model" tic is trying to make ChatGPT sound like it's taking role-appropriate actions in the real world. But it's really just trying to steer towards parts of the text-universe where that kind of text happens, so if you prompt it with the text "As a language model trained by OpenAI, I am unable to have opinions," it will ask you questions that humans might ask, to keep the trajectory going (including potentially offensive questions if the reward shaping isn't quite right - though it seems pretty likely that secondary filters would catch this).