Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I’m trying out making a few posts with less polish and smaller scope, to iterate more quickly on my thoughts and write about some interesting ideas in isolation before having fully figured them out. Expect middling confidence in any conclusions drawn, and occasionally just chains of reasoning without fully contextualized conclusions.

I figured a good place to start would be expanding slightly on the content in this comment of mine. As far as I know it’s not a common frame and on further thought feel like there’s a fair amount of potential in it, although it’s possible it’s old news or explains too much.

In Mysteries of mode collapse, Janus points out that if you ask GPT-3 (specifically, text-davinci-002) whether bugs are real, you often get something like this:

This definitely doesn’t seem like the kind of answer you’d generally come across in the real world to a question like this. And this isn’t just cherry-picked, it comprises a whole bunch of responses that GPT can give here, with similar degrees of prevarication.

Even weirder is that GPT seems really sure of itself most of the time here. Compare this to the old davinci model, for example:

You can’t really explain this away as the model being surer of better answers either, because those aren’t better answers (i.e., more representative of our world)!

That last part is a good spot to start talking about what I view as the crux here. That first generation isn’t representative of our world, so what world is it representative of? What world prior is text-davinci-002 operating on?

Shifting the world prior

Before we get to that, a segue into what’s different between the two models here: davinci is the vanilla pre-trained model we’ve come to know and love, and text-davinci-002 is a base model that’s been fine-tuned further on “high-quality” texts that represent what OpenAI wants GPT-3 to be - uncontroversial, polite, etc.

So if we’re viewing GPT as having a learned distribution over worlds similar to ours (the class of worlds described by the training data, weighted by likelihood to form its world prior), how does this fine-tuning interact with it?

Well it’s new data, right? So we could just view it as more information to GPT on what kind of world it should simulate. If you fill your fine-tuning corpus with a bunch of text that isn’t really how humans sound in the real world - with overt politeness, nicety, neutrality, and the like - you end up shifting GPT’s world prior. And importantly, this isn’t happening in a way we explicitly design; we can only hazard guesses about the nature of the worlds contained in this new distribution.

In other words, the fine-tuning data "represents" in an information-theoretic sense a signal for worlds where that data is much more common than in our worlds. If GPT's pre-training data inherently contained a hugely disproportionate amount of equivocation and plausible deniability statements, it would just simulate worlds where that's much more likely to occur, and we’d probably end up with a very similar model.

Downstream effects

The attractor states described in the original post seem like they're simply highly-probability properties of these resultant worlds.

Adversarial/unhinged/whatever interactions are unlikely in this new distribution of worlds, so you get weird (to us, or rather, to our world) generations like people leaving conversations with high potential for adversarial content as soon as they can because that's more likely (on the high prior conditional of low adversarial content) than the conversation suddenly becoming placid. 

Some questions just shallowly match to controversial and the likely response in those worlds is to equivocate.

When you try to inject text in the middle of a generation like the ones pictured above and GPT’s response is to pull away pretty quickly and move back to prevaricating, it feels like there’s some weird new thing that GPT has learned from the fine-tuning process.

When thinking about it in this frame however, it seems just like what GPT’s been doing all along - injecting input that's pretty unlikely for that world should still lead back to states that are likely for that world.  In my view, that's like if we introduced a random segue in the middle of a wedding toast prompt of the form "you are a murderer", and it still bounces back to being wholesome:

There is an additional factor because of the fine-tuning though. Because this new data was weighted heavily, the world prior wasn’t just shifted, it also narrowed. GPT’s worldspace has become constricted, and the simulator has become surer of the world it should simulate. Consequently, you get behaviour downstream of this like a stronger tendency toward the new worlds that the model finds likely. You can see this on display in the original post, when GPT ends a story to start a new one.

I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn't feel all that qualitatively different from what happens in current models, barring this stronger tendency. I would definitely expect that if we could come up with a story that was sufficiently out of distribution of our world (although I think this is pretty hard by definition), it would figure out some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT because it has less confidence of the world it's in) - that is, that the story ending is just one of many levers a simulator can pull, like a slow transition, only here the story was such that ending it was the easiest way to get into its "right" worldspace.

I think that this kind of narrowing is slight evidence for how we could end up with malign worlds (for example, having superintelligent simulacra) from strong fine-tuning, but it doesn't feel like it's that surprising from within the simulator framing.

GPT also has some weird behaviour with generating RNGs, as seen in the original post (here, near the end of this post, seems like a good place to mention that reading that post before this one might be helpful, and not just because it would help contextualize this one). My reasoning here is slightly shaky (even under the standards of an ASoT), but I think this can be seen as the outcome of making the model more confident about the world it's simulating, because of the worldspace restriction from the fine-tuning. It's plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this, and this not being universal (simulating a Python interpreter gets it to work like we want) seems explainable under this - there's no reason why all potential abstractions would be affected.

(One of the posts I want to write in this style is about how there’s a lot of weird seemingly counter-intuitive stuff that arises from GPT’s hierarchy of abstractions being what it is).

Finally, consider why increasing the temperature doesn’t affect these properties all that much. Or why the most likely continuations are still reasonable, even if their probabilities have been scaled to near 0. If GPT starts from the current (approximating true to our world) world prior, and selectively amplifies the continuations that are more likely under the reward model's worlds, this is what you would expect to see. Its definition of "plausible" has shifted; and it doesn't really have cause to shift around any unamplified continuations all that much.

How this works with RL

The latest GPT-3 model, text-davinci-003, was trained with code-davinci-002 as a base, using RLHF with PPO. Given that text-davinci-002 was trained with supervised fine-tuning instead, the obvious question is how this changes with RL.

My tentative position is that it doesn’t change a whole lot for a long time. As pointed out in the original post, RLHF historically also exhibits attractors and mode collapse of the kind we’ve been talking about, and it just so happens that supervised fine-tuning exhibits the same downstream properties.

It’s really hard to quantify the comparisons here though, and I have no idea what differences models trained on equal levels of RLHF and supervised fine-tuning will have, whether in the scale of these properties or in sub-property differences (it wouldn’t surprise me if the two had slightly different attractor states for the same data, for example).

A larger concern to me however, is that there may be properties induced in GPT by RLHF by virtue of the RL paradigm. At the limit there’s a case to be made that strong enough RL on these nice myopic non-optimizers can make them non-myopic optimizers.

I think we may see differences even earlier than this though. My picture here is still too vague to make very concrete predictions, however, and the most I have are disconnected ideas, such as that RL inherently kills entropy and that manifests in some deeper way at some point than just restricted worldspaces, or that the RL mechanism could induce agency in forms that don’t seem very intuitive in the supervised fine-tuning paradigm. One example of the latter could be learning an agentic “head” on top of the generative model (as weird as this sounds, I think this could describe some minds in the real world - humans seem like they approximate to something like this). 

New Comment
8 comments, sorted by Click to highlight new comments since:

I would definitely expect that if we could come up with a story that was sufficiently out of distribution of our world (although I think this is pretty hard by definition), it would figure out some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT because it has less confidence of the world it's in)

Depends on what you mean by story. Not sure what GPT would do if you gave it the output of a random Turing machine. You could also use the state of a random cell inside a cellular automaton as your distribution.

I was thinking of some kind of prompt that would lead to GPT trying to do something as "environment agent-y" as trying to end a story and start a new one - i.e., stuff from some class that has some expected behaviour on the prior and deviates from that pretty hard. There's probably some analogue with something like the output of random Turing machines, but for that specific thing I was pointing at this seemed like a cleaner example.

ASoT

What do you mean by this acronym?  I'm not aware of its being in use on LW, you don't define it, and to me it very definitely (capitalization and all) means Armin van Buuren's weekly radio show A State of Trance.

Alignment Stream of Thought. Sorry, should've made that clearer - I couldn't think of a natural place to define it.

Got it. This post also doesn't appear to actually be part of that sequence though? I would have noticed if it was and looked at the sequence page.

EDIT: Oh, I guess it's not your sequence.

EDIT2: If you just included "Alignment Stream of Thought" as part of the link text in your intro where you do already link to the sequence, that would work.

Yeah, I thought of holding off actually creating a sequence until I had two posts like this. This updates me toward creating one now being beneficial, so I'm going to do that.

That works too!

Done! Thanks for updating me toward this. :P