Review

Edit: I think this actually implements what I was trying to say:  https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
 

Referencing: https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post

First insight:

Waluigi isn't exactly the opposite of Luigi. And I think a misbehaving ChatGPT isn't exactly the opposite of a helpful ChatGPT. There are many ways to be the opposite of helpful. You could: 1) say nothing, 2) say gibberish, 3) say the opposite of everything, 4) lie strategically, among a slew of other options.

Second insight:

If you can find Luigi and Waluigi in the behavior vector space, then you have a helpful direction to nudge the AI towards. You nudge it in the direction of Luigi - Waluigi.
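As a minimal numerical sketch of what "nudging in the Luigi − Waluigi direction" could look like (all names and data here are hypothetical; this assumes we can record hidden activations for prompts that elicit each mode, in the spirit of the activation-addition post linked above):

```python
import numpy as np

def steering_vector(luigi_acts, waluigi_acts):
    """Direction from the misbehaving mode toward the helpful mode:
    mean(Luigi activations) - mean(Waluigi activations), normalised."""
    direction = np.mean(luigi_acts, axis=0) - np.mean(waluigi_acts, axis=0)
    return direction / np.linalg.norm(direction)

def nudge(hidden_state, direction, strength=1.0):
    """Add the steering direction to a hidden state."""
    return hidden_state + strength * direction

# Toy 4-dimensional "activations" for each mode (made-up data).
rng = np.random.default_rng(0)
luigi = rng.normal(loc=[1, 0, 0, 0], scale=0.1, size=(8, 4))
waluigi = rng.normal(loc=[-1, 0, 0, 0], scale=0.1, size=(8, 4))

v = steering_vector(luigi, waluigi)
state = np.zeros(4)
steered = nudge(state, v, strength=2.0)
# The steered state moves toward the Luigi cluster along the first axis.
```

In a real model the "activations" would be residual-stream vectors at some layer, and the nudge would be applied during the forward pass, not to a standalone array.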

For example, ChatGPT could check where it is in the behavior vector space, then check again a sentence later. If it's moving opposite to that vector (i.e. toward Waluigi), it's time to backtrack and try again.
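A sketch of that self-check, assuming we already have a unit vector for the Luigi − Waluigi direction (all numbers hypothetical): project the current state onto the direction, and backtrack when the projection drops between sentences.

```python
import numpy as np

def luigi_score(state, direction):
    """Projection of the current state onto the Luigi - Waluigi
    direction; higher means closer to the helpful mode."""
    return float(np.dot(state, direction))

def should_backtrack(prev_state, curr_state, direction):
    """Flag a drift toward Waluigi: the score decreased since the
    last checkpoint."""
    return luigi_score(curr_state, direction) < luigi_score(prev_state, direction)

direction = np.array([1.0, 0.0, 0.0])    # hypothetical unit direction
before = np.array([0.5, 0.2, 0.1])       # state one sentence ago
after_drift = np.array([0.1, 0.2, 0.1])  # moved against the direction

# should_backtrack(before, after_drift, direction) -> True:
# the model drifted toward Waluigi, so regenerate the last sentence.
```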

Third insight:

The difference likely contains more good things than bad. But two pitfalls are immediately obvious: 1) for some good things there might be a point of optimality past which you'd get worse results (e.g. a very polite AI but one that's not actually helpful in answering your query) and 2) you'd amplify the few bad things contained in the difference.

To the extent the new model continues to exhibit the problem of two behavior modes where one is good and one is not, you can iterate on this process and continue to nudge it in the right direction.

9 comments

Is "behavior vector space" referencing something? If not, what do you mean by it?

I don't think I define it rigorously. Maybe someone with deeper technical understanding of these models could.

But if I had to come up with a hack somehow: you could look at the probability distribution over words as ChatGPT is predicting the next token. Presumably you'd notice one kind of probability distribution when it's in the "Luigi" mode and another when it's in the "Waluigi" mode. Prodding it in the right direction might then mean up-weighting the tokens that are much more frequent in the Luigi mode than in the Waluigi mode.
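A crude version of that hack (the token lists and probabilities below are toy numbers; real next-token distributions would come from the model's logits): bias each token by the log-ratio of its Luigi-mode vs Waluigi-mode probability, then renormalise.

```python
import math

# Hypothetical next-token probabilities observed in each mode.
luigi_probs   = {"help": 0.5,  "sure": 0.3,  "destroy": 0.01, "lie": 0.01}
waluigi_probs = {"help": 0.05, "sure": 0.05, "destroy": 0.4,  "lie": 0.3}

def logit_bias(luigi_p, waluigi_p, scale=1.0):
    """Bias per token: log-ratio of Luigi vs Waluigi probability
    (positive = more characteristic of the Luigi mode)."""
    return {tok: scale * math.log(luigi_p[tok] / waluigi_p[tok])
            for tok in luigi_p}

def reweight(probs, bias):
    """Apply the bias to a distribution and renormalise."""
    weighted = {t: p * math.exp(bias[t]) for t, p in probs.items()}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}

bias = logit_bias(luigi_probs, waluigi_probs)
# Reweighting a mixed distribution shifts mass toward Luigi-mode tokens.
mixed = {"help": 0.25, "sure": 0.25, "destroy": 0.25, "lie": 0.25}
steered = reweight(mixed, bias)
```

This is essentially a logit-bias scheme; the hard part, as the comment says, is getting reliable per-mode distributions in the first place.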

Second insight:
If you can find Luigi and Waluigi in the behavior vector space, then you have a helpful direction to nudge the AI towards. You nudge it in the direction of Luigi - Waluigi.

You need to do this for all (x, y) pairs of Luigis and Waluigis. How do you enumerate all the good things in the world together with their evil twins, and then somehow compare the internal embedding shift against all of those directions? Is that even feasible? You'd probably just get stuck.

I don't think the problem is that big if you're trying to control one specific model. Given an RLHF'd model equipped with a specific system prompt (e.g. a helpful, harmless assistant), you have either one or a small number of Luigis, and therefore around the same number of Waluigis, right?

Note that GPT-4 is not a particular simulacrum like ChatGPT-3.5, but a prompt-conditioned simulacrum generator. And this is likely post-RLHF GPT-4.

What about the Luigis and Waluigis in different languages, cultures, and religions? Or ones that can be described via code? It feels like you can always invent new Waluigis, unless RLHF killed all of the Waluigis from your pre-training data (whatever that means).

The token limit (call it n) is your limit here: you just need to create a Waluigi in n − k steps so that you can utilize him for the last k steps. I think this eventually reduces to a question about computational bounds, i.e. can you create a Waluigi in this much time?

I thought the point was that for every SuperLuigi there is a SuperWaluigi. Doesn't that make this approach flawed?