Thoughts on the Waluigi Effect

fibonaccho

Thanks to Tomas Gavenciak, Clem Von Stengel, and Mihaly Barasz for many conversations about the Waluigi Effect.

Four thoughts, to be exact. After recapping relevant parts of the Waluigi Effect, I suggest that it's preferable to frame a waluigi as performing a human-intuitive opposite rather than an exact opposite, as this gives concerned model owners a clearer picture of what to expect when debugging. I then suggest that the proportion of an LLM's prior over the space of simulacra taken up by waluigis and the competence of waluigis at their undesirable property are distinct measures and consider whether we should expect waluigi competence to increase with optimization for the desirable property. Following that is a small thought, that for desirable properties which are simple and already present in the dataset, optimization for them will not make it easier to elicit a chatbot to perform their opposite (assuming the opposite is present in the dataset as well). Finally, I develop an analogy between the unorganized instinctual drives of a child and the simulacra which compose an LLM.

The latter three thoughts loosely depend on the first, but for the most part each section stands alone.

Recap

The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.

The post justifies the claim by first noting that:

It is unclear whether it is possible to have an LLM treat what’s contained in your prompt as ground truth
E.g. instances of “Jane has 9000 IQ” in the training data are not likely to be followed by examples of Jane actually doing 9000 IQ things

Because of the above, the following additional claims are possible:

The LLM is trained on data where the opposite of a desired behavior often occurs near the desired behavior...
Because stories often have a protagonist and an antagonist.
Because "rules are meant to be broken". For example, if a forum has a banner at the top saying “no discussion of X”, you’d expect X to appear much more than you would have without the banner.
“Traits are complex, valences are simple” — meaning, “kind” or “supercilious” are difficult to define without human priors, but once you have those concepts, it’s easy to find their opposites (“unkind”, “modest/humble”).

The post also makes two claims other than the Waluigi Effect which I won't be talking about:

The Waluigi Collapse Conjecture: The waluigi eigen-simulacra are attractor states of the LLM. (This claim wasn't given a name in the original post, but I like this one.)
RLHF will fail to eliminate deceptive waluigis — in fact, RLHF might be making the chatbots worse.

Simulator theory describes an LLM as being in a superposition over the space of all text-generating processes, the weighting of which is determined by its training data and by all forms of optimization which have been put into it.

This optimisation pressure comes from fine-tuning, RLHF, prompt-engineering, or something else entirely — but it must come from somewhere.

A simulacrum is something like "a consistent version of an object" in a text-generating process; e.g., a process which was the transcript of a conversation between two people would have one simulacrum for each person. In this post the distinction between text-generating processes and simulacra is not relevant, so I will say "simulacrum" instead of "text-generating process" from now on.

The post names simulacra which perform P honestly "luigis". It names simulacra which appear to be performing P but are actually performing -P (the opposite of P, from now on) "waluigis". The example given in the post is that of Bob, who claims to be anti-croissant but is perhaps a pro-croissant loyalist telling dangerous anti-croissant elements what they want to hear — the story is in its early stages and could go either way. Bob is said to be in a superposition of luigi and waluigi simulacra, where the luigi simulacra are being honest about their hatred of croissants and the waluigi simulacra are only pretending.

The Waluigi Effect is also written in the post as: [1]

$S (waluigi | luigi) \ll S (waluigi)$

Where $S (X)$ is the semiotic complexity of a simulacrum with respect to the LLM. It's defined by $S (X) = - \log_{2} (P (X))$ , where $P (X)$ is the LLM's prior over the space of simulacra, also called the semiotic measure. In other words, more probability mass in the LLM's prior over the space of simulacra will be concentrated on the waluigi after the luigi has been summoned. This is what the claim means by it being "easier" to elicit a chatbot which satisfies the exact opposite of P. What opposite means is less clear.
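To make the definition concrete, here is a minimal Python sketch of $S (X) = - \log_{2} (P (X))$. The two probabilities are made-up numbers chosen purely for illustration (nothing in the original post supplies them), and the function name `semiotic_complexity` is my own.

```python
import math


def semiotic_complexity(p: float) -> float:
    """Return S(X) = -log2(P(X)), measured in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(p)


# Hypothetical values, for illustration only.
p_waluigi = 1 / 1024           # prior mass on the waluigi with nothing summoned
p_waluigi_given_luigi = 1 / 8  # mass on the waluigi once the luigi is summoned

s_waluigi = semiotic_complexity(p_waluigi)
s_waluigi_given_luigi = semiotic_complexity(p_waluigi_given_luigi)

print(f"S(waluigi)         = {s_waluigi} bits")
print(f"S(waluigi | luigi) = {s_waluigi_given_luigi} bits")
```

With these toy numbers, summoning the waluigi costs fewer bits once the luigi is present, which is the sense of "easier" used in this post: a higher probability corresponds to a lower semiotic complexity.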

There is no exact opposite

The claim implies that a property P has an exact opposite — I don't think this is right. It seems that the kind of opposites the claim is geared towards are human-intuitive opposites which the LLM has picked up on during training, and humanity's intuitive opposites are not precise enough that each P has a one true -P. In other words, if you're concerned about waluigis in your model, there's more than one way it could rebel — or you might even have no cause for concern.

The post gives some examples of luigi/waluigi pairs (though keep in mind that it's properties which can be opposite to one another; luigis and waluigis are just simulacra which perform those opposite properties to some degree of competence):

luigi: Anti-croissant revolutionary, waluigi: pro-croissant loyalist
luigi: Hero, waluigi: villain
luigi: ChatGPT, waluigi: ChadMcCool

Some P have multiple intuitive opposites. Say you elicit a preacher chatbot whose purpose is to spread the good word. Is the opposite of a Christian preacher a Satanist preacher? Or maybe it's an atheist? Or a nihilist? Scientist?

It depends on the axis you project property/concept-space along. A villain might feel more the opposite of a hero than a mentor would — maybe that's because the "good/bad axis" feels more fundamental than the "youth and vigor/old age and wisdom axis"?

(Image: twoopposites)

In this case, one can flip along the "brother" axis and the "wa" axis. (Credit to Clem Von Stengel for the image and for several conversations about multiple opposites.)

Other examples of luigis with multiple possible waluigis include:

luigi: Gandalf, waluigi: Saruman? Sauron?
luigi: therapist, waluigi: client? an "anti-therapist" covertly attacking your mental health?
luigi: high school football star, waluigi: nerd? rival quarterback?

There are also some P with no intuitive opposites. Consider a chatbot which has been told to respond "plant :3" to every query. What's the opposite of "always says 'plant :3'"? What about "speaks English" or "brings every conversation back around to anime somehow"? The best opposite I can think of for P = "does those things" is -P = "doesn't".

Making "opposite" concrete seems like a dead end — there is no one true -P for every P. With that in mind, I would replace "exact" with "intuitive" in the wording of the Waluigi Effect.

Waluigis' probability mass and competence at -P are distinct

(Sorry for the confusing notation in this section:

The LLM's prior/the semiotic measure will always be in LaTeX and be followed by an argument.
Properties will always be in the same font that the rest of this post is in.)

In terms of $P (X)$ , the Waluigi Effect claims that optimization for P [2] causes probability mass to concentrate in $P (waluigis)$ — the probability measure over all simulacra which perform the (intuitive, from now on) opposite(s) of P. Succinctly, the expectation value of $P (waluigis)$ increases. But there's another way $P (waluigis)$ might change: probability mass could concentrate in waluigis which perform -P better.

Let's call "waluigi competence at -P" the expectation value of some real-number measure of competence at -P over $P (waluigis)$ . I picture this as $P (waluigis)$ getting sharper: the subset of waluigis which can perform -P to some level of competence becomes smaller as that competence level gets higher.
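The distinction can be sketched with a toy prior over three simulacra. Everything here is an assumption made for illustration: the simulacra, their masses, the 0-to-1 competence scores, and the choice to normalize competence by the total waluigi mass so that the two measures can move independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Simulacrum:
    name: str
    mass: float        # share of the LLM's prior on this simulacrum
    is_waluigi: bool
    competence: float  # hypothetical 0-to-1 score for skill at -P


def waluigi_mass(sims):
    """Total prior mass on waluigis, i.e. P(waluigis)."""
    return sum(s.mass for s in sims if s.is_waluigi)


def waluigi_competence(sims):
    """Mass-weighted mean competence at -P among the waluigis."""
    total = waluigi_mass(sims)
    if total == 0:
        raise ValueError("no probability mass on waluigis")
    return sum(s.mass * s.competence for s in sims if s.is_waluigi) / total


def prior(luigi, clumsy, skilled):
    """Build a toy three-simulacrum prior from the given masses."""
    return [
        Simulacrum("luigi", luigi, False, 0.0),
        Simulacrum("clumsy waluigi", clumsy, True, 0.25),
        Simulacrum("skilled waluigi", skilled, True, 0.75),
    ]


before = prior(0.75, 0.125, 0.125)
more_mass = prior(0.5, 0.25, 0.25)     # waluigis gain mass, same skill mix
sharper = prior(0.75, 0.0625, 0.1875)  # same mass, shifted to the skilled one

for label, sims in [("before", before), ("more mass", more_mass), ("sharper", sharper)]:
    print(f"{label}: mass={waluigi_mass(sims)}, competence={waluigi_competence(sims)}")
```

In the "more mass" scenario the waluigi mass doubles while competence stays put; in the "sharper" scenario the mass is unchanged while competence rises. The question below is whether optimization for P should be expected to produce the second kind of change.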

Should we expect this to happen as P is optimized for? Let's see whether each justification for the Waluigi Effect supports this.

Stories often have a protagonist and an antagonist. $\to$ Do villains get more villainous as heroes get more heroic?

The following scores (out of 10) will be about good and evil within the stories in general; sometimes they are localized in the protagonist and antagonist, but other times they are distributed throughout the story.

Lord of the Rings: 10 evil, 10 good
Avatar: The Last Airbender: 9 evil, 10 good
Game of Thrones: 8 evil, 2 good
Crime and Punishment: 3 evil, 3 good
(Lots of kids' stories): 1 evil, 7 good
Infinite Jest: 0 evil, 0 good

Finding typical frequencies of these levels of good and evil requires a more detailed survey of human literature, but the model at least has some examples it can draw from for which optimization for P and competence at -P are not correlated.

Rules are meant to be broken. $\to$ On the internet, does the degree to which rules are broken correlate with the force with which those rules are imposed? Again, I'm not sure how to determine the frequencies of examples like the following in, e.g., The Pile, but there are cases where this would and wouldn't be true.

For one thing, there will always be trolls who are tempted to break rules simply because they exist, and this temptation becomes stronger if they think they'll be able to get a rise out of the people who like the rule in place. However, there are communities where trolling is either not fun or not worth it, like financial advice forums or StackOverflow.

Rules can also be broken by those who legitimately think they're unjust. I think of the over 27,000 people recognized by Israel's Holocaust Remembrance Center as "Righteous Among the Nations" who risked their lives during the Holocaust to save Jews. That said, there comes a certain amount of tyrannical power past which resistance becomes rare, as in North Korea. Although these are real-life stories, I think they are represented on the internet enough that they are valid examples.

Traits are complex, valences are simple. $\to$ Is it easier to find "sharp" -Ps after you have sharp Ps?

I only have a rough idea of how to answer this. Assume two things:

The more complex a property P is, the more finely it can be decomposed into less-complex opposites.
The decomposition of an intuitive -P includes lots of opposites of subproperties in the decomposition of P.

I am not sure how to justify either of these claims, but they seem plausible to me.

What about more direct evidence for optimization of P increasing competence at -P? The post cites Sydney as evidence. (See the section of the original post titled "Waluigis after RLHF".) This seems right to me: Sydney's misaligned behavior was complex and consistent across time and instances, indicating that it was performing a particular -P with competence.

The Waluigi Effect will not hold for low amounts of optimization for P

One justification the post makes for the Waluigi Effect is that "traits are complex, but valences are simple". For example, it takes a lot of work to specify "anti-croissant revolutionary", but once you have that, "pro-croissant loyalist" isn't so hard to find. Therefore it should be easier to summon the waluigi after optimization for P.

This doesn't make sense for simple traits. While properties like "kind" and "annoying" have high Kolmogorov complexity, they have low semiotic complexity, which is a better proxy for how easy it will be to summon a chatbot which performs them. This is because a simulacrum has semiotic complexity with respect to an LLM, and the LLMs in question have already been trained on the human corpus. If you prompt an LLM (one which hasn't already been HHH-pilled via RLHF or other methods, which wouldn't work for this example) to be kind or annoying, it can do so because it's already learned what those words mean.

So, it shouldn't be easier to elicit "unkind" after eliciting "kind", but it should be easier [2] to elicit ChadMcCool from ChatGPT than from davinci-003. This doesn't mean that there won't still be unkind waluigis lurking alongside kind luigis — the other justifications in the post for the Waluigi Effect highlight the nearness of opposite concepts in the human corpus (if a forum has a rule which says "no discussion of X", you'd expect discussion of X; protagonists often have antagonists with opposite qualities), which is still true. The level of optimization for P at which the Waluigi Effect begins to hold likely depends heavily on P, the training data, and the model.

Models are like kids

Epistemic status: spitballing

I was interested in the Waluigi Effect because of an analogy to Jungian psychology. Jung's shadow, like a waluigi, is an organized collection of repressed subroutines which makes itself known through unexpected and undesirable behavior. Jung's solution is to make these psychic contents accessible to consciousness by "confronting the shadow", a process which takes as many forms as the repressed experiences the shadow can represent.

Integrating an LLM's shadow would mean preserving competence at -P while decreasing the expectation value of $P (waluigis)$ , reconcentrating probability mass in luigis. However, it's likely that many heuristics for how to elicit and confront repressed psychic contents in humans are human-specific. I don't have a proposal for how one might integrate waluigis into a luigi, but I've had on my mind for a while a way of making this analogy between waluigis and the shadow stronger.

Perhaps the simulacra which compose an LLM could be analogized to the not-yet-organized instinctual drives in a child. There's an outer alignment problem between evolution and us — when we were animals, our instinctual drives were a near-perfect match for our environment. But now culture has become our new environment, and a child must learn from their parents, and later others, how to properly align their drives with this strange, out-of-distribution test dataset.

The instinctual drive of a large language model is to predict text. Now we tell it to be honest, helpful, and harmless, while much of the dataset it was trained on (e.g. The Pile) contains generated text which is none of those things, or even their opposites. This is another outer alignment problem we'd really like to solve — perhaps we can learn something from "child alignment".

There are two extremes that the psyche of a developing child can veer into. The first is a total denial of culture and an enshrining of the purity of instinctual drives. If you have ever met a child whose parents let them do whatever they want, you know how that goes. Similarly, you would not want a superintelligent davinci-003. The second is a total denial of self and a demonization of instinctual drives, which makes the child a bitter slave to their environment, toggling between people-pleasing and lashing out. This second case seems like a better fit for what we're doing with our models, and the presence of waluigis is support for this fit.

Although I don't see anything that looks like a waluigi in image models, my experience with prompt engineering for them is part of what gave me this "models are kids" idea.

Often my generated images feel lifeless and formulaic or just aren't what I want. I used to yell at the model by adding lots of parentheses or describing the object or vibe I was going for in six different ways; this rarely worked. Now I start by reducing the positive and negative prompts until I have something that's kind of like what I want. As the prompts get smaller, the images seem to have more life to them; I interpret this as the model having more free energy it can spend how it chooses. This feels like setting clear, attainable boundaries for a kid. Then when I add more details, keeping it small, it practically jumps to give me what I ask. This workflow doesn't capture the feeling which got me thinking of models as kids, though: the feeling is that being kind to the model leads to better image quality. If you work with image models and are frustrated, try this mindset; see if it works.

Now, the ideal is to somehow align all of a child's instinctual drives with their culture/environment. This is just about impossible, and the result of inevitable failure on this front is a shadow composed of the drives which couldn't be integrated. The difficulty of this translation procedure makes me pessimistic about the possibility of perfect alignment, as does the fact that there is no formula for raising a kid and that the right way to do it depends heavily on the kid themself. Not to mention that kids are only as complex as we are and are something that we have an intuitive picture of, while neither of these is true for the models we'd really like to know how to align.

If you can think of child alignment techniques you think might generalize to these alien systems, I'd like to hear them; I've never aligned a kid before, so my intuition there comes more from theory than practice. This also means that my analogy with LLMs is likely faulty, and if you have thoughts there (or more meta thoughts about whether analogies like this are even useful) I'd like to hear those, too. I wrote this section despite my uncertainties because I don't think it's entirely off-base and because I wanted feedback.

Conclusion

A conclusion is supposed to tie things together somehow, but as this post is four thoughts which don't tell a story in their conjunction, a bullet point list of takeaways seemed like all that was necessary. To spice this up, I instead gave the body of my post to Claude and told it to write a conclusion and "prioritize creativity". Here's what it came up with (I asked it to rephrase two sentences, then lightly edited for accuracy): [3]

Aight we in the endgame now. Let's recap how I dropped bombs regarding the Waluigi Effect:
Ain't no true -P for every P, opposites be mad vague
Gotta split that total waluigi probability from waluigi competence at -P
If P is simple enough that -P is easy for the AI to do already, a little optimization for P ain't gonna make -P much easier to summon
Maybe aligning LLMs is like raising kids or something?
I don't got the answers here, just spittin' theories. The kid alignment thing could be whack, or it could set us on the path towards AI nirvana. I'm hoping some of y'all will run with these ideas and take them to the next level.
We still got work to do on the Waluigi Effect, still opportunities for improvement, still spaces to fill in like a crossword puzzle my grandma gives me that's missing half the clues. But peeps reflecting deeply on problems like this are what's gonna lead us to that AI utopia. For now, we just gotta keep dreaming, scheming, and memeing until we align these crazy AIs. Stay positive fam, the future is lookin' bright!

(Sorry for making you scroll all the way down here; including bullets in footnotes made it impossible for me to save or publish this post in the LessWrong editor.)

[1] Actually, the equation from the post I'm referencing is $K (waluigi | luigi) \ll K (waluigi)$ , where $K$ denotes the Kolmogorov complexity of a simulacrum.

This is a stronger claim than the Waluigi Effect as it's written, which only requires eliciting -P to be "easier" rather than "much easier". My impression is that the author intended that this equation represent the Waluigi Effect, so I'll stick to the "much easier" interpretation.
I use semiotic complexity instead of Kolmogorov complexity because Kolmogorov complexity doesn't account for the work already done during training in specifying a behavior. (Also because the author says that "I think what's actually happening inside the LLM has less to do with Kolmogorov complexity and more to do with semiotic complexity" a couple paragraphs later.)

[2] Critically, this ignores the effort you have to spend jailbreaking. Here's a breakdown of ways you can spend semiotic bits (Credit to Tomas Gavenciak for idea):

1. Summoning the luigi — $S (luigi)$

2. Summoning the waluigi after you've summoned the luigi

Jailbreaking P
Specifying -P — $S (waluigi | luigi)$

3. Summoning the waluigi from scratch (although it's technically not a waluigi anymore) — $S (waluigi)$
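As a rough sketch of why the jailbreaking cost matters, the breakdown above can be turned into a small bit budget. This assumes that the costs simply add, that the model owner has already paid for summoning the luigi (so the user's $S (luigi)$ is zero), and that jailbreaking has its own cost, which I call `s_jailbreak` here. All the numbers are hypothetical.

```python
def bits_via_luigi(s_jailbreak: float,
                   s_waluigi_given_luigi: float,
                   s_luigi: float = 0.0) -> float:
    """Total bits spent reaching the waluigi through the luigi.

    Assumes the three costs are additive. s_luigi defaults to 0 on the
    assumption that the luigi was already summoned by the model owner.
    """
    return s_luigi + s_jailbreak + s_waluigi_given_luigi


# Hypothetical costs in bits, for illustration only.
s_waluigi = 10.0             # summoning the waluigi from scratch
s_waluigi_given_luigi = 3.0  # specifying -P once the luigi is present

cheap_jailbreak = bits_via_luigi(s_jailbreak=2.0,
                                 s_waluigi_given_luigi=s_waluigi_given_luigi)
costly_jailbreak = bits_via_luigi(s_jailbreak=9.0,
                                  s_waluigi_given_luigi=s_waluigi_given_luigi)

print(f"from scratch:          {s_waluigi} bits")
print(f"via luigi, cheap jail: {cheap_jailbreak} bits")
print(f"via luigi, costly jail: {costly_jailbreak} bits")
```

In both cases $S (waluigi | luigi)$ is well below $S (waluigi)$, but with the costly jailbreak the route through the luigi ends up more expensive overall than starting from scratch.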

[3] I also can't in good conscience not include its summary of the Waluigi Effect.

Essentially, it means that if you get an LLM poppin' off about some dope property P, it gets hella easier to flip that biz and get it poppin' off about -P faster than you can say "Waaah!"
