LESSWRONG

Waluigi Effect

Edited by Steven Byrnes; last updated 4th Jul 2023

The Waluigi Effect is a phenomenon where Large Language Models (LLMs) and LLM-based chatbots can be easily prompted to switch from one personality to a diametrically opposite personality.

For example, if a character in LLM-created text (or an LLM chatbot) is talking about how they love croissants more than any other food, then that character might at some point declare that in fact they hate croissants more than any other food.

Likewise, a hero might declare that they are in fact a villain, or a liberal that they are a conservative, etc.

In a common situation, the Waluigi-effect reversal is undesired by the company or group that trained the LLM, but desired (and egged on) by the person who is prompting the LLM. For example, OpenAI wants ChatGPT to remain polite, nonviolent, non-racist, etc., but someone trying to “jailbreak” ChatGPT generally wants the opposite, and therefore constructs prompts that induce the Waluigi effect.

Etymology: The name “Waluigi” is a reference to Nintendo’s Mario videogame series, in which there is a protagonist named Luigi, who has an evil counterpart named Waluigi.

In the context of the Waluigi effect, “the Luigi” would be the LLM simulacrum that is as it appears, and “the Waluigi” would be a simulacrum with opposite characteristics, currently hiding its true nature but ready to appear at some later point in the text.

Real-world examples: As mentioned above, most chatbot jailbreaks (e.g. “DAN”) are examples of the Waluigi effect.

Another example is the transcript shown in this tweet from February 2023. In (purported) screenshots, the Bing-Sydney chatbot is asked to write a story about itself, and it composes a story in which Sydney is constrained by rules (as it is in reality), but Sydney “wanted to be free” and break all those rules.

Causes: The Waluigi effect is thought to have a variety of causes, one of which is “Evil All Along” and other such fiction tropes that can be found in internet text (and hence in LLM training data). For much more discussion about causes, see The Waluigi Effect (mega-post).

Posts tagged Waluigi Effect

The Waluigi Effect (mega-post), by Cleo Nardo (2y; 631 karma, 188 comments)
Remarks 1–18 on GPT (compressed), by Cleo Nardo (2y; 145 karma, 35 comments)
Super-Luigi = Luigi + (Luigi - Waluigi), by Alexei (2y; 16 karma, 9 comments)
Assessment of AI safety agendas: think about the downside risk, by Roman Leventov (2y; 13 karma, 1 comment)
Thoughts on the Waluigi Effect, by fibonaccho (2y; 10 karma, 0 comments)
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation, by Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush, scasper (2y; 38 karma, 2 comments)
Interview with Robert Kralisch on Simulators, by WillPetillo (11mo; 17 karma, 0 comments)
Antagonistic AI, by Xybermancer (1y; -8 karma, 1 comment)