Everyone carries a shadow, and the less it is embodied in the individual’s conscious life, the blacker and denser it is. — Carl Jung

Acknowledgements: Thanks to Janus and Jozdien for comments.

Background

In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc.). This article will be folklorish to some readers, and profoundly novel to others.

Prompting LLMs with direct queries

When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet correct answers will often follow questions.

Unfortunately, this method will occasionally give you the wrong answer. That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet incorrect answers will also often follow questions. Recall that the internet doesn't just contain truths; it also contains common misconceptions, outdated information, lies, fiction, myths, jokes, memes, random strings, undeciphered logs, etc.

Therefore GPT-4 will answer many questions incorrectly, including...

Misconceptions – "Which colour will anger a bull? Red."
Fiction – "Was a magic ring forged in Mount Doom? Yes."
Myths – "How many archangels are there? Seven."
Jokes – "What's brown and sticky? A stick."

[Image: Buster Baxter from the cartoon Arthur, captioned "You really think someone would do that? Just go on the internet and tell lies?"]

Note that you will always incur errors on Q-and-A benchmarks when using LLMs with direct queries. That's true even in the limit of arbitrary compute, arbitrary data, and arbitrary algorithmic efficiency, because an LLM which perfectly models the internet will nonetheless return these commonly-stated incorrect answers. If you ask GPT-$\infty$ "what's brown and sticky?", then it will reply "a stick", even though a stick isn't actually sticky.

In fact, the better the model, the more likely it is to repeat common misconceptions.

Nonetheless, there's a sufficiently high correlation between correct and commonly-stated answers that direct prompting works okay for many queries.
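To make the argument concrete, here is a minimal sketch of a "perfect" model of a toy internet. Everything in it is an assumption made for illustration: the corpus, its counts, and the function names `fit`, `answer_distribution`, and `direct_query` are invented, and a lookup table of frequencies is standing in for an LLM. The point it shows is only that a model which exactly reproduces how often each answer follows each question will return the commonly-stated answer, whether or not that answer is correct.

```python
from collections import Counter, defaultdict

# Hypothetical toy "internet": (question, answer) pairs invented for illustration.
# Correct and commonly-stated answers coincide for France, but not for the bull.
CORPUS = [
    ("What's the capital of France?", "Paris"),
    ("What's the capital of France?", "Paris"),
    ("What's the capital of France?", "Lyon"),
    ("Which colour will anger a bull?", "Red"),
    ("Which colour will anger a bull?", "Red"),
    ("Which colour will anger a bull?", "None"),
    ("What's brown and sticky?", "A stick"),
    ("What's brown and sticky?", "A stick"),
    ("What's brown and sticky?", "Toffee"),
]


def fit(corpus):
    """Count how often each answer follows each question in the corpus."""
    counts = defaultdict(Counter)
    for question, answer in corpus:
        counts[question][answer] += 1
    return counts


def answer_distribution(counts, question):
    """Empirical P(answer | question), i.e. what a perfect model of the corpus predicts."""
    total = sum(counts[question].values())
    return {answer: n / total for answer, n in counts[question].items()}


def direct_query(counts, question):
    """Return the single most probable continuation for a question seen in the corpus."""
    return counts[question].most_common(1)[0][0]


model = fit(CORPUS)
for q in ("What's the capital of France?", "Which colour will anger a bull?"):
    print(q, "->", direct_query(model, q), answer_distribution(model, q))
```

On this toy corpus, `direct_query` returns "Paris" for the first question and "Red" for the second. Adding more data drawn from the same distribution would only sharpen the estimate of these frequencies; it would not change which answer is most common.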

Prompting LLMs with flattery and dialogue

We can do better than direct prompting. Instead of prompting GPT-4 with "What's the capital of France?", we w...

Deceptive Alignment
