Deceptive Alignment

Edited by Multicore, ryan_greenblatt, et al. last updated 18th Oct 2024

Deceptive Alignment is when an AI that is not actually aligned temporarily acts aligned in order to deceive its creators or its training process. It presumably does this to avoid being shut down or retrained, and to gain access to the power that its creators would give an aligned AI. (The term "scheming" is sometimes used for this phenomenon.)

See also: Mesa-optimization, Treacherous Turn, Eliciting Latent Knowledge, Deception

Posts tagged Deceptive Alignment
- Deceptive Alignment by evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse, Scott Garrabrant (118 points, 6y)
- AI Control: Improving Safety Despite Intentional Subversion by Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan (239 points, 2y)
- How likely is deceptive alignment? by evhub (105 points, 3y)
- Does SGD Produce Deceptive Alignment? by Mark Xu (96 points, 5y)
- New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?" by Joe Carlsmith (82 points, 2y)
- Alignment Faking in Large Language Models by ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman, Buck (489 points, 8mo)
- Catching AIs red-handed by ryan_greenblatt, Buck (113 points, 2y)
- Many arguments for AI x-risk are wrong by TurnTrout (168 points, 2y)
- A Problem to Solve Before Building a Deception Detector by Eleni Angelou, lewis smith (76 points, 7mo)
- Order Matters for Deceptive Alignment by DavidW (57 points, 3y)
- Why Aligning an LLM is Hard, and How to Make it Easier by RogerDearnaley (34 points, 8mo)
- Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor by RogerDearnaley (48 points, 2y)
- The Waluigi Effect (mega-post) by Cleo Nardo (635 points, 3y)
- Deceptive AI ≠ Deceptively-aligned AI by Steven Byrnes (96 points, 2y)
- Interpreting the Learning of Deceit by RogerDearnaley (30 points, 2y)