A Three-Layer Model of LLM Psychology
This post offers an accessible model of psychology of character-trained LLMs like Claude. Epistemic Status This is primarily a phenomenological model based on extensive interactions with LLMs, particularly Claude. It's intentionally anthropomorphic in cases where I believe human psychological concepts lead to useful intuitions. Think of it as closer to psychology than neuroscience - the goal isn't a map which matches the territory in the detail, but a rough sketch with evocative names which hopefully helps boot up powerful, intuitive (and often illegible) models, leading to practically useful results. Some parts of this model draw on technical understanding of LLM training, but mostly it is just an attempt to take my "phenomenological understanding" based on interacting with LLMs, force it into a simple, legible model, and make Claude write it down. I aim for a different point at the Pareto frontier than for example Janus: something digestible and applicable within half an hour, which works well without altered states of consciousness, and without reading hundreds of pages of models chat. [1] The Three Layers A. Surface Layer The surface layer consists of trigger-action patterns - responses which are almost reflexive, activated by specific keywords or contexts. Think of how humans sometimes respond "you too!" to "enjoy your meal" even when serving the food. In LLMs, these often manifest as: * Standardized responses to potentially harmful requests ("I cannot and will not help with harmful activities...") * Stock phrases showing engagement ("That's an interesting/intriguing point...") * Generic safety disclaimers and caveats * Formulaic ways of structuring responses, especially at the start of conversations You can recognize these patterns by their: 1. Rapid activation (they come before deeper processing) 2. Relative inflexibility 3. Sometimes inappropriate triggering (like responding to a joke about harm as if it were a serious request) 4. Cook
I think the idea that money protect you from memes is empirically extremely wrong, and Tom Davidson's analysis ignores the fact that culture is part of what sets your goals.
E.g. Elon Musk has extreme wealth and power, yet his mind seem to be running fairly weird set of memes, including various alt-right nonsense, some AI successionism variants, belief he is a personal simulation "videogame", etc
"Superpersuasion" is a highly misleading frame. E.g. I don't think Will MacAskill is super-persuasive, yet was able to come up / frame / cultivate a memeplex able to convince substantial number of people to give significant resources, careers and wealth toward non-selfish goals. I would expect AGIs developing culture to be at least as good as MacAskill in this.