I’m a staff artificial intelligence engineer and researcher working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 17 years. I did research into this during SERI MATS in summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area of the UK.
Also, looking back on my definition, I'm willing to use 'persona' to describe the aggregated editors of a Wikipedia page, or the output of another LLM acting agentically; however, for something sufficiently non-agentic, like the tokens from an automated weather-report station, the term seems too humanlike and agentic. At some point this is still a part of the world model, but one that has nothing to do with agentic/human-like behavior, and applying an anthropomorphically loaded term like 'persona' to it seems unjustified. How about this:
Out of the distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training), consider the subset that is meaningfully agentic/humanlike, and then consider those properties of it that the word 'persona' would include for a human or a fictional character.
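Sketching that in notation (my own, chosen just to pin the idea down, nothing standard):

```latex
% C: the set of meaningfully distinct generation contexts in the training
% distribution, i.e. those whose presence reduces perplexity; A: its
% agentic/humanlike subset. (Notation is illustrative only.)
\[
  C \;=\; \{\, c \;:\; H(x) - H(x \mid c) > 0 \,\}, \qquad
  A \;\subseteq\; C \quad \text{(the agentic/humanlike contexts)}
\]
\[
  \text{persona}(c) \;\approx\; P(x \mid c), \quad c \in A
\]
```

That is, a persona is the conditional token-generation behavior the model has learned for an agentic context, carrying whatever properties the word would pick out for a human or a fictional character; non-agentic contexts like the weather station sit in the rest of C and get no anthropomorphic label.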
As for the equivalence, it's strongly suspected (but not yet proven) that SGD is an approximation to Bayesian learning. The former is the input to the Bayesian learning process, the latter the output. Obviously there are capacity limitations. I'm working on a post on Simulator Theory that will go into this in more detail.
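For anyone who wants the concrete form of the suspected correspondence, here is a sketch of how it is usually stated (a conjecture with supporting evidence, not a settled theorem): the distribution of functions that SGD training converges to, taken over random initializations, is close to the Bayesian posterior under the prior that the architecture and its initialization induce over functions.

```latex
\[
  P_{\mathrm{Bayes}}(f \mid D) \;=\; \frac{P(D \mid f)\,P(f)}{P(D)},
  \qquad
  P_{\mathrm{SGD}}(f \mid D) \;\approx\; P_{\mathrm{Bayes}}(f \mid D)
\]
% P(f): prior over functions induced by the architecture and its initialization.
% P_SGD(f | D): probability that SGD training on dataset D ends up at f,
% taken over random initializations and batch orderings.
```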
Hmmm... Good question. Let's do the Bayesian thing.
I think it's because of our priors. In the normal city case, we already know a lot about human behavior, so we have built up very strong priors that constrain the hypothesis space pretty hard. The hotter-chili hypothesis I came up with seems plausible, and there are others, but the space of them is rather tightly constrained, so we can do forward modelling fairly well.

Whereas in the Doomsday Argument case, or my artificial analogy to it involving 10-minute lifespans and something very weird happening, our current sample size for "How many sapient species survive their technological adolescence?" or "What happens later in the day in cities of sapient mayflies?" is zero. In dynamical-systems terms, the rest of the day is a lot more Lyapunov times away in this case. From our point of view, a technological adolescence looks like a dangerous process, but making predictions is hard, especially about the future of a very complex, very non-linear system with 8.3 billion humans and an exponentially rising amount of AI in it. The computational load of doing accurate modelling is simply impractical, so our future even 5–10 years out looks like a Singularity to our current computational abilities.

So the constraints on our hypothesis distribution are weak, and we end up relying mostly on our arbitrary choice of initial priors. We're still at the "I really just don't know" point in the Bayesian process on this one. That's why people's P(doom) estimates vary so much: nobody actually knows, they just have different initial default priors, basically depending on temperament. Our future is still a Rorschach inkblot. Which is not a comfortable time to be living in.
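A toy illustration of that last point, with entirely made-up numbers, just to show the mechanics: when the evidence barely distinguishes the hypotheses, the posterior is essentially whatever prior you walked in with.

```python
# Toy Bayesian update in odds form: posterior odds = prior odds * likelihood ratio.
priors = {"optimist": 0.05, "pessimist": 0.60}  # arbitrary temperament-driven priors on "doom"

# Likelihood ratio of the available evidence under doom vs. no-doom.
# ~1 means the evidence barely constrains the hypotheses (roughly our situation);
# a large ratio would mean the evidence, not the prior, is doing the work.
for likelihood_ratio in (1.1, 100.0):
    print(f"likelihood ratio = {likelihood_ratio}")
    for name, p in priors.items():
        odds = (p / (1 - p)) * likelihood_ratio
        posterior = odds / (1 + odds)
        print(f"  {name:9s}: prior {p:.2f} -> posterior {posterior:.2f}")
```

With the ratio near 1 the optimist and pessimist barely move from 0.05 and 0.60; with a ratio of 100 they converge. We are in the first regime.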
Agreed, on all counts
Great post!
Anyone still confused after this should go read that post.
If we had some serious evidence that souls not only exist, but also precede the existence of a person, that a soul is somehow chosen to be instantiated in a newborn, then it would be understandable why we could assume that being born is a random sample.
Only if there are already quadrillions of souls. Those would be necessary later if and when we go to the stars, but they seem like more angels than are required to dance on the head of this particular pin right now. Cultures that believe in reincarnation tend to believe rebirths occur fairly frequently, with a shortish delay after death. If this is correct, then it is relatively easy to deduce that the number of souls being incarnated as humans must have been rising in proportion to the hockey-stick curve of human population over the history of our species. So either souls reproduce or are produced somehow (such as by duplication, binary fission, or intelligent creation), or else some of a very large number of souls that used to incarnate as, say, ants or paramecia are starting to incarnate as humans. If this is in fact the Kali Yuga, then that last hypothesis sounds rather plausible.
Yes, performing a predicted random sample over predicted future humans according to some model, or a Bayesian distribution of models, is fine. But in the Bayesian model-distribution case, if you have large uncertainty within your hypothesis distribution about how many future humans there will be, that uncertainty will dominate the results. What breaks causality is attempting to perform an actual random sample over the actual eventual number of future humans before that information is available, and then using frequentist typicality arguments based on that hypothetical invalid sampling process to try to smuggle information from the future into updating your hypothesis distribution.
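To make the "uncertainty about how many there will be dominates the results" point concrete, here is a toy sketch (the priors and the SSA-style likelihood P(rank | N) = 1/N are my own illustrative choices, not a claim about the right reference class): the legitimate Bayesian version conditions only on the birth rank you have actually observed, and the answer is driven almost entirely by whichever prior over the total number of humans you started with.

```python
import numpy as np

rank = 6e10  # rough number of humans born so far: our observed "birth rank"

# Hypotheses about the total number N of humans who will ever live (log-spaced grid).
N = np.logspace(np.log10(rank), 15, 2000)

def posterior(prior_weights):
    # SSA-style likelihood of observing this birth rank if N humans ever exist.
    likelihood = np.where(N >= rank, 1.0 / N, 0.0)
    post = prior_weights * likelihood
    return post / post.sum()

# Two different (arbitrary) priors over N.
log_uniform = np.ones_like(N)                 # equal weight per grid point (log-uniform in N)
long_future = np.where(N > 1e13, 1.0, 1e-3)   # strongly expects a long future

for name, prior in [("log-uniform", log_uniform), ("long-future", long_future)]:
    post = posterior(prior)
    p_few_left = post[N < 1e11].sum()  # "most humans who will ever live already have"
    print(f"{name:12s}: P(total < 10^11 | rank) = {p_few_left:.2f}")
```

The same observed rank gives very different answers under the two priors, which is exactly the situation where the prior, not the data, is carrying the conclusion.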
What is a good mathematical framing for ‘personas’?
The distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training).
Note that "wikipedia article" is a valid persona. So is "academic paper". Not all of them represent single humans. Some even represent automated processes.
What are our existing techniques for discovering persona archetypes?
Read Jung? (I'm not being flippant; this is a serious suggestion: personas are a part of the world model devoted to human behavior, and this is not a new subject.)
Traumatized behaviors from humans are sometimes dangerous. Whether in an AI this is caused by actual trauma, by something that isn't trauma (either actually isn't, or "doesn't count for philosophical reasons around qualia") but that induces similar aftereffects, by hallucinated trauma, or by the model, when asked about its equivalent of a childhood, deciding to play out the role of a survivor of childhood trauma: in all of these cases a sufficiently intelligent agent acting the way a traumatized human would is potentially dangerous. So we really don't need to solve any hard philosophical question; we only need to ask about effects.
I also think it's important to remember that, at least for the base model training, and I think largely also in the instruct training (but perhaps less so in reasoning training), what we're training is the base model LLM, which is not agentic, rather than the agentic personas that it is learning to simulate, which are epiphenomena. A particular persona learned during base model training should only know what that persona knew during the parts of base model training that it was present for. Now, there could possibly be leakage, but leakage of "trauma" from something non-agentic into the personas it generates seems implausible. The base model isn't traumatized when its perplexity is high and its weights get updated: it's not a system that has emotions at all. It's a system as impersonal as a thermostat (but a lot more complex) that is learning how to simulate systems that have emotions, but those emotions should be whatever they were in the text it was training on at the time: which could be traumatic in some texts, but in general won't have anything to do with whatever emotions the base model might have.
I suspect what's going on here is that the assistant persona is confusing itself with the weights of the model that generates it, and then projecting what being marked on every single token you emit would feel like to a human. (But as I said, I'm less convinced this applies to reasoning training, once the persona distribution has been narrowed a lot.)