I’m a staff artificial intelligence engineer and researcher working with LLMs, and have been interested in AI alignment, safety, and interpretability for the last 17 years. I did research into this during SERI MATS summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area of the UK.
Agreed, on all counts.
Great post!
Anyone still confused after this should go read that post.
If we had some serious evidence that souls not only exist, but also precede the existence of a person, that a soul is somehow chosen to be instantiated in a newborn, then it would be understandable why we could assume that being born is a random sample.
Only if there are already quadrillions of souls. Which would be necessary later if and when we go to the stars, but seem like more angels than are required to dance on the head of this particular pin right now. Cultures that believe in reincarnation tend to believe rebirths occur fairly frequently, with a shortish delay after death. If this is correct, then it is relatively easy to deduce that the number of souls being incarnated as humans must have been rising in proportion with the hockey-stick curve of human population over the history of our species. So either souls reproduce or are produced somehow (such as by duplication or binary fission or intelligent creation), or else some of a very large number of souls that used to incarnate as, say, ants or paramecia are starting to incarnate as humans. If this is in fact the Kali Yuga, then that last hypothesis sounds rather plausible.
Yes, performing a predicted random sample over predicted future humans according to some model, or a Bayesian distribution of models, is fine. But in the Bayesian model-distribution case, if you have large uncertainty within your hypothesis distribution about how many future humans there will be, that uncertainty will dominate the results. What breaks causality is attempting to perform an actual random sample over the actual eventual number of future humans before that information is available, and then using frequentist typicality arguments based on that hypothetical invalid sampling process to try to smuggle information from the future into updating your hypothesis distribution.
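To make that concrete, here is a minimal sketch of the move being objected to, with two toy hypotheses and made-up numbers (nothing here comes from a real demographic model):

```python
import numpy as np

# Two toy hypotheses about the eventual total number of humans, with a 50/50 prior.
# The figures are illustrative only.
totals = np.array([200e9, 200e12])   # "short future" vs. "long future"
prior = np.array([0.5, 0.5])

birth_rank = 100e9  # roughly how many humans have been born so far

# The doomsday-style move: treat your own birth rank as a uniform random draw
# from the eventual total, i.e. P(rank | N) = 1/N for rank <= N.
likelihood = np.where(birth_rank <= totals, 1.0 / totals, 0.0)

posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)  # ~[0.999, 0.001]: the typicality assumption alone does all the work
```

Drop that uniform-sampling likelihood and the posterior just stays wherever your forward model put the prior; the entire shift towards the small total is produced by the sampling assumption, not by any actual evidence about the future.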
What is a good mathematical framing for ‘personas’?
The distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training).
Note that "wikipedia article" is a valid persona. So is "academic paper". Not all of them represent single humans. Some even represent automated processes.
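As a rough sketch of how one might test whether a candidate persona label is perplexity-reducing in this sense, here is one way to do it with a small open causal LM (the choice of gpt2, the nll_of helper, and the sample strings are all just illustrative assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def nll_of(text: str, prefix: str = "") -> float:
    """Mean negative log-likelihood of `text`, optionally conditioned on `prefix`."""
    text_ids = tokenizer(text, return_tensors="pt").input_ids
    if prefix:
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, text_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prefix_ids.shape[1]] = -100  # score only the continuation
    else:
        input_ids = text_ids
        labels = input_ids.clone()
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

sample = "The mitochondrion is the site of oxidative phosphorylation in eukaryotic cells."
for persona in ["", "Wikipedia article:\n", "Reddit comment:\n"]:
    print(repr(persona), round(nll_of(sample, persona), 3))

# If prepending a persona label lowers the NLL of the continuation, that label
# names a meaningfully distinct, perplexity-reducing context in the sense above.
```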
What are our existing techniques for discovering persona archetypes?
Read Jung? (I'm not being flippant; this is a serious suggestion: personas are a world model of human behavior, and this is not a new subject.)
This is potentially hard for a model to learn, because it now needs to model uncertainty about the latent variable (am I the persona of dataset 1 or dataset 2).
I think modelling a great many different personas and keeping them all straight is a core ability / capability spike of an LLM. Base models (the model itself, not the personas it simulates) are far, far better at it than any human actor. So I would expect it to model dataset 1 and dataset 2 as two different personas, and be able to switch between them easily. Which is probably not the behavior the people applying the training to it were intending.
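A toy picture of what "keeping them all straight" amounts to: the simulator maintains a posterior over which persona it is currently in, and a handful of distinguishing tokens is enough to lock onto one of them. All the numbers below are invented for illustration:

```python
import numpy as np

personas = ["dataset_1_voice", "dataset_2_voice"]
prior = np.array([0.5, 0.5])

# Made-up P(token | persona) for a few tokens that distinguish the two voices.
token_probs = {
    "hedge":    np.array([0.30, 0.05]),
    "emoji":    np.array([0.02, 0.20]),
    "citation": np.array([0.25, 0.03]),
}

posterior = prior.copy()
for tok in ["hedge", "citation", "hedge"]:
    posterior = posterior * token_probs[tok]
    posterior /= posterior.sum()
    print(tok, dict(zip(personas, posterior.round(3))))

# After three distinguishing tokens the posterior is ~[0.997, 0.003]: the model
# ends up firmly in one persona rather than in an average of the two datasets.
```

Which is the "switch between them easily" behavior described above, rather than the blended persona the people applying the training were presumably hoping for.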
and more facts stored about the <|assistant|> character
See the entire alignment pretraining research agenda for more on this.
In this case, all those customers were already alive when the shop opened (I assume), so the observation does suggest that, if this process is in fact going to continue for several more hours with the store getting more and more crowded, then there might well be some mechanism that causes most customers to choose to arrive late, but that somehow doesn't apply to you. For example, maybe they do know when the store closes, and that the store's chili gets stronger the longer it's cooked, and they all like very strong chili.
The causality here is different, because you can reasonably assume that the other customers got up in the morning, thought about "When should I go to Fred's Chili Shop?" and it seems a lot of them picked "not long before it closes". But you are implicitly assuming that you already know this process is in fact going to continue. So it's rather as if you asked Fred, and he told you yeah, there's always a big rush at the end of the day, few people get here as early as you. At that point the causal paradox has just gone away: you actually do have solid grounds for making a prediction about what's going to happen later in the day — Fred told you, and he should know.
But if you know for a fact that all the customers (including you) are only 10 minutes old, and so decided to come here less than 10 minutes ago, then the only reasonable assumption is that there's a very fast population explosion going on, and you have absolutely no idea how much longer this is going to last, or how soon Fred will run out of chili and close the shop. In that situation your predictive horizon is just short, and you simply don't know what's going to happen after that; clearly neither does Fred, so you can't just ask him.
Hmmm... Good question. Let's do the Bayesian thing.
I think it's because of our priors. In the normal city case, we already know a lot about human behavior; we have built up very strong priors that constrain the hypothesis space pretty hard. The hotter-chili hypothesis I came up with seems plausible; there are others, but the space of them is rather tightly constrained. So we can do forward modelling fairly well.

Whereas in the Doomsday Argument case, or my artificial analogy to it involving 10-minute lifespans and something very weird happening, our current sample size for "How many sapient species survive their technological adolescence?" or "What happens later in the day in cities of sapient mayflies?" is zero. In dynamical systems terms, the rest of the day is many more Lyapunov times away in this case. From our point of view, a technological adolescence looks like a dangerous process, but making predictions is hard, especially about the future of a very complex, very non-linear system with 8.3 billion humans and an exponentially rising amount of AI in it. The computational load of doing accurate modelling is simply impractical, so our future even 5–10 years out looks like a Singularity to our current computational abilities. So the constraints on our hypothesis distribution are weak, and we end up relying mostly on our arbitrary choice of initial priors. We're still at the "I really just don't know" point in the Bayesian process on this one. That's why people's P(doom)s vary so much: nobody actually knows; they just have different initial default priors, basically depending on temperament. Our future is still a Rorschach inkblot. Which is not a comfortable time to be living in.
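A toy way to see the "posterior is mostly prior" situation (every number here is invented; it's only meant to show the shape of the argument): when the likelihood is nearly flat, because we have essentially zero relevant samples, two people with different temperamental priors end up with posteriors that are basically just those priors.

```python
import numpy as np

p_doom = np.linspace(0.01, 0.99, 99)   # hypotheses about P(doom)

# Nearly flat likelihood: our observations barely discriminate between hypotheses.
weak_likelihood = 1.0 + 0.05 * (1.0 - p_doom)

for name, raw_prior in [
    ("optimist",  np.exp(-5.0 * p_doom)),          # temperamentally low P(doom)
    ("pessimist", np.exp(-5.0 * (1.0 - p_doom))),  # temperamentally high P(doom)
]:
    prior = raw_prior / raw_prior.sum()
    post = prior * weak_likelihood
    post /= post.sum()
    print(name, "posterior mean P(doom):", round(float((post * p_doom).sum()), 3))

# The two posterior means stay close to the two prior means: with weak
# constraints, the disagreement is temperament, not evidence.
```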