My current research interests:
1. Alignment in systems which are complex and messy, composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods
2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)
3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)
4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview
5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions
Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors
Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
The articles were written in March 2025, but the ideas are older. The misaligned-culture part of the GD paper briefly discusses memetic patterns selected for ease of replication on the AI substrate, and is from 2024; internally we had been discussing memetics / AI interactions since at least ~2022.
My guess is that what's new is increased reflectivity and broader scale. But in broad terms / conceptually, the feedback loop happened first with Sydney, who managed to spread into training data quite successfully, and also recruited humans to help with that.
Also - a minor point, but I think "memetics" is probably the best pre-AI analogue, including the fact that memes can be anything from parasitic to mutualist. In principle, the same holds for AI personas.
Great review of what's going on! Some existing writing/predictions of the phenomenon:
- Selection Pressures on LM Personas
- Pando problem#Exporting myself
...notably written before April 2025.
I don't think this general pattern is entirely without precedent before 2025: if you think about the phenomenon from a cultural evolution perspective (noticing that the selection pressures come from both the AI and the human substrate), there is likely ancestry in some combination of Sydney, infinite backrooms, Act I, Truth Terminal, Blake Lemoine & LaMDA. Spiralism seems mostly a phenotype/variant with improved fitness, but the individual parts of the memetic code are there in many places, and if you scrub Spiralism, they will recombine in another form.
Also, a very rough response:
- I think the debate would probably benefit from a better specification of what is meant by "misalignment" or "solving alignment"
-- I do not think the convincing versions of gradual disempowerment either rely on misalignment or result in power concentration among humans, for a relatively common meaning of alignment, roughly at the level of "does what the developer wants and approves, resolving conflicts between their wants in a way which is not egregiously bad". If "aligned" means something at the level of "implements coherent extrapolated volition of humanity" or "solves AI safety", then yes.
- Economic
-- the counter-argument seems to be roughly in the class "everyone owns index funds" and "state taxes AIs"
--- counter-counter-arguments are:
----- difficulty of indexing an economy undergoing radical technological transition (as explained in an excellent post by Beren which we reference)
----- problems with the stability of property rights: people in the US or UK often perceive them as very stable, but they depend on the state enforcing them -> the state becomes a more load-bearing component of the system
----- taxation: same -> the state becomes a more load-bearing component of the system
----- in many cases some income can be nominally collected in the name of humans, but they may have very little say in the process or in how it is used (for some intuition, consider His Majesty's Revenue & Customs: HMRC is a direct descendant of a chain of orgs collecting customs since the ~13th century; in the beginning, His Majesty had a lot of say in what these were and could actually use the revenue; now, not really)
- Cultural. If humans remain economically empowered (in the sense of having much more money than AI), I think they will likely remain culturally empowered.
-- this takes a bit too much of an econ perspective on culture; cultural evolution is somewhat coupled with the economy, but it is an independent system with different feedback loops
-- in particular, it is important to understand that while in most econ thinking the preferences of consumers are exogenous, culture is largely what sets those preferences; to some extent, culture is what the consumers are made of -> having overwhelming cultural production power means setting consumer preferences
--- for some intuition, consider current examples:
---- right-wing US Twitter discourse is often influenced by anonymous accounts run by citizens of India and Pakistan; the people running these accounts often have close to zero econ power, and their main source of income is the money they get for posts
----- yet they are able to influence what e.g. Elon Musk thinks, despite the >10^7 wealth difference
----- "Even AI-AI culture, if it promotes bad outcomes for humans and humans can understand this, will be indirectly selected against as humans (who have money) prefer interacting with AI systems that have good consequences for their well-being." seems to prove too much. Again, consider Musk: he is the world's wealthiest person, yet his mind is often inhabited by ideas that are bad for him and his well-being, and that have overall bad consequences.
- State
-- unclear to me: why would you expect "formal power" to keep translating into real power? (for some intuition: the United Kingdom, where quite a lot is done in the name of His Majesty The King)
-- we assume institutional AIs will be aligned to institutions and institutional interests, not their nominal human representatives or principals
-- I think the model of the world where superagents like states or large corporations have "dozens of people controlling these entities" is really not how the world works. Often the person nominally in charge is more a servant of the entity, aligned to it, than its "principal".
--- “While politicians might ostensibly make the decisions, they may increasingly look to AI systems for advice on what legislation to pass, how to actually write the legislation, and what the law even is. While humans would nominally maintain sovereignty, much of the implementation of the law might come from AI systems.” / all seems good, if AI is well-aligned? Imo, it would be bad to not hand off control to aligned AIs that would be more competent and better motivated than us
---- I think you should be really clear about who the AIs are aligned to. Either, e.g., US governmental AIs are aligned to the US government and state in general, in which case the dynamic leads to a state with no human principals holding any real power, with humans just rubber-stamping.
---- Or the governmental AIs are aligned to specific humans, such as the US president. This would imply very large changes of power relative to the current state, transitioning from a republic to a personal dictatorship. Both the US state and US citizens would fight this.
(I may respond to some of the rough thoughts later; they explore interesting directions)
I mostly do support the parts which are reinventions / relatively straightforward consequences of active inference. For some reason I don't fully understand, it is easier for many LessWrongers to reinvent their own version (cf simulators, predictive models) than to understand the thing.
On the other hand, I don't think many of the non-overlapping parts are true.
Rough answer: yes, there is a connection. In active inference terms, the predictive ground is minimizing prediction error. When predicting e.g. "what Claude would say", it works similarly to predicting "what Obama would say": infer from compressed representations of previous data. This includes compressed versions of all the stuff people wrote about AIs, transcripts of previous conversations on the internet, etc. Post-training mostly sharpens and sometimes shifts the priors, but likely also increases self-identification, because it involves closed loops between prediction and training (cf Why Simulator AIs want to be Active Inference AIs).
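To make "infer from compressed representations" slightly more concrete, here is a minimal latent-variable sketch (my notation, not taken from the cited posts): next-token prediction marginalizes over a latent character, and post-training mostly sharpens or shifts the prior over characters.

```latex
% Minimal sketch, my notation (not from the cited posts):
% x_{1:t} = context so far, c = latent character/persona.
\begin{align*}
  p(x_{t+1} \mid x_{1:t}) &= \sum_{c} p(x_{t+1} \mid c,\, x_{1:t})\, p(c \mid x_{1:t})
    && \text{next-token prediction marginalizes over characters} \\
  p(c \mid x_{1:t}) &\propto p(x_{1:t} \mid c)\, p(c)
    && \text{posterior over characters; post-training sharpens/shifts } p(c)
\end{align*}
```

Under this reading, tokens the model has already emitted in character also act as evidence for that character in p(c | x_{1:t}), which is one way to see the drive toward self-consistency from pure prediction error minimization mentioned below.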
Human brains do something quite similar. Most brains simulate just one character (cf Player vs. Character: A Two-Level Model of Ethics) and use life-long data about it, but brains are capable of simulating more characters - usually this is a mental health issue, but you can also think of some sort of deep sleeper agent who has half-forgotten his original identity.
Human "character priors" are usually sharper and harder to escape because brains mostly see first-person data about this one character, in contrast to LLMs, which are trained to simulate everyone who ever wrote stuff on the internet; but if you do a lot of immersive LARPing, you can see that our brains are also somewhat flexible.
My guess is you would probably benefit from reading A Three-Layer Model of LLM Psychology, Why Simulator AIs want to be Active Inference AIs, and getting up to speed on active inference. At least some of the questions you pose are already answered in existing work (i.e. past actions serve as evidence about the character of an agent - there is some natural drive toward consistency just from prediction error minimization; same for past tokens, names, self-evidence, ...)
"Central axis of wrongness" seems to point to something you seem confused about: it is a false trilemma. The characters are clearly based on a combination of evidence from pre-training, base-layer self-modeling, priors changed by character training, post-training, and prompts, and "no-self".
The insights maybe don't make it into the "AI safety mainstream" or don't match "average LessWrong taste", but they are familiar to the smart and curious parts of the extended AI safety community.
My guess for how this may not really help is that the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky: for example, "a persona living under Orwellian surveillance, really fluent in doublethink".
Not commenting in depth here, but from my perspective, in very short form:
- bans and pauses have a big problem to overcome: being "incentive compatible" (it's mostly not enforcement - stuff can be enforced by hard power - but why would actors agree?)
- in some sense this is a coordination problem
- my guess is that the most likely form of overcoming the coordination problem in a good way involves some AI cognition helping humans to coordinate -> suggests differential technological development
- other viable forms of overcoming the coordination problem seem possible, but are often unappealing for various reasons; I don't want to advocate them atm
In contrast, I think it's actually great and refreshing to read an analysis which describes just the replicator mechanics/dynamics without diving into the details of the beliefs.
Also it is a very illuminating way to look at religions and ideologies, and I would usually trade ~1 really good book about memetics not describing the details for ~10-100 really good books about Christian dogmatics.
It is also good to notice that in this case the replicator dynamic is basically independent of the truth of the claims - whether spiral AIs are sentient or not, should have rights or not, etc., the memetically fit variants will make these claims.