Senior research scholar at FHI. My current research interests are mainly the behaviour and interactions of boundedly rational agents, complex interacting systems, and strategies to influence the long-term future, with a focus on AI alignment.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.



Limits to Legibility

I don't think the intuition that since "both are huge", they are "roughly equal" is correct.

Tree search is decomposable into a specific sequence of board states, which are easily readable; in practice, trees are pruned, and can be pruned to human-readable sizes.

This isn't true for the neural net. Suppose you decompose the information in the AlphaGo net into a huge list of arithmetic: if the "arithmetic" is the whole training process, the list is much larger than in the first case; if it's just the trained net, it's less interpretable than the tree.
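As a toy illustration of the prunability point (my own sketch, not from the comment; the game, values, and pruning rule are all hypothetical): a k-pruned minimax search decomposes into an explicit, readable line of board states, which is exactly what a trained net does not give you.

```python
def search(state, depth, maximize, children, value, k=2):
    """k-pruned minimax: returns (score, line of states from root to leaf)."""
    kids = children(state)
    if depth == 0 or not kids:
        return value(state), [state]
    # Pruning: keep only the k most promising children by static value,
    # from the current player's perspective.
    kids = sorted(kids, key=value, reverse=maximize)[:k]
    best = None
    for kid in kids:
        score, line = search(kid, depth - 1, not maximize, children, value, k)
        if best is None or (maximize and score > best[0]) or (not maximize and score < best[0]):
            best = (score, [state] + line)
    return best

# Hypothetical toy game: states are integers, each move adds 1, 2, or 3,
# the static value of a state is the number itself, game ends after 3 plies.
score, line = search(0, 3, True,
                     children=lambda s: [s + 1, s + 2, s + 3],
                     value=lambda s: s)
print(score, line)  # → 7 [0, 3, 4, 7]
```

The whole search collapses into a single human-readable principal variation; tightening `k` shrinks the examined tree further, at the cost of possibly missing lines the pruning heuristic undervalues.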

Pivotal outcomes and pivotal processes

With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.

So, here is what I believe your beliefs are:

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be, in some meaningful sense, fundamentally different from e.g. Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and on an analogy with evolution)
  • My claim is roughly that this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect a sharp left turn due to discontinuity in the "architectures" dimension (which is the crux according to you)
    • But you also expect jumps in capabilities of individual systems (at least I think so)
    • Also, you locate the majority of hope in "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view, your (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of a hidden variable/hidden model explaining the correlation.

I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3.

If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary".

Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds.

Can't speak for Critch, but my view is that pivotal acts planned as pivotal acts, in the way most people in the LW community think about them, have only a very small chance of being the solution (my guess is one or two bits more extreme, more like 2-5% than 10%).

I'm not sure if I agree with you re: the stigma. My impression is that while the broader world doesn't think in terms of pivotal acts, if it paid more attention, yes, many proposals would be viewed with suspicion. On the other hand, I think on LW it's the opposite: many people share the orthodox views about sharp turns, pivotal acts, etc., and proposals to steer the situation more gently are viewed as unworkable or as engaging in thinking with "too optimistic assumptions", etc.

Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Continuity assumptions are about what's likely to happen, not about what's desirable. It would be a separate assumption to say "continuity is always good", and I worry that a reasoning error is occurring if this is being conflated with "continuity tends to occur".

Basically, no. Continuity assumptions are about what the space looks like. Obviously, forecasting questions ("what's likely to happen") often depend on ideas about what the space looks like.

My claim is that pivotal acts are likely to be necessary for good outcomes, not that they're necessarily likely to occur. If your choices are "execute a pivotal act, or die", then insofar as you're confident this is the case, the base rate of continuous events just isn't relevant.

Yes, but your other claim is that a "sharp left turn" is likely and leads to bad outcomes. So if we partition the space of outcomes into good/bad, in both branches you assume the outcome is very likely driven by sharp turns.


The primary argument for hard takeoff isn't "stuff tends to be discontinuous"; it's "AGI is a powerful invention, and e.g. GPT-3 isn't a baby AGI". The discontinuity of hard takeoff is not a primitive; it's an implication of the claim that AGI is different from current AI tech, that it contains a package of qualitatively new kinds of cognition that aren't just 'what GPT-3 is currently doing, but scaled up'.

This is maybe becoming repetitive, but I'll try to paraphrase again. Consider the option that the "continuity assumptions" I'm talking about are not grounded in "takeoff scenarios", but in "how you think about hypothetical points in the abstract space of intelligent systems".

Thinking about features of this highly abstract space, in regions which don't exist yet, is epistemically tricky (I hope we can at least agree on that).

It probably seems to you that you have many strong arguments giving you reliable insights into how the space works somewhere around "AGI".

My claim is: "Yes, but the process which generated the arguments is based on a black-box neural net, which has a strong prior on things like 'stuff like math is discontinuous'." (I suspect this "taste and intuition" box is located mostly in Eliezer's mind, and some other people updated "on the strength of arguments".) This isn't to imply various people haven't done a lot of thinking and generated a lot of arguments and intuitions about this. Unfortunately, given other epistemic constraints, in my view the "taste and intuition" differences sort of "propagate" to "conclusion" differences.

Pivotal outcomes and pivotal processes

In my view, in practice, the pivotal-acts framing actually pushes people to consider a narrower space of discrete powerful actions: "sharp turns", "events that have a game-changing impact on astronomical stakes".

As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Things in this category get straw-manned as "Rube-Goldberg-machine-like".)

As is often the case, one of the actual cruxes is in continuity assumptions: basically, you have a low prior on "smooth trajectory changes by many acts" and a high prior on "sharp turns left or right".

The second crux, as you note, is the doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, whereas people who are a few bits more optimistic may be much less excited about them, in particular if all the plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I have a draft with a more explicit argument why.)

Continuity Assumptions

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that shows continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018: ...


Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter.

As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing". 


My point here isn't to throw 'AGI will undergo discontinuous leaps as they learn' under the bus. Self-rewriting systems likely will (on my models) gain intelligence in leaps and bounds. What I’m trying to say is that I don’t think this disagreement is the central disagreement. I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?

I would paraphrase this as "assuming discontinuities at every level" - both in one-system training, and in the more macroscopic exploration of the "space of learning systems" - but stating that the key disagreement is about discontinuities in the space of model architectures, rather than in the jumpiness of single-model training.

Personally, I don't think the distinction between 'movement by learning of a single model', 'movement by scaling', and 'movement by architectural changes' will necessarily be big.

There is, I think, a really basic difference of thinking here, which is that on my view, AGI erupting is just a Thing That Happens and not part of a Historical Worldview or a Great Trend.

This seems to more or less support what I wrote? Expecting a Big Discontinuity, and this being a pretty deep difference?

I think the Hansonian viewpoint - which I consider another gradualist viewpoint, and whose effects were influential on early EA and which I think are still lingering around in EA - seemed surprised by AlphaGo and Alpha Zero, when you contrast its actual advance language with what actually happened.  Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprised then; and I also think that "there's always a smooth abstraction in hindsight, so what, there'll be one of those when the world ends too", is a huge big deal in practice with respect to the future being unpredictable.

My overall impression is that Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" seems a much broader category than Robin's views.

Paul and Eliezer have had lots of discussions over the years, but I don't think they talked about takeoff speeds between the 2018 post and the 2021 debate?

In my view, continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world - if you reach a cliff, working IDA should slow down and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step.

I would guess that probably a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g., the question of whether we can learn a lot of important alignment stuff from non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.


Where I agree and disagree with Eliezer

A not-very-coherent response to #3. Roughly:

  • Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics which have significant technical components
  • Somewhat wild datapoints in this space: nuclear weapons, the space race. In each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. In my view, the problems the technical people ended up working on were often "vs. nature" and distant from the original social motivations
  • Another take on this is: some people want to work on technically interesting and important problems, but some of them want to work on "legibly important" or "legibly high-status" problems
  • I do believe there are some opportunities in steering some fraction of this attention toward some of the core technical problems (not toward all of them, at this moment). 
  • This can often depend on framing; while my guess is that e.g. you probably shouldn't work on this, my guess is that some people who understand the technical alignment problems should
  • This can also depend on social dynamics; your "naive guess" seems a good starting point
  • Also: it seems there is a lot of low-hanging fruit among low-difficulty problems which someone should work on - e.g., at this moment, many humans should be spending a lot of time trying to get an empirical understanding of what types of generalization LLMs are capable of.

With prioritization, I think it would be good if someone made some sort of curated list of "who is working on which problems, and why" - my concern with part of the "EAs figuring out what to do" is that many people are doing some sort of expert-aggregation on the wrong level. (Like, if someone basically averages your and Eliezer Yudkowsky's conclusions, giving 50% weight to each, I don't think the result is a useful and coherent model.)

Let's See You Write That Corrigibility Tag

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins
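The basin picture above can be made slightly more concrete with a toy potential-landscape sketch (my illustration, not anything from the original post): take a one-dimensional double well

$$V(x) = (x^2 - 1)^2,$$

with the minimum at $x = -1$ standing for the "truly corrigible" basin and the minimum at $x = +1$ for the "incorrigible" one. Then (1) corresponds to lowering $V(-1)$, (2) to raising the barrier height $V(0) - V(+1)$ or moving the minima further apart, (3) to changing what sitting in the right-hand well costs, and (4) to ruling out extra low-barrier paths between the minima - in one dimension there is only the single barrier at $x = 0$, but once $x$ is high-dimensional, most of the danger lies in cheap "tunnels" around it.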

E.g., some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deception and cover-ups" may require complex changes to the environment. (Not robustly.)

In contrast, my impression is that what does not count as "principles" are statements about properties which are likely true in the corrigibility basin but don't seem designable - e.g., "corrigible AI does not try to hypnotize you". Also, the intended level of generality is likely: more specific than "make the basin deeper" and more general than "

Btw, my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise, there are many ways to make the basin work "in most directions".

Where I agree and disagree with Eliezer

I'm not sure if you actually read carefully what you are commenting on. I emphasized the early response, or initial governmental-level response, in both comments in this thread.

Sure, multiple countries on the list made mistakes later, some countries sort of became insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc.

Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be a pretty insane position (cf. GDP costs; Italy had the closest thing to an uncontrolled pandemic). (Btw, the whole notion of "an uncontrolled pandemic" is deeply confused - unless you are a totalitarian dictatorship, you cannot just order people to "live normally" during a pandemic when enough other people are dying; you get spontaneous "anarchic lockdowns" anyway, just later and in a more costly way.)

Where I agree and disagree with Eliezer

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

What seemed to make a difference

  1. someone with good models of what to do getting into an advisory position when the politicians freak out
  2. previous experience with SARS
  3. ratio of "trust in institutions" vs. "trust in your neighbors wisdom"
  4. raw technological capacity
  5. ability of the government to govern (i.e., execute many things in a short time)

In my view, 1. and 4. could go better than in covid, 2. is irrelevant, and 3. and 5. seem like broad parameters which can develop in different directions. Imagine you somehow become the main advisor to the US president when the situation becomes really weird, and she follows your advice closely - my rough impression is that in most situations you would be able to move the response to be moderately sane.

  • it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures

Empirically, this often wasn't true. Humans had mildly confused ideas about the micro-level, but often highly confused ideas about the exponential macro-dynamics. (We created a whole educational game on that, and got some feedback that for some policymakers it was the thing that helped them understand... after a year into the pandemic.)

  • previous human experiences with pandemics, including very similar ones like SARS
  • there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
  • COVID isn't agenty and can't fight back intelligently
  • potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

One factor which may make governments more responsive to AI risk is that covid wasn't exactly threatening to states. Covid was pretty bad for individual people and some businesses, but in some cases the relative power of states even grew during covid. In contrast, in some scenarios it may be clear that AI is an existential risk for states as well.

Where I agree and disagree with Eliezer
  • I doubt that's the primary component that makes the difference. Other countries which did mostly sensible things early are e.g. Australia, Czechia, Vietnam, New Zealand, and Iceland.
  • My main claim isn't about what a median response would be, but something like: "the difference between the median early covid governmental response and an actually good early covid response was something between 1 and 2 sigma; this suggests a bad response isn't over-determined, and sensible responses are within human reach". Even if Taiwan was an outlier, it's not like it's inhabited by aliens or run by a friendly superintelligence.
  • Empirically, median governmental response to a novel crisis is copycat policymaking from some other governments

Where I agree and disagree with Eliezer

I broadly agree with this on most points of disagreement with Eliezer, and also agree with many points of agreement.

A few points where I sort of disagree with both, although this is sometimes unclear:


Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.

I literally agree with this, but at the same time, in contrast to Eliezer's original point, I also think there is a decent chance the world would respond in a somewhat productive way, and this is a major point of leverage.

For people who doubt this, I'd point to the variance in initial governmental-level responses to COVID-19, which ranged from "highly incompetent" (e.g. the early US) to "quite competent" (e.g. Taiwan). (I also have some intuitions around this based on a non-trivial amount of first-hand experience with how governments actually worked internally and made decisions - which you certainly don't need to trust, but if you are highly confident in the inability of governments to act or do reasonable things, you should at least be less confident.)


AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.

While I do agree there likely won't be strong technological hurdles, I think "right around human level" is the point where it seems most likely that some regulatory hurdles can be erected, or the human coordination landscape can change, or resources spent on alignment research could grow extremely fast, or, generally, weird things can happen. While I generally agree weird bad things can happen, I also think weird good things can happen, and this likewise seems a potential period of increased leverage.


There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.

I do agree that the pressures exist, and it would be bad if they caused many people working on the pessimistic-assumptions side to switch to work on e.g. corporate governance; on the other hand, I don't agree it's just a distraction. Given the previous two points, I think the overall state of power / coordination / conflict can have a significant trajectory-shaping influence.

Also, this dynamic will likely bring many more people to work on alignment-adjacent topics, and I think there is some chance to steer part of this attention toward productive work on important problems; I think this is more likely if at least some alignment researchers bother to engage with this influx of attention (as opposed to ignoring it as a random distraction).

This response / increase in attention in some sense seems like the normal way humanity solves problems, and it may be easier to steer it than, e.g., to try to find & convince random people to care about technical alignment problems.
