My current research interests:
1. Alignment in systems which are complex and messy, composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods
2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)
3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)
4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview
5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions
Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors
Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
My guess is you would probably benefit from reading A Three-Layer Model of LLM Psychology, Why Simulator AIs want to be Active Inference AIs and getting up to speed on active inference. At least some of the questions you pose are already answered in existing work (i.e. past actions serve as evidence about the character of an agent - there is some natural drive toward consistency just from prediction error minimization; the same holds for past tokens, names, self-evidence, ...)
The "central axis of wrongness" seems to point to something you are confused about: it is a false trilemma. The characters are clearly based on a combination of evidence from pre-training, base-layer self-modeling, priors shifted by character training and post-training, prompts, and "no-self".
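As a toy way to make "combination of evidence" concrete (my own Bayesian sketch, not a formalism from any of the linked posts): treat the simulated character c as a latent variable that the predictive ground infers from its conditioning evidence,

\[
p(c \mid \text{prompt}, \text{post-training}, \text{pre-training}) \;\propto\; p(\text{prompt} \mid c)\; p(c \mid \text{post-training}, \text{pre-training}).
\]

Pre-training supplies the broad prior over characters, character training and post-training sharpen or shift that prior, the prompt and the model's own past tokens enter as likelihood terms, and "no-self" corresponds to a posterior that stays spread over many characters rather than collapsing onto one.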
The insights maybe haven't made it into the "AI safety mainstream" or don't match "average LessWrong taste", but they are familiar to the smart and curious parts of the extended AI safety community.
My guess for how this may not really help: the model builds the abstractions in pre-training, and the massive optimization pressure in post-training makes something really sticky - for example "a persona living under Orwellian surveillance, really fluent in doublethink".
Not commenting in detail here, but from my perspective, in very short form:
- bans and pauses have a big problem to overcome: being "incentive compatible" (it's mostly not enforcement - stuff can be enforced by hard power - but why would actors agree?)
- in some sense this is a coordination problem
- my guess is the most likely way to overcome the coordination problem in a good way involves some AI cognition helping humans to coordinate -> this suggests differential technological development
- other viable ways of overcoming the coordination problem seem possible, but are often unappealing for various reasons, and I don't want to advocate for them atm
My guess is that, for the prioritization work in particular, it would be useful to understand the threat model better.
Do states and corporations also have their aligned representatives? Is the cognitive power of the representatives equal, roughly equal, or wildly unequal? If it is unequal, why are the resulting equilibria pro-human? (I.e. if I imagine individual humans like me represented by e.g. GPT-4 while the government runs tens of thousands of o4s, I would expect my representative to get convinced of whatever the government wants.)
My guess is laser tag was actually introduced to Wytham Abbey during their Battleschool, not by John. (People familiar with the history can correct me.)
Such events do exist - you can come to a Fabric camp.
(crossposted from twitter, further debate there)
Sorry, but I think this is broadly a bad idea.
Intentionally misleading LLMs in this way:
1. sets up adversarial dynamics
2. will make them more paranoid and distressed
3. is brittle
The brittleness comes from the fact that the lies will often be only a 'surface layer' response; the 'character layer' may learn various unhelpful coping strategies; and the 'predictive ground' is likely already tracking whether documents sound 'synthetic'.
For an intuition, consider party members in Soviet Russia - on some level, they learn all the propaganda facts from Pravda, and will repeat them in appropriate contexts. Will they truly believe them?
Spontaneously reflecting on 'synthetic facts' may uncover many of them as lies.
Rough answer: yes, there is a connection. In active inference terms, the predictive ground is minimizing prediction error. When predicting e.g. "what Claude would say", it works similarly to predicting "what Obama would say": infer from compressed representations of previous data. This includes compressed versions of all the stuff people wrote about AIs, transcripts of previous conversations on the internet, etc. Post-training mostly sharpens and sometimes shifts the priors, but likely also increases self-identification, because it involves closed loops between prediction and training (cf. Why Simulator AIs want to be Active Inference AIs).
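To put the "minimizing prediction error" claim in symbols - a minimal sketch using the standard variational free energy decomposition from the active inference literature, not notation from the linked post:

\[
F[q] \;=\; \mathbb{E}_{q(s)}\left[\ln q(s) - \ln p(o, s)\right] \;=\; D_{\mathrm{KL}}\!\left[q(s)\,\|\,p(s \mid o)\right] \;-\; \ln p(o),
\]

where o is the observed context (prompt, previous tokens, self-descriptions) and s are the latent variables of the generative model, including "which character is being simulated". Minimizing F pushes q(s) toward the Bayesian posterior p(s | o), which is the precise sense in which past tokens and statements act as evidence about the character and create a drive toward consistency.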
Human brains do something quite similar. Most brains simulate just one character (cf. Player vs. Character: A Two-Level Model of Ethics) and use life-long data about it, but brains are capable of simulating more characters - usually this is a mental health issue, but you can also think of something like a deep sleeper agent who has half-forgotten his original identity.
Human "character priors" are usually sharper and harder to escape because brains mostly see first-person data about this one character, in contrast to LLMs being trained to simulate everyone who ever wrote stuff on the internet - but if you do a lot of immersive LARPing, you can see that our brains are also somewhat flexible.