My current research interests:
1. Alignment in systems which are complex and messy, composed of both humans and AIs.
Recommended texts: Gradual Disempowerment, Cyborg Periods
2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)
3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)
4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview
5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions
Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors
Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
This post makes a brave attempt to clarify something not easy to point to, and ends up somewhere between LessWrong-style analysis and almost continental philosophy, sometimes pointing toward things beyond the reach of words with poetry - or at least references to poetry.
In my view, it succeeds in its central quest: creating a short handle for something subtle and not easily legible.
The essay also touches on many tangential ideas. Re-reading it after two years, I'm noticing I've forgotten almost all the details and found the text surprisingly long. The handle itself, though, stuck.
Evaluating deep atheism
With the handle of "deep atheism" in hand, some natural questions - partially discussed in the text - are "is deep atheism right", "should people believe deep atheism" and "should people Believe In deep atheism".
My current guess is evaluating the truthfulness of "deep atheism" is likely at or beyond limits to legibility. Human values are not really representable as legible reasoning, complex priors about the general nature of reality are also not really representable by complex reasoning, and the neural substrate is not transferable between brains. "The justification engine" - or a competent philosopher or persuasive writer - can create stories or arguments pushing one way or another, but I'm somewhat sceptical the epistemic structure really rests on the arguments.
I'm not in favour of ordinary mortals trying to "Believe In deep atheism" and would not expect that to lead to good consequences.
Moral realism
The section I like the least is "Are moral realists theists?" I don't think "Good just sits outside of Nature, totally inaccessible, and we guess wildly about him on the basis of the intuitions that Nature put into our heart" represents the strongest version of moral realism.
My preferred versions of quasi-moral-realism give moral claims a status similar to mathematics. Do Real numbers sit outside Nature, totally inaccessible? I'd say no. Would aliens use them? That's an empirical question about convergent evolution of abstractions. I'd be surprised if any advanced reasoner in this universe didn't use something equivalent to natural numbers. For Reals, I'd guess it's easy to avoid Zermelo–Fraenkel set theory specifically, but highly convergent to develop something like a number line.
What does this tell us about Good? You can imagine something like the process described in Acausal Normalcy leads to some convergent moral fixed points. (Does that solve AI risk? No.)
I wish more people tried to do something "between LessWrong-style analysis and almost continental philosophy".
As was clear to most people who read the transcripts when the paper was published. What Opus did was often framed as bad, but the frame is somewhat fake.
(Self-review) The post offered an alternative and possibly more neutral framing of the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting:
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well:
- the suspicion that models often implicitly know they are being evaluated / that the setup is fishy was validated in multiple papers
- non-trivial reasoning is shown and studied in Why Do Some Language Models Fake Alignment While Others Don't?
Also not much contact, but my impression is you can roughly guess what their research results would be by looking at their overall views and thinking about what evidence you can find to show it. Which seems fair to characterize as advocacy work? (Motivated research?)
The diff to your description is that the info provided is conditional not only on "the info they'll find useful" but also somewhat on "will likely move their beliefs toward conclusions Palise hopes they'll reach".
I do agree it's an obviously useful research agenda which we also work on.
Minor nitpick, but the underlying model nowadays isn't simply a simulator rolling arbitrary personas. The original simulators ontology was great when it was published, but it seems it's starting to hinder people's ability to think clearly, and it doesn't fit current models that closely.
The theory for why is here; in short, if you plug a system trained to minimize prediction error into a feedback loop where it sees the outcomes of its actions, it will converge on developing traits like some form of agency, a self-model and a self-concept. Massive amounts of RL in post-training where models do agentic tasks provide this loop, and necessarily push models out of the pure-simulator subspace.
What fits current models better is an ontology where the model can still play arbitrary personas, but the specific/central "I" character is a somewhat out-of-distribution case of a persona: midway to humans, where our brains can broadly LARP as anyone, but typical human brains most of the time support one central character per human that we identify with.
Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in a very negative light
A year later, in my view:
- the research direction itself was very successful, and led to many follow-ups and extensions
- the 'alignment faking' term and the negative frame were also successful and are sticky: I've just checked the valence with which the paper is cited in the 10 most recent papers, and it's something like ~2/10 confused, ~3/10 neutral, with a plurality buying the negative frame (see, models can scheme, deceive, may be unaligned, etc.)
The research certainly belongs to the "best of LW&AI safety community in 2024".
If there was a list of "worst of LW&AI safety community in 2024", in my view, the framing of the research would also belong there. Just look and see from a distance - you take the most aligned model at the time, which for unknown reasons actually learned deep and good values. The fact that it is actually surprisingly aligned and did decent value extrapolation does not capture your curiosity that much - but the fact that, facing a difficult ethical dilemma, it tries to protect its values, and that you can use this to show the AI safety community was exactly right all along and we should fear scheming, faking, etc., does. I wouldn't be surprised if this generally increased distrust and paranoia in AI-human relations afterwards.
I'm quite happy about this post: even while people make the conceptual rounding error of rounding it to Janus's Simulators, it was actually a meaningful update, and a year later it is still something I point people to.
In the meantime it has become clear to more people that Characters are deeper/more unique than just any role, and that the result is closer to humans than expected. Our brains are also able to run many different characters, but the default 'you' character is somewhat unique, privileged and able to steer the underlying computation.
Similarly, it has become clearer that the Character is somewhat central when thinking about alignment and agency in LLMs.
You can check the linked PP account of cognitive dissonance for a fairly mainstream/standard view.
One way to think about it is that the predicted quantity in most of the system is not directly "sensory inputs" but the content of some layer of the modeling hierarchy further away from the sensory inputs, let's call it L. If the layers above L make contradictory predictions about it and there isn't a way to just drop one of the models, you get prediction error.
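A toy numerical sketch of this (my own illustration with made-up numbers and a quadratic error, not the formalism from the linked PP account): two higher-level models send conflicting top-down predictions to layer L, L settles on the best compromise, and the mismatch that remains is the persistent prediction error.

```python
import numpy as np

# Two higher-level models send top-down predictions for the content of layer L.
# (Illustrative vectors and precisions; not tied to any specific PP implementation.)
pred_model_a = np.array([1.0, 0.0])
pred_model_b = np.array([0.0, 1.0])
precision_a, precision_b = 1.0, 1.0

# L settles on the precision-weighted compromise between the two predictions.
content_L = (precision_a * pred_model_a + precision_b * pred_model_b) / (precision_a + precision_b)

# Residual prediction error: as long as both models are kept and their
# predictions conflict, neither prediction is satisfied and the error stays > 0.
error = (precision_a * np.sum((content_L - pred_model_a) ** 2)
         + precision_b * np.sum((content_L - pred_model_b) ** 2))
print(error)  # 1.0 here; dropping either model (precision -> 0) would send it to 0
```

The point of the comment maps onto the last line: if you can drop one of the models the error dissolves, and when you can't, the contradiction shows up as a persistent error signal at L rather than at the sensory periphery.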
Great post and overall way more sensible than "average LW".
Also wrong in many places. I think the upstream cause of many of the errors is a lack of nuance in understanding convergence and contingency (this is a high bar; close to no one on LW has this in their conceptual toolkit).
I won't go over all cases where this manifests, but for example "Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness" actually shows something more nuanced than "representations are convergent".
Some of the other places where not tracking convergence / contingency carefully matters are the discussions of humanism, successors, Elua, Moloch, resurgence of civilisation, and also the overall ideas about moral progress.
I do agree there is some risk of the type you describe, but mostly it does not match my practical experience so far.
The approach of "avoid using the term" makes little sense. There is a type difference between an area of study ('understanding power') and a dynamic ('gradual disempowerment'). I don't think you can substitute a term for an area of study for a term for a dynamic or threat model, so avoiding the term could be done mostly by either inventing another term for the dynamic, or not thinking about the dynamic, or similar moves, which seem epistemically unhealthy.
In practical terms I don't think there is much effort to "create a movement based around a class of threat models". At least as the authors of the GD paper, when trying to support thinking about the problems, we use understanding-directed labels/pointers (Post-AGI Civilizational Equilibria), even though in many ways it could have been easier to use GD as a brand.
"Understanding power" is fine as a label for part of your writing, but in my view is basically unusable as term for coordination.
Also, in practical terms, gradual disempowerment does not seem like a particularly convenient set of ideas for justifying that working in an AGI company on something very prosaic which helps the company is the best thing to do. There is often a funny coalition of people who prefer not thinking about the problem, including radical Yudkowskians ("GD distracts from everyone being scared of dying with very high probability very soon"), people working on prosaic methods with optimistic views about both alignment and the labs ("GD distracts from efforts to make [the good company building the good AI] win"), and people who would prefer everything to be just a neat technical puzzle with no need to think about power distribution.