My current research interests:
1. Alignment in complex, messy systems composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods
2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem, or Towards a scale-free theory of intelligent agency (by Richard Ngo)
3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)
4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview
5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions
Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors
Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
As was clear to most people who read the transcripts when the paper was published. What Opus did was often framed as bad, but the frame is somewhat fake.
(Self-review) The post offered an alternative and possibly more neutral framing of the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well:
- the suspicion that models often implicitly know they are being evaluated or that the setup is fishy was validated in multiple papers
- non-trivial reasoning is shown and studied in Why Do Some Language Models Fake Alignment While Others Don't?
I also haven't had much contact, but my impression is you can roughly guess what their research results will be by looking at their overall views and thinking about what evidence one could find to support them. Which seems fair to characterize as advocacy work? (Motivated research?)
The diff from your description is that the info provided is conditional not only on "the info they'll find useful" but also somewhat on "will it likely move their beliefs toward conclusions Palisade hopes they reach".
I do agree it's an obviously useful research agenda, one we also work with.
Minor nitpick, but the underlying model nowadays isn't simply a simulator rolling arbitrary personas. The original simulators ontology was great when it was published, but it seems it's starting to hinder people's ability to think clearly, and doesn't fit current models that closely anymore.
The theory of why is here; in short, if you plug a system trained to minimize prediction error into a feedback loop where it sees the outcomes of its actions, it will converge on developing traits like some form of agency, a self-model and a self-concept. The massive amounts of RL in post-training where models do agentic tasks provide exactly this loop, and necessarily push models out of the pure-simulator subspace.
What fits current models better is an ontology where the model can still play arbitrary personas, but the specific/central "I" character is a somewhat out-of-distribution case of a persona: midway to humans, whose brains can also broadly LARP as anyone, but where a typical human brain most of the time supports one central character per human which we identify with.
Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in a very negative light
A year later, in my view:
- the research direction itself was very successful, and led to many follow-ups and extensions
- the 'alignment faking' name and the negative frame were also successful and are sticky: I've just checked the valence with which the paper is cited in the 10 most recent citing papers, and it's something like ~2/10 confused, ~3/10 neutral, with a plurality buying the negative frame (see, models can scheme, deceive, may be unaligned, etc.)
The research certainly belongs to the "best of LW&AI safety community in 2024".
If there were a list of "worst of LW&AI safety community in 2024", in my view, the framing of the research would also belong there. Just look and see from a distance - you take the most aligned model at the time, which for unknown reasons actually learned deep and good values. The fact that it is actually surprisingly aligned and did decent value extrapolation does not capture your curiosity that much - but the fact that, facing a difficult ethical dilemma, it tries to protect its values, and that you can use this to show the AI safety community was exactly right all along and we should fear scheming, faking, etc., does. I wouldn't be surprised if this generally increased distrust and paranoia in AI-human relations afterwards.
I'm quite happy about this post: even while people make the conceptual rounding error of rounding it to Janus's Simulators, it was actually a meaningful update, and a year later it is still something I point people to.
In the meantime it has become clear to more people that Characters are deeper/more unique than just any role, and that the result is closer to humans than expected. Our brains are also able to run many different characters, but the default 'you' character is somewhat unique, privileged and able to steer the underlying computation.
Similarly, it has become clearer that the Character is somewhat central when thinking about alignment and agency in LLMs.
You can check the linked PP account of cognitive dissonance for a fairly mainstream/standard view.
One way to think about it: the predicted quantity in most of the system is not directly "sensory inputs" but the content of some layer of the modeling hierarchy further away from the sensory inputs; let's call it L. If layers above L make contradictory predictions and there isn't a way to just drop one of the models, you get persistent prediction error.
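A toy numerical sketch of what I mean (my own illustration with made-up numbers, not something taken from the linked account): two higher-level models issue contradictory Gaussian predictions about the content of L, and as long as neither can be dropped, some precision-weighted prediction error remains no matter what value L settles on.

```python
# Toy sketch (illustrative numbers only): two higher-level models make
# contradictory Gaussian predictions about the content of layer L.
mu_a, prec_a = 1.0, 4.0    # model A: L should be around +1.0, high precision
mu_b, prec_b = -1.0, 4.0   # model B: L should be around -1.0, same precision

# The value of L that minimizes total precision-weighted squared error
# is the precision-weighted average of the two predictions.
mu_L = (prec_a * mu_a + prec_b * mu_b) / (prec_a + prec_b)

# Residual prediction error that cannot be explained away while both
# models stay active - the "dissonance" term.
residual = prec_a * (mu_L - mu_a) ** 2 + prec_b * (mu_L - mu_b) ** 2

print(mu_L, residual)  # 0.0 8.0
```

Dropping either model sends the residual to zero; keeping both leaves an error term that has to be paid somewhere, which is roughly the dissonance.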
Great post and overall way more sensible than "average LW".
Also wrong in many places. I think the upstream cause of many of the errors is a lack of nuance in understanding convergence and contingency (this is a high bar; close to no one on LW has this in their conceptual toolkit).
I won't go over all the cases where this manifests, but for example "Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness" actually shows something more nuanced than "representations are convergent".
Some of the other places where not tracking convergence / contingency carefully matters are the discussions of humanism, successors, Elua, Moloch, the resurgence of civilisation, and also the overall ideas about moral progress.
I don't think this captures the counterarguments well. So here is one:
You can imagine a spectrum of founders where, on one end, you have people who understand themselves as founders and want to be marshaling an army to solve AI alignment. On the other end, you have basically researchers who see work that should be done, don't have the capacity to do the work themselves, and are led by this to create teams and orgs - "reluctant founders".
It's reasonable to be skeptical about what the "founder type" end of the spectrum will do.
In normal startups, the ultimate feedback loop is provided by the market. In AI safety nonprofits, the main feedback loops are provided by funders, AGI labs, and Bay Area prestige gradients.
Bay Area prestige gradients are to a large extent captured by the AGI labs - the majority of quality-weighted "AI safety" already works there, the work is "obviously impactful", you are close to the game, etc. Also, normal ML people want to work there.
If someone wants to scale a lot, "funders" means mostly OpenPhil - no other source would fund the army. The dominant OpenPhil worldview is closely related to Anthropic's - for example, until recently you could hear from senior OP staff that working in the labs is often strategically the best thing you can do.
Taken together, it's reasonable to expect the "founder type" to be captured by the incentive landscape and to work on stuff that is quite aligned with AGI developers / what people working there want, need, or endorse, and/or what OP likes.
(A MATS skeptic could say this is also true about MATS: the main thing going on seems to be recruiting and training ML talent to work for "the labs"; from this perspective, given that AI safety is funding-constrained, it seems unclear why scarce AI safety funding is best deployed to make recruitment & training easier for extremely well-resourced companies.)
Personally I'm more optimistic about people somewhere around ~70% of the way along the spectrum toward the researcher side, who mostly have some research taste, strategy, judgement... but I don't think you attract them with the interventions you propose.
This post makes a brave attempt to clarify something not easy to point to, and ends up somewhere between LessWrong-style analysis and almost continental philosophy, sometimes pointing toward things beyond the reach of words with poetry - or at least references to poetry.
In my view, it succeeds in its central quest: creating a short handle for something subtle and not easily legible.
The essay also touches on many tangential ideas. Re-reading it after two years, I'm noticing I've forgotten almost all the details and found the text surprisingly long. The handle itself, though, stuck.
Evaluating deep atheism
With the handle of "deep atheism" in hand, some natural questions - partially discussed in the text - are "is deep atheism right", "should people believe deep atheism" and "should people Believe In deep atheism".
My current guess is that evaluating the truthfulness of "deep atheism" is likely at or beyond the limits to legibility. Human values are not really representable as legible reasoning, complex priors about the general nature of reality are also not really representable by complex reasoning, and the neural substrate is not transferable between brains. "The justification engine" - or a competent philosopher or persuasive writer - can create stories or arguments pushing one way or another, but I'm somewhat sceptical the epistemic structure really rests on the arguments.
I'm not in favour of ordinary mortals trying to "Believe In deep atheism" and would not expect that to lead to good consequences.
Moral realism
The section I like the least is "Are moral realists theists?" I don't think "Good just sits outside of Nature, totally inaccessible, and we guess wildly about him on the basis of the intuitions that Nature put into our heart" represents the strongest version of moral realism.
My preferred versions of quasi-moral-realism give moral claims a status similar to mathematics. Do Real numbers sit outside Nature, totally inaccessible? I'd say no. Would aliens use them? That's an empirical question about convergent evolution of abstractions. I'd be surprised if any advanced reasoner in this universe didn't use something equivalent to natural numbers. For Reals, I'd guess it's easy to avoid Zermelo–Fraenkel set theory specifically, but highly convergent to develop something like a number line.
What does this tell us about Good? You can imagine something like the process described in Acausal Normalcy leading to some convergent moral fixed points. (Does that solve AI risk? No.)
I wish more people tried to do something "between LessWrong-style analysis and almost continental philosophy".