Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

By Tom Everitt, Lewis Hammond, Rhys Ward, Ryan Carey, James Fox, Sebastian Benthall, Matt MacDermott and Shreshth Malik representing the Causal Incentives Working Group. Thanks also to Toby Shevlane, MH Tessler, Aliya Ahmad, Zac Kenton, Maria Loks-Thompson, and Alexis Bellot.

Over the next few years, society, organisations, and individuals will face a number of fundamental questions stemming from the rise of advanced AI systems: 

  • How do we make sure that advanced AI systems do what we want them to (the alignment problem)?
  • What makes a system safe enough to develop and deploy, and what constitutes sufficient evidence of that?
  • How do we preserve our autonomy and control as decision making is increasingly delegated to digital assistants?

A causal perspective on agency provides conceptual tools for navigating the above questions, as we’ll explain in this sequence of blog posts. An effort will be made to minimise and explain jargon, to make the sequence accessible to researchers from a range of backgrounds.


First, by agent we mean a goal-directed system that acts as if it is trying to steer the world in some particular direction(s). Examples include animals, humans, and organisations (more on agents in a subsequent post). Understanding agents is key to the above questions. Artificial agents are widely considered the primary existential threat from AGI-level technology, whether they emerge spontaneously or through deliberate design. Among the myriad risks to our existence, highly capable agents pose a distinct danger, because many goals can be achieved more effectively by accumulating influence over the world. Whereas an asteroid moving towards Earth isn’t intending to harm humans and won’t resist redirection, misaligned agents might be distinctly adversarial and active threats.

Second, the preservation of human agency is critical in the approaching technological transition, for both individuals and collectives. Concerns have already been raised that manipulative social media algorithms and content recommenders undermine users’ ability to focus on their long-term goals. More powerful assistants could exacerbate this. And as more decision-making is delegated to AI systems, the ability of society to set its own trajectory comes into question.

Human agency can also be nurtured and protected. Helping people to help themselves is less paternalistic than directly fulfilling their desires, and fostering empowerment may be less contingent on complete alignment than direct satisfaction of individual preferences. Indeed, self-determination theory provides evidence that humans intrinsically value agency, and some human rights can be interpreted as “protections of our normative agency”.

Third, artificial agents might themselves eventually constitute moral patients. A clearer understanding of agency could help us refine our moral intuitions and avoid unethical actions. Some ethical dilemmas might be possible to avoid altogether by only designing artificial systems that lack moral patienthood.

Key questions

One hope for our research is that it would build up a theory of agency. Such a theory would ideally answer questions such as:

  • What are the possible kinds of agents that can be created, and along what dimensions can they differ? The agents we’ve seen so far primarily include animals, humans, and human organisations, but the range of possible goal-directed systems is likely much larger than that.
  • Emergence: how are agents created? For example, when might a large language model become agentic? When does a system of agents become a “meta-agent”, such as an organisation?
  • Disempowerment: how is agency lost? How do we preserve and nurture human agency?
  • What are the ethical demands posed by various types of systems and agents?
  • How do we recognise agents and measure agency? A concrete operationalisation would help us to detect agency in artificial systems, and agency loss in humans.
  • How do we predict agent behaviour? What behaviour is incentivised and how do agents generalise to new situations? If we understand the impact of the behaviour, we may also be able to anticipate danger.
  • What are the possible relationships between agents? Which are harmful and which are beneficial?
  • How do we shape agents, to make them safe, fair, and beneficial?


Causality is helpful for understanding agents. Philosophers have been interested in causality for a long time, not just because the exact relationship between a cause and an effect is intellectually intriguing, but because it underpins so many other concepts, many of which are relevant to understanding agents and designing safe AGI.

For example, both influence and response are causal concepts. We want agents that influence the world in positive ways, and respond appropriately to instructions. A range of other relevant concepts also build on causality:

  • Agency, since a goal-directed system is one where the system’s goals are driving (causing) its behaviour.
  • Intention, which relates to the reasons for (causes of) action, and to goal-directed means-end influence. Intent is an important concept when ascribing legal and moral responsibility.
  • Harm, manipulation, and deception, which consider how your well-being, actions, or beliefs were influenced, and which are usually taken to be intentional.
  • Fairness, which concerns how e.g. a hiring decision responds to personal attributes such as gender or race, and whether those attributes influence the outcome.
  • Robust generalisation to changes in the environment is much easier for agents with a causal model of their environment.
  • Hypotheticals / counterfactuals, as the alternative world deviates from the actual world based on one or more causal interventions.
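
As a rough illustration of how influence, fairness, and counterfactuals can all be read off a causal model, here is a minimal sketch of a toy structural causal model for the hiring example. Everything in it is a hypothetical assumption made for illustration: the variable names, the `referral` and `hire` mechanisms, and the thresholds are invented here and are not taken from our papers. The sketch holds the applicant's background facts fixed, intervenes on one attribute, and checks whether the decision would have been different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Applicant:
    """Background facts about one applicant (all values are hypothetical)."""
    gender: str    # protected attribute
    skill: float   # ability; not caused by gender in this toy model


def referral(gender: str) -> bool:
    """Toy mechanism: in this made-up world, referrals depend on gender."""
    return gender == "male"


def hire(skill: float, has_referral: bool) -> bool:
    """Toy hiring rule: a referral lowers the required skill threshold."""
    threshold = 0.5 if has_referral else 0.7
    return skill >= threshold


def outcome(applicant: Applicant, do_gender: Optional[str] = None) -> bool:
    """Evaluate the model, optionally under the intervention do(gender=...).

    `skill` is held fixed, so calling this with and without `do_gender`
    compares the actual world with a counterfactual one.
    """
    gender = applicant.gender if do_gender is None else do_gender
    return hire(applicant.skill, referral(gender))


def gender_influences_outcome(applicant: Applicant) -> bool:
    """True if intervening on gender would change the hiring decision."""
    other = "male" if applicant.gender == "female" else "female"
    return outcome(applicant) != outcome(applicant, do_gender=other)
```

In this toy world, `gender_influences_outcome(Applicant("female", 0.6))` is true, because the decision flips once the referral path is switched on by the intervention, whereas for an applicant with skill 0.9 the decision is the same in both worlds. The point is only that "would the outcome have differed?" is a question about interventions, which a purely statistical description of past hiring data cannot answer by itself.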

The tree of causality

The rest of this sequence will explain in more detail how these concepts are grounded in causality, and the research this has led to. We hope this will enable and motivate other researchers to join our effort of building a formal theory of safe A(G)I based on causal foundations. Much of our recent work fits into this vision. For example, in discovering agents and reasoning about causality in games, we developed a better understanding of how to represent various aspects of reality with causal models. With the agent incentives paper, we showed how such models can be analysed to reveal safety-relevant properties. And with path-specific objectives, we illustrated how this kind of analysis can inspire improved designs.

We hope this will complement other research directions crucial to safe AGI, like scalable alignment, dangerous capability evaluations, robustness, interpretability, ethics, policy and governance, forecasting, agent foundations, and risk mapping.


We hope that a causality-based understanding of agency and related aspects will help designers of AI systems by clarifying the space of possibilities for agents, and how to avoid especially risky configurations. It may give regulators a better picture of what to look out for, and what should count as sufficient evidence of safety. It may help us all decide what behaviour is acceptable towards what kinds of systems. And last but not least, it may help individuals understand what it is that they seek to preserve and enhance in their interactions with artificially intelligent systems.

In the next post, we explain causality and causal models in more detail, covering Pearl’s different causal models, and how they can be generalised to account for the presence of one or more agents.


Comment one: Paternalism, besides fulfilling desires for you, tells you what your desires are. It's nurturing, but not in a nurturing way. Authoritarianism doesn't bother doing any of that; your agency/making is a problem, a threat, because it exists. It is a uni-world in structure, much like a narcissistic psychopath's view of themselves where self=world, and both these de-nurturing parasites survive and co-evolve with us, where we support and acquiesce by de-agenting ourselves as only agents can. The self/world is a ratio of sorts, of partials ranked and intermixed: like Intention/preferences/inclinations/wants-needs/biases/values. (keyword Janus).

Comment 2: Intention/preferences/inclinations/wants-needs/biases/values need not be conscious per se, nor even subconscious, to affect learning. An intimation of weight can direct the organisation of legs/walking in robotic spiders. (I'll get back here with a reference to this work from the early to mid 2000s; it might take a few days.) It is easy in one's mind's eye to swap a mass with a drive (e.g. hunger) and see how an animal that moves arises and moves-over/absorbing the flow of energy that plants delay. And thus the body/world is made as its own thing from the substance of the terrain as a substrate of the landscape composed with the body moving over it, until involuted into psychology within their society/polities/economies, nurturing or otherwise.

My frameworks: I write on world/worlding/worldbuilding from a poetic interdisciplinary position with nods to moral philosophy as an applied tech involving some psychology (if a bit pop) , over at

Follows up on what I learned from primatologists: Based on the other comments regarding methodologies (frameworks), I predict more poetry will be required, both to reframe our assumptions (disruption), and to map between frameworks (that we are morally urged to should (be made)) in order to communicate this research re-building worlds of us (as we should).

 I have below average mathematical understanding.

I will follow the work and debates on agency and modelling with interest.

So the contributions of VNM theory are shrunken down into "intention"? Will you recapitulate that sort of framing (such as involving the interplay between total orders and real numbers) or are you feeling more like it's totally wrong and should be thrown out?

"So the contributions of VNM theory are shrunken down into 'intention'?"

(Background: I consider myself fairly well-read w.r.t. causal incentives, not very familiar with VNM theory, and well-versed in Pearlian causality. I have gotten a sneak peek at this sequence so have a good sense of what's coming.)

I'm not sure I understand VNM theory, but I would suspect the relationship is more like "VNM theory and <this agenda> are two takes on how to reason about the behavior of agents, and they both refer to utilities and Bayesian networks, but have important differences in their problem statements (and hence, in their motivations, methodologies, exact assumptions they make, etc)".

I'm not terribly confident in that appraisal at the moment, but perhaps it helps explain my guess for the next question:

"Will you recapitulate that sort of framing (such as involving the interplay between total orders and real numbers)"

Based on my (decent?) level of familiarity with the causal incentives research, I don't think there will be anything like this. Just because two research agendas use a few of the same tools doesn't mean they're answering the same research questions, let alone sharing methodologies.

"...or are you feeling more like it's totally wrong and should be thrown out?"

When two different research agendas are distinct enough (as I suspect VNM and this causal-framing-of-AGI-safety are), their respective success/failures are quite independent. In particular, I don't think the authors' choice to pursue this research direction over the last few years should be taken by itself as a strong commentary on VNM.

But maybe I didn't fully understand your comment, since I haven't read up on VNM.

I'm not sure I entirely understand the question, could you elaborate? Utility functions will play a significant role in follow-up posts, so in that sense we're heavily building on VNM.

Yeah, what I meant was that "goals" or "preferences" are often emphasized front and center, but here not so much, because it seems like you want to reframe that part under the banner of "intention".

"A range of other relevant concepts also build on causality:"

It just felt a little odd to me that so much bubbled up from your decomposition except utility, but you only mention "goals" as this thing that "causes" behaviors without zeroing in on a particular formalism. So my guess was that VNM would be hiding behind this "intention" idea.

Preferences and goals are obviously very important. But I'm not sure they are inherently causal, which is why they don't have their own bullet point on that list. We'll go into more detail in subsequent posts.