Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TL;DR: transformative AI (TAI) plausibly requires causal models of the world. Thus, a component of AI safety is ensuring secure paths to generating these causal models. We think the lens of causal models might be undervalued within the current alignment research landscape and suggest possible research directions.

This post was written by Marius Hobbhahn and David Seiler. MH would like to thank Richard Ngo for encouragement and feedback.

If you think these are interesting questions and want to work on them, write to us. We will probably start to play around with GPT-3 soonish. There is certainly stuff we missed, so feel free to send us references if you think they are relevant.

There are already a small number of people working on causality within the EA community. They include Victor Veitch, Zhijing Jin and PabloAMC. Check them out for further insights. There are also other alignment researchers working on causal influence diagrams (authors: Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg) whose work is very much related. 

Causality - a working definition:

Just to get this out of the way: we follow a broad definition of causality, i.e. we assume it can be learned from (some) data and doesn’t have to be put into the model by humans. Furthermore, we don’t think the representation has to be explicit, e.g. in a probabilistic model, but could be represented in other ways, e.g. in the weights of neural networks. 

But what is it? In a loose sense, you already know: things make other things happen. When you touch a light switch and a light comes on, that’s causality. There is a more technical sense in which no one understands causality, not even Judea Pearl (where does causal information ultimately come from if you have to make causal assumptions to get it? For that matter, how do we get variables out of undifferentiated sense data?). But it's possible to get useful results without understanding causality precisely, and for our purposes, it's enough to approach the question at the level of causal models.

Concretely: you can draw circles around phenomena in the world (like "a switch" and "a lightbulb") to make them into nodes in a graph, and draw arrows between those nodes to represent their causal relationships (from the switch to the lightbulb if you think the switch causes the lightbulb to turn on, or from the lightbulb to the switch if you think it's the other way around).

There’s an old Sequences post that covers the background in more detail. The key points for practical purposes are that causal models:

  1. Are sparse, and thus easy to reason about and make predictions with (or at least, easier to reason about than the joint distribution over all your life experiences).
  2. Can be segmented by observations. Suppose you know that the light switch controls the flow of current to the bulb and that the current determines whether the bulb is on or off. If you then observe that there’s no current in the wire (maybe there’s a blackout), you don’t need to know anything about the state of the switch to know the state of the bulb.
  3. Can evaluate counterfactuals. If the light switch is presently off, but you want to imagine what would happen if it were on, your causal model can tell you (insofar as it’s correct).
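To make points 2 and 3 concrete, here is a minimal sketch of the switch/current/bulb example as a tiny structural model. All names (`World`, `current_flows`, `bulb_on`, `bulb_if_switch_were`) are hypothetical and chosen only for illustration, and the mechanisms are deterministic for simplicity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class World:
    """Exogenous facts of the toy world (hypothetical names)."""
    switch_on: bool  # position of the light switch
    power_on: bool   # False during a blackout


def current_flows(world: World) -> bool:
    # Mechanism 1: current needs both a closed switch and mains power.
    return world.switch_on and world.power_on


def bulb_on(current: bool) -> bool:
    # Mechanism 2: the bulb depends on the switch only through the current.
    return current


def bulb_if_switch_were(world: World, switch_on: bool) -> bool:
    # Counterfactual: keep everything else fixed, set the switch by hand.
    return bulb_on(current_flows(replace(world, switch_on=switch_on)))


# Segmentation by observation: during a blackout there is no current,
# so the bulb state is the same for either switch position.
blackout_bulbs = {bulb_on(current_flows(World(s, power_on=False))) for s in (True, False)}

# Counterfactual query: the switch is actually off, what if it were on?
actual = World(switch_on=False, power_on=True)
print(bulb_on(current_flows(actual)), bulb_if_switch_were(actual, True))
```

Because the bulb mechanism only takes the current as input, knowing the current makes the switch irrelevant, which is the "segmented by observations" property in miniature.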

Why does causality matter?

Causal, compared to correlational, information has two main advantages. For the following section, I got help from a fellow Ph.D. student.

1. Data efficiency

Markov factorization: Mathematically speaking, the Markov factorization encodes conditional independence between some nodes given other nodes. In practice, this means that, if we assume a causal structure, we can write the joint probability distribution as a product of factors over a sparse graph in which only some nodes are connected.

“Namely, if we have a joint with n binary random variables, it would have 2^n - 1 independent parameters (the last one is determined to make the sum equal to 1). If we have k factors with n/k variables each, then we would have k(2^(n/k) - 1) independent parameters. For n=20 and k=4, the numbers are 1048575 vs. 124.” - Patrik Reizinger
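The two parameter counts can be checked directly. This is a small sketch that just evaluates the formulas from the quote; the function names are ours, and it assumes k divides n.

```python
def full_joint_params(n: int) -> int:
    """Independent parameters of an unrestricted joint over n binary variables."""
    return 2 ** n - 1


def factorized_params(n: int, k: int) -> int:
    """Independent parameters when the joint splits into k independent
    factors of n/k binary variables each."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    return k * (2 ** (n // k) - 1)


full = full_joint_params(20)
factored = factorized_params(20, 4)
print(full, factored)  # prints: 1048575 124
```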

Independent Mechanisms: the independent mechanisms principle ensures that factors do not influence each other. Therefore, if we observe shifts in our data distribution, we only need to retrain a few parts of the model. If we observe global warming, for example, the vast majority of physics stays the same. We only need to recalibrate some parts of our model that relate to temperature and climate. Another example is the lightbulb blackout scenario from above. If you know there is a blackout, you don't need to flip the switch to know that the light won't turn on.

The conclusion of these two statements is that correlational models assume a lot more relations between variables than causal models and the entire model needs to be retrained every time the data changes. In causal models, however, we usually only need to retrain a small number of mechanisms. Therefore, causal models are much more sample efficient than correlational ones. 
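As a toy illustration of the independent mechanisms argument, consider two separately stored factors, P(hot) and P(ice melts | hot). The numbers below are made up, and the variable names are hypothetical. When only P(hot) shifts, every entry of the joint table changes, but the second mechanism does not need to be touched.

```python
from fractions import Fraction as F

# Mechanism 1: how often it is hot, before and after a (made-up) climate shift.
P_HOT_OLD, P_HOT_NEW = F(1, 5), F(1, 2)
# Mechanism 2: P(ice melts | hot?), the "physics" that stays the same.
P_MELT_GIVEN_HOT = {True: F(9, 10), False: F(1, 10)}


def joint(p_hot, p_melt_given_hot):
    """Assemble P(hot, melt) from the two separately stored mechanisms."""
    table = {}
    for hot in (True, False):
        p_h = p_hot if hot else 1 - p_hot
        for melt in (True, False):
            p_m = p_melt_given_hot[hot] if melt else 1 - p_melt_given_hot[hot]
            table[(hot, melt)] = p_h * p_m
    return table


def melt_given_hot(table, hot):
    """Read P(melt | hot) back out of a joint table."""
    return table[(hot, True)] / (table[(hot, True)] + table[(hot, False)])


old = joint(P_HOT_OLD, P_MELT_GIVEN_HOT)
new = joint(P_HOT_NEW, P_MELT_GIVEN_HOT)

# A model of the raw joint has to update every entry...
changed_joint_entries = sum(old[key] != new[key] for key in old)
# ...while the factorized model only had to update P(hot).
mechanism_unchanged = all(
    melt_given_hot(old, hot) == melt_given_hot(new, hot) for hot in (True, False)
)
print(changed_joint_entries, mechanism_unchanged)
```

In this sketch all four joint entries change while one factor parameter was edited, which is the sample-efficiency argument in its smallest form.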

2. Action guiding

Causal models introduce a very strong assumption on the model. Namely, variables are not just related, they are related in a directed way. Thus, causal models imply a testable hypothesis. If our causal model is that taking a specific drug reduces the severity of a disease, then we can test this with an RCT. So our model, drug -> disease, is a falsifiable hypothesis. 

The same thing is not possible for correlational models. If we say that drug intake correlates with the severity of the disease, we are saying that either the drug helps with the disease, or people who have less severe diseases take more drugs, or both depend on a third variable. As soon as we intervene by fixing one variable and observing the other, we have already made a causal assumption.

Correlational knowledge can still be used for actions--you can still take the drug and hope the causal arrow goes in the right direction. But it could also have a different effect than desired since you don’t know which variable is the cause and which one is the effect.
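The gap between "people who take the drug recover more often" and "taking the drug makes you recover" can be made concrete with a toy model. The sketch below assumes a hypothetical world in which the drug does nothing at all, but people with mild cases are more likely to take it. All probabilities are invented for illustration.

```python
from fractions import Fraction as F

P_MILD = F(1, 2)                                   # P(case is mild)
P_DRUG_GIVEN_MILD = {True: F(4, 5), False: F(1, 5)}  # mild cases take the drug more often
# P(recover | mild?, drug?): by construction the drug has no effect.
P_RECOVER = {
    (True, True): F(9, 10), (True, False): F(9, 10),
    (False, True): F(3, 10), (False, False): F(3, 10),
}


def p_mild(mild):
    return P_MILD if mild else 1 - P_MILD


def observational_recovery(drug):
    """P(recover | drug) as seen in passively collected data."""
    num = den = F(0)
    for mild in (True, False):
        p_d = P_DRUG_GIVEN_MILD[mild] if drug else 1 - P_DRUG_GIVEN_MILD[mild]
        num += p_mild(mild) * p_d * P_RECOVER[(mild, drug)]
        den += p_mild(mild) * p_d
    return num / den


def interventional_recovery(drug):
    """P(recover | do(drug)): assigning the drug cuts the severity -> drug arrow,
    as an RCT would."""
    return sum(p_mild(mild) * P_RECOVER[(mild, drug)] for mild in (True, False))


print(observational_recovery(True), observational_recovery(False))    # prints: 39/50 21/50
print(interventional_recovery(True), interventional_recovery(False))  # prints: 3/5 3/5
```

In the observational data the drug looks strongly beneficial (78% vs. 42% recovery), while under intervention it makes no difference. Acting on the correlation alone would be a mistake in this toy world.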

Causal models greatly improve the ability of models to make decisions and interact with their environment. Therefore we think it is highly plausible that transformative AI will have some causal model of the world. Due to the rise of data-driven learning, we expect this model to be learned from data, but we could also imagine some human interference or inductive biases.

Overall, we think that the thesis that causality matters for TAI is not very controversial but we think there are a lot of implications for AI safety that are not yet fully explored. 

Questions & Implications for AI safety:

If the causal models in ML algorithms have a large effect on their actions/predictions, we should really understand how they work. Some considerations include:

  1. Which causal models do current ML architectures have? Does GPT-3 have a causal model of the world and how can we find out? Can we find sets of prompts that give us relevant information about this question? Can interpretability tell us something about the internal causal model? 
    If our ML model has learned a slightly wrong causal model of the world, it will make incorrect predictions on data points outside of the training distribution. Therefore it seems relevant to understand which kind of model the algorithm is acting on. This is a subcategory of alignment and interpretability.
  2. What are the inductive biases of causal models? Do classification networks learn causality and do they even need to? We know from interpretability that they learn associations, but is it more “If structure X is in the image then Y” or “Structure X and label Y seem related”? Which inductive biases do LLMs have wrt causality? Do RL architectures automatically learn causality because they intervene?
    If we could say, for example, with higher certainty whether LLMs create internal causal (vs. correlational) models of the world, they might be easier to control and we could get higher certainty about their predictions.
  3. Do we need interventions to learn causal models efficiently? It seems intuitively plausible that interventions speed up learning but they are not strictly necessary. Economists, for example, use natural experiments to derive causal conclusions from observational data. While this is certainly nice, we don’t know whether a lot of observational data is sufficient to build large causal world models. 
    We are scared of ML algorithms increasingly interacting with the real world because if the interventions go wrong they can do a lot of harm. GPT-3 recently got hooked up to Google and we expect someone to be mad enough to give it even more access to interventions on the internet. If there was a non-interventional way to get similar results, we would certainly prefer that.
  4. What is the difference in resource efficiency between humans and current ML algorithms? It is plausible that humans need less data to learn a new task than training current ML models from scratch. However, it is unclear how large that difference is when models are pre-trained to a level comparable to human pretraining from evolution. If we compare, for example, the time it takes humans to beat OpenAI Five with the time it takes to train OpenAI Five to beat these strategies again, we might get closer to the difference in resource efficiency. Some people have already asked whether GPT-3 is already sample-efficient (for fitting new data after pretraining). This could also be explored further.
    Having a better understanding of this difference in training efficiency might give us more insight into the quality of the world model of current algorithms.
  5. A worry: Our intuition is that humans have a bias to overidentify causality, i.e. see causality when it is not necessarily given. This might have been a good survival strategy for our ancestors since not identifying a causal mechanism is likely more deadly than incorrectly identifying one. However, in today’s complex world, this bias might be inappropriate. Just think about how many different stories of causal mechanisms are told after any election, most of which are simplistic and monocausal--“Hillary lost because of X”.
    Our worry is that ML researchers, once they figure out how, will introduce a similar “overidentifying causality” inductive bias into models. This would mean that very powerful models with potentially big impacts have the causal model of a political pundit rather than a scientist. 
    Furthermore, since language models are trained on text that is generated by humans, they might just learn this bias on their own. Then, GPT-n would be as useless as the average political analysis.

What now?

We ask a lot of questions but don’t have many answers. Thus, we think the highest priority is to get a clearer picture, e.g. refine the questions, translate them into testable hypotheses and read more work from other scientists working on causality.

We think that reasonable first steps could be:

  1. Investigate GPT-3 wrt causality. BigBench is an effort to benchmark LLMs and it includes some questions about causality. But there are certainly more questions one could ask.
  2. Summarize the literature on causality from an AI safety perspective. The field of causality is large and scattered across ML, economics, and physics. Just collecting and summarizing the different findings from an AI safety perspective seems like a promising start.
  3. Think about inductive biases and causality. Which models even allow for causal models? Which ones necessarily lead to them? Even high-level considerations without mathematical proofs might already be helpful.
  4. Summarize the literature on animals learning causal models. Surely some scientists have explored this question already; we just have to find them. Maybe it tells us something about AI.

If you think these are interesting questions and want to work on them, reach out. We will probably start to play around with GPT-3 soon. There is certainly research we missed. Feel free to send us references if you think they are relevant. 

Causality is not everything

We don’t want this to be another piece along the lines of “AI truly needs X to be intelligent” where X might be something vague like understanding/creativity/etc. We have the hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape. Not more, not less. 

Furthermore, we don’t need a causal model of everything. Correlations are often sufficient. For example, if you hear an alarm, you don’t need to know exactly what caused the alarm to be cautious. But knowing whether the alarm was caused by fire or by an earthquake will determine what the optimal course of action is. 

So we don’t think humans need to have a causal model of everything and neither do AIs but at least for safety-relevant applications, we should look into it deeper.


Causality might be one interesting angle for AI safety but certainly not the only one. However, there are a ton of people in classic ML who think that causality is the missing piece to AGI. They could be completely wrong but we think it’s at least worth exploring from an AI safety lens. 

In this post, we outlined why causality might be relevant for TAI, which kind of questions might be relevant and how we could start answering them. 


Is there a clear distinction between causality and correlation?

Some people will see our definition as naive and undercomplex. Maybe there is no such thing as causality and it’s all just different shades of correlation. Maybe all causal models are wrong and humans see something that isn’t. Maybe, maybe, maybe. 

Similar to how there is no hard evidence for consciousness, and philosophical zombies that act just as if they were conscious but truly aren't could exist, all causal claims could also be explained by a lot of correlations and luck. But as argued, e.g. by Eliezer, Occam's razor makes the existence of some sort of consciousness much more likely than its absence, and by the same logic it makes causality more likely than its absence.


11 comments

Thanks Marius and David, really interesting post, and super glad to see interest in causality picking up!

I very much share your "hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape."

Most relevant, I've been working with Mary Phuong on a project which seems quite related to what you are describing here. I don't want to share too many details publicly without checking with Mary first, but if you're interested perhaps we could set up a call sometime?

I also think causality is relevant to AGI safety in several additional ways to those you mention here. In particular, we've been exploring how to use causality to describe agent incentives for things like corrigibility and tampering (summarized in this post), formalizing ethical concepts like intent, and understanding agency.

So really curious to see where your work is going and potentially interested in collaborating!

I'm very interested in a collaboration!! Let's switch to DMs for calls and meetings.

I like the direction you're going with this. I agree that causal reasoning is necessary, but not sufficient, for getting alignable TAI. I think just getting step 2 done (Summarize the literature on causality from an AI safety perspective) could have a huge impact in terms of creating a very helpful resource for AI alignment researchers to pull insights from.

Our worry is that ML researchers, once they figure out how, will introduce a similar “overidentifying causality” inductive bias into models. This would mean that very powerful models with potentially big impacts have the causal model of a political pundit rather than a scientist.

One possible workaround for this would be to take a Bayesian approach. Bayes' rule is all about comparing the predictions (likelihoods) of different models (hypotheses) to assign higher probability mass to those models with greater predictive power.

Consider a system that uses an ensemble of differently structured causal models, each containing slots for different factors (e.g., {A, B} (no causal relationship), {A -> B}, {A <- B}, {A <- C -> B}, ...). Then for any given phenomenon, the system could feed in all relevant factors into the slots of the causal graph of each model, then use each model to make predictions, both about passive observations and about the results of interventions. Those causal (or acausal) models with the greatest predictive power would win out after the accumulation of enough evidence.

Of course, the question still remains about how the system chooses which factors are relevant, or about how it decides what kind of state transformations each causal arrow induces. But I think the general idea of multiple hypothesis testing should be sufficient to get any causally reasoning AI to think more like a scientist than a pundit.
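A minimal sketch of this idea, under strong simplifying assumptions that are ours and not part of the proposal above: two binary variables, three candidate structures, a uniform prior, a fixed hypothetical link strength of 0.9, and only interventional data (each data point is "we set one variable by hand and observed the other").

```python
from fractions import Fraction as F

STRENGTH = F(9, 10)  # assumed: an effect copies its cause 90% of the time
CAUSE_OF_LINK = {"A -> B": "A", "B -> A": "B", "no link": None}


def likelihood(hypothesis, intervened, value, observed):
    """P(other variable == observed | do(intervened = value), hypothesis)."""
    if CAUSE_OF_LINK[hypothesis] == intervened:
        # We intervened on the cause, so the effect tends to follow it.
        return STRENGTH if observed == value else 1 - STRENGTH
    # Otherwise the intervention does not propagate: a fair coin.
    return F(1, 2)


def posterior(data, hypotheses=tuple(CAUSE_OF_LINK)):
    """Bayes' rule over causal structures, starting from a uniform prior."""
    weights = {h: F(1, len(hypotheses)) for h in hypotheses}
    for intervened, value, observed in data:
        for h in hypotheses:
            weights[h] *= likelihood(h, intervened, value, observed)
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


# (intervened variable, value we set, value observed on the other variable)
experiments = [("A", 1, 1), ("A", 0, 0), ("A", 1, 1),
               ("B", 1, 0), ("B", 0, 0), ("A", 0, 0)]
beliefs = posterior(experiments)
best = max(beliefs, key=beliefs.get)
print(best, float(beliefs[best]))
```

With these six made-up experiments the posterior concentrates on "A -> B" but still leaves about 11% on the alternatives, which is the kind of calibrated uncertainty a pundit-style model would lack. Choosing the variables and the mechanisms remains the hard part, as noted above.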

There is a more technical sense in which no one understands causality, not even Judea Pearl

I feel like this is too strong, at least in the way I read it, though maybe I misunderstand. The questions you raise do not seem too hard to address:

(where does causal information ultimately come from if you have to make causal assumptions to get it?

Evolution makes the causal assumption that organic replicators are viable (which fails in many places, e.g. when environments do not provide water, negentropy, a broad mix of chemicals, and which depends on a variety of causal properties, like stability), as well as various other causal assumptions. Further, we have "epistemic luck" in being humans, the species of animal that is probably the most adapted for generality, making us need advanced rather than just simplistic causal understanding.

For that matter, how do we get variables out of undifferentiated sense data?)

There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc. These techniques can generally be given causal justification.

There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc. These techniques can generally be given causal justification.


I'd be curious to hear more about this, if you have some pointers


I wrote "etc.", but really the main ones I can think of are probably the ones I listed there. Let's start with correlations since this is the really old-school one.

The basic causal principle behind a correlation/covariance-based method is that if you see a correlation where the same thing appears in different places, then that correlation is due to there being a shared cause. This is in particular useful for representation learning, because the shared cause is likely not just an artifact of your perception ("this pixel is darker or lighter") but instead a feature of the world itself ("the scene depicted in this image has properties XYZ"). This then leads to the insight of Factor Analysis[1]: it's easy to set up a linear generative model with a fixed number of independent latent variables to model your data.
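A small simulation of the generative side of this picture, as a sketch only: one hidden factor drives three observed variables through assumed loadings, and the observed covariances end up close to the products of those loadings. This shows why a shared cause leaves a correlational fingerprint. It does not fit a factor model, and the loadings and noise level are made up.

```python
import random


def sample(loadings, noise_sd, n, seed=0):
    """Draw n observations x_i = w_i * z + noise_i with one shared latent z."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # the hidden common cause
        rows.append([w * z + rng.gauss(0.0, noise_sd) for w in loadings])
    return rows


def covariance(rows, i, j):
    """Sample covariance between observed columns i and j."""
    n = len(rows)
    mean_i = sum(r[i] for r in rows) / n
    mean_j = sum(r[j] for r in rows) / n
    return sum((r[i] - mean_i) * (r[j] - mean_j) for r in rows) / (n - 1)


loadings = [1.0, 0.8, -0.5]
rows = sample(loadings, noise_sd=0.5, n=20000)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    # With independent noise, theory predicts cov(x_i, x_j) = w_i * w_j.
    print(i, j, round(covariance(rows, i, j), 2), loadings[i] * loadings[j])
```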

Factor Analysis still gets used a lot in various fields like psychology, but for machine perception it falls short because perception requires nonlinearity. (Eigenfaces are a relic of the past.) However, the core concept that correlations imply latent variables, and that these latent variables are likely more meaningful features of reality, continues to be relevant in many models:

  • Variational autoencoders, generative adversarial networks, etc., learn to encode the distribution of images, and tend to contain a meaningful "latent space" that you can use to generate counterfactual images at a high level of abstraction. They rely on covariances because they must fit the latent variables so that they capture the covariances between different parts of the images.
  • Triplet loss for e.g. facial recognition tries to filter out the features that correlate for different images of a single person, vs the features that do not correlate for people on different images and are thus presumably artifacts of the image.

Covariance-based techniques aren't the only game in town, though; a major alternative is symmetries. Often, we know that some symmetries hold; for instance reality is symmetric under translations, but images have to pick some particular translation to capture only a specific part of reality. However, small perturbations in the translation of an image do not matter much for its contents, as the contents tend to be extended geometrically over a large part of the image. Therefore, if you do random cropping of your images, you still get basically the same images. Similarly, various other data augmentation methods can be viewed as being about symmetries. In particular there are a number of symmetry-oriented feature learners:

  • Facebook's DINO model (as well as various other models in this general class) trains an embedding to be invariant to various symmetries. As a result, it learns to pay attention to the objects in the scene.
  • I really like the concept in pi-GAN. Rather than training it with symmetries given by image augmentation over the input data, they train it using rotation symmetries over the latent variables. This allows it to learn 3D representations of faces and objects in an unsupervised fashion.
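The random-cropping argument can be illustrated with a toy image. This is only a sketch of why small translations preserve content. It is not how DINO or pi-GAN are implemented, and the image, crop size and "object area" feature are all made up.

```python
import random


def make_image(size=12, lo=4, hi=8):
    """A blank square image with one square 'object' of ones in the middle."""
    return [[1 if lo <= r < hi and lo <= c < hi else 0 for c in range(size)]
            for r in range(size)]


def random_crop(image, crop_size, rng):
    """Cut a crop_size x crop_size window at a random offset."""
    max_offset = len(image) - crop_size
    top, left = rng.randint(0, max_offset), rng.randint(0, max_offset)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


def object_area(image):
    """A crude 'content' feature: how many pixels belong to the object."""
    return sum(sum(row) for row in image)


rng = random.Random(0)
image = make_image()
crops = [random_crop(image, crop_size=10, rng=rng) for _ in range(20)]
areas = {object_area(crop) for crop in crops}
print(areas)  # prints: {16}
```

Every crop starts at a different offset, yet the content feature is identical across all of them. A learner trained to give the same embedding to all crops is thereby pushed towards features like this one and away from features tied to absolute pixel position.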

Finally, there is the notion of conserved properties. John Wentworth has persuasively argued that variables that act at a distance must have deterministically conserved mediators. I am only aware of one architecture that uses this insight, though, namely Noether Networks. Basically, Noether Networks are trained to identify features where your predictions can be improved by assuming the features to stay constant. They do not directly use these features for much, but they seem potentially promising for the future.

This set is probably not exhaustive. There are probably methods that are hyper-specialized to specific domains, and there are probably also other general methods. But I think it's a good general taste of what's available?

  1. ^

    Factor Analysis might not be a familiar algorithm to you, but it is computationally basically equivalent to Principal Component Analysis, which you are almost certainly familiar with. There's only a few caveats for this, and they don't really matter for the purposes of this post.

Nice summary, thanks for sharing!

You state in the first comment that they can be given causal justification.  As far as I understand you argue with covariances above. Can you elaborate on what this causal justification is?

In a causal universe, if you observe things in different places that correlate with each other, they must have a common cause. That's the principle VAEs/triplet losses/etc. can be understood as exploiting.

Right, but Reichenbach's principle of common cause doesn't tell you anything about how they are causally related? They could just be some nodes in a really large complicated causal graph. So I agree that we can assume causality somehow but we are much more interested in what the graph looks like, right?

So I agree that we can assume causality somehow but we are much more interested in what the graph looks like, right?

Not necessarily? Reality is really really big. It would be computationally infeasible to work with raw reality. Rather, you want abstractions that cover aggregate causality in a computationally practical way, throwing away most of the causal details. See also this: