Research Scientist at DeepMind


Investigating causal understanding in LLMs

Really interesting, even though the results aren't that surprising. I'd be curious to see how the results improve (or not) with more recent language models. I also wonder if there are other formats for testing causal understanding. For example, what if the model receives a more natural story plot (about Red Riding Hood, say) and is asked some causal questions ("what would have happened if grandma wasn't home when the wolf got there?", say)?

It's less clean, but it could be interesting to probe it in a few different ways.

Various Alignment Strategies (and how likely they are to work)

Nice post! The Game Theory / Bureaucracy strategy is interesting. It reminds me of Drexler's CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.

Causality, Transformative AI and alignment - part I

Thanks Marius and David, really interesting post, and super glad to see interest in causality picking up!

I very much share your "hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape."

Most relevant, I've been working with Mary Phuong on a project which seems quite related to what you are describing here. I don't want to share too many details publicly without checking with Mary first, but if you're interested perhaps we could set up a call sometime?

I also think causality is relevant to AGI safety in several additional ways to those you mention here. In particular, we've been exploring how to use causality to describe agent incentives for things like corrigibility and tampering (summarized in this post), formalizing ethical concepts like intent, and understanding agency.

So really curious to see where your work is going and potentially interested in collaborating!

Causality, Transformative AI and alignment - part I

There are numerous techniques for this, based on e.g. symmetries, conserved properties, covariances, etc. These techniques can generally be given causal justification.


I'd be curious to hear more about this, if you have some pointers.

Progress on Causal Influence Diagrams

Thanks Ilya for those links. In particular, the second one looks quite relevant to something we’ve been working on in a rather different context (that's the benefit of speaking the same language!).

We would also be curious to see a draft of the MDP-generalization once you have something ready to share!

AMA: Paul Christiano, alignment researcher


  • I think the existing approach and easy improvements don't seem like they can capture many important incentives such that you don't want to use it as an actual assurance (e.g. suppose that agent A is predicting the world and agent B is optimizing A's predictions about B's actions---then we want to say that the system has an incentive to manipulate the world but it doesn't seem like that is easy to incorporate into this kind of formalism).


This is what multi-agent incentives are for (i.e. incentive analysis in multi-agent CIDs).  We're still working on these as there are a range of subtleties, but I'm pretty confident we'll have a good account of it.

Counterfactual control incentives

Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there. 

(Sorry btw for slow reply; I keep missing alignmentforum notifications.)

Counterfactual control incentives

Thanks Stuart and Rebecca for a great critique of one of our favorite CID concepts! :)

We agree that the lack of a control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.

What control incentives do capture are the instrumental goals of the agent. Controlling X can be a subgoal for achieving utility if and only if the CID admits a control incentive on X. For this reason, we have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from "control as a side effect".
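
To make the distinction concrete, here is a minimal sketch of how one might check this graphically. It assumes the single-decision criterion as I understand it (the CID admits an instrumental control incentive on X when there is a directed path from the decision, through X, to a utility node). The toy graph, the node names, and the helper functions `reachable` and `admits_ici` are hypothetical and only for illustration.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from `start` along directed edges (including `start`)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def admits_ici(graph, decision, utilities, x):
    """Check for a directed path decision -> ... -> x -> ... -> some utility node.

    Assumption: single-decision CID, encoded as a dict mapping each node
    to a list of its children.
    """
    if x not in reachable(graph, decision):
        return False  # the decision cannot influence x at all
    downstream_of_x = reachable(graph, x)
    return any(u in downstream_of_x for u in utilities)


# Hypothetical toy CID: D is the decision, U the utility node.
# X lies on a path from D to U; S is affected by D but does not feed into U;
# Z feeds into U but is not affected by D.
cid = {"D": ["X", "S"], "X": ["U"], "Z": ["U"]}

print(admits_ici(cid, "D", {"U"}, "X"))
print(admits_ici(cid, "D", {"U"}, "S"))
print(admits_ici(cid, "D", {"U"}, "Z"))
```

In this toy graph the agent can influence S, but only as a side effect: S is not on any path to utility, so the check reports no instrumental control incentive on it.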

(A -> B) -> A in Causal DAGs

Glad you liked it.

Another thing you might find useful is Dennett's discussion of what an agent is (see the first few chapters of From Bacteria to Bach and Back). Basically, he argues that an agent is something we ascribe beliefs and goals to. If he's right, then an agent should basically always have a utility function.

Your post focuses on the belief part, which is perhaps the more interesting aspect when thinking about strange loops and similar.
