Progress on Causal Influence Diagrams
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg

Crossposted from DeepMind Safety Research

About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then.

What are causal influence diagrams?

A key problem in AI alignment is understanding agent incentives. Concerns have been raised that agents may be incentivized to avoid correction, manipulate users, or inappropriately influence their learning. This is particularly worrying as training schemes often shape incentives in subtle and surprising ways. For these reasons, we're developing a formal theory of incentives based on causal influence diagrams (CIDs).

Here is an example of a CID for a one-step Markov decision process (MDP). The random variable S1 represents the state at time 1, A1 represents the agent's action, S2 the state at time 2, and R2 the agent's reward. The action A1 is modeled with a decision node (square) and the reward R2 is modeled as a utility node (diamond), while the states are normal chance nodes (rounded edges). Causal links specify that S1 and A1 influence S2, and that S2 determines R2. The information link S1 → A1 specifies that the agent knows the initial state S1 when choosing its action A1.

In general, random variables can be chosen to represent agent decision points, objectives, and other relevant aspects of the environment. In short, a CID specifies:

* Agent decisions
* Agent objectives
* Causal relationships in the environment
* Agent information constraints

These pieces of information are often essential when trying to figure out an agent's incentives: how an objective can be achieved depends on how it is causally related to other (influenceable) aspects of the environment, and an agent's optimization is constrained by what information it has access to. In many cases, the qualitative judgements expressed by a (non-parameterized) CID suffice to infer important aspects of an agent's incentives.
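To make the structure above concrete, here is a minimal sketch of the one-step MDP CID as an annotated directed graph. It uses the general-purpose networkx library rather than any dedicated CID tooling, and the node/edge attribute names (`kind`, `link`) are illustrative assumptions, not an established API.

```python
import networkx as nx

# Sketch of the one-step MDP CID described above.
# Node kinds follow the CID drawing convention: decision (square),
# utility (diamond), chance (rounded edges).
cid = nx.DiGraph()
cid.add_node("S1", kind="chance")    # state at time 1
cid.add_node("A1", kind="decision")  # agent's action
cid.add_node("S2", kind="chance")    # state at time 2
cid.add_node("R2", kind="utility")   # agent's reward

# Causal links: S1 and A1 influence S2, and S2 determines R2.
cid.add_edges_from([("S1", "S2"), ("A1", "S2"), ("S2", "R2")], link="causal")

# Information link: the agent observes S1 when choosing A1.
cid.add_edge("S1", "A1", link="information")

# Example qualitative query on the graph alone (no parameterization needed):
# which variables lie causally downstream of the decision A1?
print(nx.descendants(cid, "A1"))  # {'S2', 'R2'}
```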