Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog: vkrakovna.wordpress.com
I really enjoyed this sequence, it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience.
I think this is still one of the most comprehensive and clear resources on counterpoints to x-risk arguments. I have referred to this post and pointed people to a number of times. The most useful parts of the post for me were the outline of the basic x-risk case and section A on counterarguments to goal-directedness (this was particularly helpful for my thinking about threat models and understanding agency).
I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.
This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in Clarifying AI X-risk. Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims:
I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk.
In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), which I would summarize as "non-deceptiveness is anti-natural / hard to disentangle from general capabilities". I would probably put this under "SG → MAPS", assuming an irreducible kind of specification gaming where it's very difficult (or impossible) to distinguish deceptiveness from non-deceptiveness (including through feedback on the model's reasoning process). Though it could also be GMG, where the "non-deceptiveness" concept is incoherent and thus very difficult to generalize well.
I'm glad I ran this survey, and I expect the overall agreement distribution probably still holds for the current GDM alignment team (or may have shifted somewhat in the direction of disagreement), though I haven't rerun the survey so I don't really know. Looking back at the "possible implications for our work" section, we are working on basically all of these things.
Thoughts on some of the cruxes in the post based on last year's developments:
I hoped to see other groups do the survey as well - looks like this didn't happen, though a few people asked me to share the template at the time. It would be particularly interesting if someone ran a version of the survey with separate ratings for "agreement with the statement" and "agreement with the implications for risk".
I agree that a possible downside of talking about capabilities is that people might assume they are uncorrelated and we can choose not to create them. It does seem relatively easy to argue that deception capabilities arise as a side effect of building language models that are useful to humans and good at modeling the world, as we are already seeing with examples of deception / manipulation by Bing etc.
I think the people who think we can avoid building systems that are good at deception often don't buy the idea of instrumental convergence either (e.g. Yann LeCun), so I'm not sure that arguing for correlated capabilities in terms of intelligence would have an advantage.
Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.
The issue with being informal is that it's hard to tell whether you are right. You use words like "motivations" without defining what you mean, and this makes your statements vague enough that it's not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn't seems to rule out that shards can be modeled as contextually activated subagents with utility functions.)
An upside of formalism is that you can tell when it's wrong, and thus it can help make our thinking more precise even if it makes assumptions that may not apply. I think defining your terms and making your arguments more formal should be a high priority. I'm not advocating spending hundreds of hours proving theorems, but moving in the direction of formalizing definitions and claims would be quite valuable.
It seems like a bad sign that the most clear and precise summary of shard theory claims was written by someone outside your team. I highly agree with this takeaway from that post: "Making a formalism for shard theory (even one that’s relatively toy) would probably help substantially with both communicating key ideas and also making research progress." This work has a lot of research debt, and paying it off would really help clarify the disagreements around these topics.
Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like "We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts".
Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification.
I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the "strong distributional shift" assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can't generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking tendencies. That's why the title says "power-seeking can be predictive" not "training-compatible goals can be predictive".
The hypothesis you mentioned seems compatible with the assumptions of this post. When you say "the policy develops motivations related to obvious correlates of its historical reinforcement signals", these "motivations" seem like a kind of training-compatible goals (if defined more broadly than in this post). I would expect that a system that pursues these motivations in new situations would exhibit some power-seeking tendencies because those correlate with a lot of reinforcement signals.
I suspect a lot of the disagreement here comes from different interpretations of the "internal representations of goals" assumption, I will try to rephrase that part better.