This is our current distillation of the sharp left turn threat model and an attempt to make it more concrete. We will discuss our understanding of the claims made in this threat model, and propose some mechanisms for how a sharp left turn could happen. This is a work in progress, and we welcome feedback and corrections.
Claim 1. Capabilities will generalize far (i.e., to many domains)
There is an AI system that:
Generalization is a key component of this threat model because we're not going to directly train an AI system for the task of disempowering humanity, so for the system to be good at this task, the capabilities it develops during training need to be more broadly applicable.
Some optional sub-claims can be made that increase the risk level of the threat model:
Claim 1a [Optional]: Capabilities (in different "domains") will all generalize at the same time
Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously)
Claim 2. Alignment techniques that worked previously will fail during this transition
Claim 3. Humans can’t intervene to prevent or align this transition
Capabilities will generalize far if learning / reasoning / cognitive work is done outside of the optimization process, similarly to how human cultural evolution happens outside genetic evolution and proceeds faster. Here are some mechanisms for cognitive work getting done outside the optimization process:
A rapid phase transition happens if there is a capability overhang: the AI system is improving at various skills continuously, but its improvement in many domains is bottlenecked on one specific skill, and at some point it receives some input that makes its existing capabilities much more effective. Here are some ways this can happen:
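The overhang dynamic described in the paragraph above can be pictured with a toy model. This is only a sketch under loud assumptions: the `min` functional form, the linear skill growth, the `unlock_step` parameter, and every number below are hypothetical choices made for illustration, not something the threat model specifies.

```python
from typing import List

# Toy illustration only: the functional form and all numbers are assumptions.


def effective_capability(domain_skill: float, bottleneck_skill: float) -> float:
    """Capability realized in a domain is capped by a shared bottleneck skill."""
    return min(domain_skill, bottleneck_skill)


def simulate(steps: int = 10, unlock_step: int = 6) -> List[float]:
    """Return the effective capability at each training step.

    The latent domain skill grows by 1.0 per step (continuous improvement).
    The bottleneck skill stays at 1.0 until `unlock_step`, when a hypothetical
    input removes the bottleneck entirely.
    """
    history = []
    for step in range(steps):
        domain_skill = 1.0 + step  # latent skill keeps improving smoothly
        bottleneck = 1.0 if step < unlock_step else float("inf")
        history.append(effective_capability(domain_skill, bottleneck))
    return history


if __name__ == "__main__":
    print(simulate())
```

In this sketch the latent skill improves by the same amount every step, yet the realized capability stays flat at 1.0 for the first six steps and then jumps to 7.0 in a single step once the bottleneck is lifted. The discontinuity comes from releasing accumulated latent skill, not from any sudden change in the underlying rate of improvement.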
We will discuss mechanisms for Claim 2 in a future post.
I still endorse the breakdown of "sharp left turn" claims in this post. Writing this helped me understand the threat model better (or at all) and make it a bit more concrete.
This post could be improved by explicitly relating the claims to the "consensus" threat model summarized in "Clarifying AI X-risk". Overall, SLT seems like a special case of that threat model, which makes a subset of the SLT claims:
Claim 1: there is an AI system that (1) performs well ... (2) generalizes far outside of its training distribution.
Don't humans provide an existence proof of this? The point about there being a 'core' of general intelligence seems unnecessary.
I agree that humans satisfying the conditions of Claim 1 is an argument in favour of it being possible to build machines that do the same. A couple of points. First, I think the threat model would posit the core of general intelligence as the reason both why humans can do these things and why the first AGI we build might also do these things. Second, Claim 1 should perhaps be clearer that it is not just saying such an AI design is possible, but that it is likely to be found and built.
No, humans do not satisfy the assumptions adopted here, unless you make them more specific.
The definition of "generalizes" is given above as: "Generalizes, i.e., performs well in new domains, which were not optimized for during training, with no domain-specific tuning".
Whether you think humans do this depends on what you take "new domains" and "perform well" to mean.
Humans taught to crawl on hardwood floors can crawl on carpeted floors. Humans taught to hunt fly larvae will need further training to hunt big game.