This post spun off from the work I was doing at UChicago’s XLab over this summer. Thanks to Jay Kim, Jo Jiao, and Aryan Bhatt for feedback on this post! Also, thanks to Zack Rudolph, Rhea Kanuparthi, and Aryan Shrivastava for organizing & facilitating an incredible fellowship this summer.
This post sketches out a frame for thinking about broad classes of AI risk.[1] The goal is to help people identify gaps in safety cases and, by doing so, identify neglected directions in AI Safety research. Epistemic status: exploratory.
One-sentence summary: to understand the risks AI systems pose, it is useful to look at the affordances they have available to act upon the world.[2]
If you work in AI Safety, you’re probably worried about how these powerful systems will, now and in the future, affect the way we live. That is, you’re interested in the impact of AI “on the real world”. This impact can be either the direct or indirect result of actions taken by these systems. Direct risks include: malicious actors using coding agents to perform cyberattacks; LLMs telling users how to synthesize drugs; and sycophantic models that induce psychotic breaks. Indirect risks include: AI race dynamics escalating into conflicts between nations; automation of various parts of the economy causing unemployment; and gradual disempowerment scenarios.
Analysing the affordances of AI systems can help us prepare for the former category of risks (but less so for the latter). Why? Well, with indirect risks, (a) we might not clearly understand the relationship between the actions of an AI system and bad outcomes, and (b) there might be other, potentially more tractable levers for avoiding bad outcomes. For example, with race dynamics between nations, we’re not sure what threshold of capabilities, when crossed, might lead to conflict![3] Also, international cooperation might be a more tractable lever for avoiding race dynamics than (say) limiting capabilities to a ceiling safe enough to avoid sparking them. The latter feels convoluted to even think about.
Current AI systems (i.e. LLM-based systems) have few affordances through which they can actually ‘do things’ in the real world. The biggest channels seem to be: coding, tool use, and maybe human persuasion. I’ll briefly discuss the potential risks of, and interventions on, each of these affordances; but before that, some caveats:
Coding
Tool Use
The threat vectors / mediating pathways largely vary tool-by-tool. That being said, here’s a bit about tool use in general.
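To make ‘intervening on the tool-use affordance’ a bit more concrete, here’s a minimal sketch (my own illustration, not any real framework; the `ToolCall` shape and the tool names are made up) of a default-deny policy sitting between an agent and its tools: a small allowlist runs freely, sensitive tools get escalated for human review, and everything else is blocked.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """A hypothetical representation of an agent's requested tool call."""
    name: str
    args: dict = field(default_factory=dict)

@dataclass
class ToolPolicy:
    """Default-deny gating on the tool-use affordance."""
    allowed: set[str]       # tools the agent may call freely
    needs_review: set[str]  # tools that require human sign-off

    def check(self, call: ToolCall) -> str:
        if call.name in self.allowed:
            return "allow"
        if call.name in self.needs_review:
            return "escalate"  # pause and ask a human
        return "block"         # anything unlisted is denied by default

policy = ToolPolicy(
    allowed={"web_search", "read_file"},
    needs_review={"send_email", "execute_code"},
)

print(policy.check(ToolCall("web_search", {"query": "weather"})))     # allow
print(policy.check(ToolCall("send_email", {"to": "a@example.com"})))  # escalate
print(policy.check(ToolCall("delete_files", {"path": "/"})))          # block
```

The particular policy doesn’t matter; the point is that the tool-use channel gives us a narrow place to put checks, independent of *why* the model produced a given call.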
Human Persuasion
This is a bit more speculative, and I found it hard to say something concrete here.
This is not a novel frame by any means! It’s roughly security mindset applied to AI safety. It’s discussed, explicitly and implicitly, in work on safety cases and control. ↩︎
An affordance roughly means ‘the actions available to the AI system’. ↩︎
Note my eliding between actions that one specific AI system might take vs. the broader frontier of AI capabilities. ↩︎
Note: here, this is agnostic to *why* the backdoor was inserted (misaligned model, jailbroken model, unintentional backdoor, etc). We’re focusing on the mediating pathway directly, as opposed to multiple potential upstream causes.
Sometimes this is useful. Other times, if (say) we only have a few upstream causes, it might be useful to focus on them instead. ↩︎
Alternatively, a more ‘fundamental’ taxonomy might bundle code-writing & code-execution under text output & tool use respectively. ↩︎
Note: this is agnostic to *why* the backdoor was inserted (misaligned model, jailbroken model, unintentional backdoor, etc). It gives us more leverage to focus on the narrow mediating pathway directly, as opposed to the multiple upstream causes. ↩︎
Either simply through comments, or through more involved tracking in the IDE, or something in between. ↩︎