This post spun off from the work I was doing at UChicago’s XLab over this summer. Thanks to Jay Kim, Jo Jiao, and Aryan Bhatt for feedback on this post! Also, thanks to Zack Rudolph, Rhea Kanuparthi, and Aryan Shrivastava for organizing & facilitating an incredible fellowship this summer.
This post sketches out a frame for thinking about broad classes of AI risk.[1] The goal is to help people identify gaps in safety cases and, by doing so, identify neglected directions in AI Safety research. Epistemic status: exploratory.
One-sentence summary: to understand the risks AI systems pose, it is useful to look at the affordances they have available to act upon the world.[2]
If you work in AI Safety, you’re probably worried about how these powerful systems will, now and in the future, affect the way we live. That is, you’re interested in the impact of AI “on the real world”. This impact can be either the direct or indirect result of actions taken by these systems. Direct risks include: malicious actors using coding agents to perform cyberattacks; LLMs telling users how to synthesize drugs; and sycophantic models that induce psychotic breaks. Indirect risks include: AI race dynamics escalating into conflicts between nations; automation of various parts of the economy causing unemployment; and gradual disempowerment scenarios.
Analysing the affordances of AI systems can help us prepare for the former category of risks (but less so for the latter). Why? Well, with indirect risks, (a) we might not clearly understand the relationship between the actions of an AI system and bad outcomes, and (b) there might be other, potentially more tractable levers for avoiding bad outcomes. For example, with race dynamics between nations, we’re not sure what threshold of capabilities, when crossed, might lead to conflict![3] Also, international cooperation might be a more tractable lever for avoiding race dynamics than (say) limiting capabilities to a ceiling safe enough to avoid sparking them. The latter feels convoluted to even think about.
Current AI systems (i.e. LLM-based systems) have few affordances through which they can actually ‘do things’ in the real world. The biggest channels seem to be: coding, tool use, and maybe human persuasion. I’ll briefly discuss the potential risks of, and interventions on, each of these affordances; but before that, some caveats:
Coding
Tool Use
The threat vectors / mediating pathways largely vary tool-by-tool. That being said, here’s a bit about tool use in general.
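To make ‘intervening on the tool-use affordance’ a bit more concrete, here’s a minimal sketch (my own illustration, not any real framework; the `ToolCall` shape and the tool names are made up) of a default-deny policy sitting between an agent and its tools: a small allowlist runs freely, sensitive tools get escalated for human review, and everything else is blocked.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """A hypothetical representation of an agent's requested tool call."""
    name: str
    args: dict = field(default_factory=dict)

@dataclass
class ToolPolicy:
    """Default-deny gating on the tool-use affordance."""
    allowed: set[str]       # tools the agent may call freely
    needs_review: set[str]  # tools that require human sign-off

    def check(self, call: ToolCall) -> str:
        if call.name in self.allowed:
            return "allow"
        if call.name in self.needs_review:
            return "escalate"  # pause and ask a human
        return "block"         # anything unlisted is denied by default

policy = ToolPolicy(
    allowed={"web_search", "read_file"},
    needs_review={"send_email", "execute_code"},
)

print(policy.check(ToolCall("web_search", {"query": "weather"})))     # allow
print(policy.check(ToolCall("send_email", {"to": "a@example.com"})))  # escalate
print(policy.check(ToolCall("delete_files", {"path": "/"})))          # block
```

The particular policy doesn’t matter; the point is that the tool-use channel gives us a narrow place to put checks, independent of *why* the model produced a given call.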
Human Persuasion
This is a bit more speculative, and I found it hard to say something concrete here.
This is not a novel frame by any means! It’s roughly security mindset applied to AI safety. It’s discussed, explicitly and implicitly, in work on safety cases and control. ↩︎
An affordance roughly means ‘the actions available to the AI system’. ↩︎
Note my eliding between actions that one specific AI system might take vs. the broader frontier of AI capabilities. ↩︎
Note: here, this is agnostic to *why* the backdoor was inserted (misaligned model, jailbroken model, unintentional backdoor, etc). We’re focusing on the mediating pathway directly, as opposed to multiple potential upstream causes.
Sometimes this is useful. Other times, if (say) we only have a few upstream causes, it might be useful to focus on them instead. ↩︎
Alternatively, a more ‘fundamental’ taxonomy might bundle code-writing & code-execution under text output & tool use respectively. ↩︎
Note: this is agnostic to *why* the backdoor was inserted (misaligned model, jailbroken model, unintentional backdoor, etc). It gives us more leverage to focus on the narrow mediating pathway directly, as opposed to the multiple upstream causes. ↩︎
Either simply through comments, or through more involved tracking in the IDE, or something in between. ↩︎