AI Control
Edited by Ben Pace; last updated 17th Aug 2024

AI Control, in the context of AI Alignment, is a category of plans that aim to ensure safety, and extract benefit, from AI systems even if those systems are goal-directed and actively trying to subvert your control measures. From The case for ensuring that powerful AIs are controlled:

In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.

and:

There are two main lines of defense you could employ to prevent schemers from causing catastrophes.

  • Alignment: Ensure that your models aren't scheming.[2]
  • Control: Ensure that even if your models are scheming, you'll be safe, because they are not capable of subverting your safety measures.[3]
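
The "Control" line of defense above is usually made concrete as a control protocol. The simplest protocol studied in AI Control: Improving Safety Despite Intentional Subversion (listed below) is trusted monitoring: a weaker but trusted model scores how suspicious each output of the powerful untrusted model looks, and the most suspicious outputs are routed to a limited human auditing budget. The Python sketch below illustrates only the control flow; the function names and the threshold are hypothetical placeholders, not an API from the paper.

```python
# Minimal sketch of a "trusted monitoring" control protocol.
# All names (untrusted_generate, trusted_monitor, trusted_generate,
# AUDIT_THRESHOLD) are hypothetical placeholders for illustration.

from typing import Callable

AUDIT_THRESHOLD = 0.8  # outputs scoring above this go to a human auditor

def untrusted_generate(task: str) -> str:
    """The powerful, possibly-scheming model attempts the task."""
    raise NotImplementedError  # placeholder

def trusted_monitor(task: str, output: str) -> float:
    """A weaker trusted model rates how suspicious the output looks (0 to 1)."""
    raise NotImplementedError  # placeholder

def trusted_generate(task: str) -> str:
    """A weaker trusted model produces a (less capable) fallback solution."""
    raise NotImplementedError  # placeholder

def run_with_control(task: str, audit: Callable[[str, str], bool]) -> str:
    """Run one task under trusted monitoring.

    The untrusted model does the work; the trusted monitor flags suspicious
    outputs; flagged outputs go to a human audit, and confirmed sabotage is
    replaced by the trusted model's answer.
    """
    output = untrusted_generate(task)
    if trusted_monitor(task, output) >= AUDIT_THRESHOLD:
        if audit(task, output):  # True if the auditor confirms sabotage
            return trusted_generate(task)
    return output
```

In the paper's experiments the human audit budget is a small fixed fraction of tasks, so in practice the threshold would be set so that only roughly that fraction of outputs gets flagged; the protocol's safety-usefulness tradeoff comes from how often the weaker trusted model's work replaces the untrusted model's.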
Posts tagged AI Control
  • AI Control: Improving Safety Despite Intentional Subversion, by Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan (2y; 239 karma, 24 comments)
  • The case for ensuring that powerful AIs are controlled, by ryan_greenblatt, Buck (2y; 278 karma, 73 comments)
  • The Case Against AI Control Research, by johnswentworth (8mo; 359 karma, 82 comments)
  • Catching AIs red-handed, by ryan_greenblatt, Buck (2y; 113 karma, 27 comments)
  • How to prevent collusion when using untrusted models to monitor each other, by Buck (1y; 90 karma, 12 comments)
  • Schelling game evaluations for AI control, by Olli Järviniemi (1y; 71 karma, 5 comments)
  • How useful is "AI Control" as a framing on AI X-Risk?, by habryka, ryan_greenblatt (2y; 70 karma, 4 comments)
  • AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt, by DanielFilan (1y; 69 karma, 10 comments)
  • Critiques of the AI control agenda, by Jozdien (2y; 48 karma, 14 comments)
  • AI Control May Increase Existential Risk, by Jan_Kulveit (6mo; 99 karma, 13 comments)
  • Ctrl-Z: Controlling AI Agents via Resampling, by Aryan Bhatt, Buck, Adam Kaufman, Cody Rushing, Tyler Tracy (5mo; 124 karma, 0 comments)
  • Preventing Language Models from hiding their reasoning, by Fabien Roger, ryan_greenblatt (2y; 119 karma, 15 comments)
  • Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming, by Buck (1y; 101 karma, 4 comments)
  • Why imperfect adversarial robustness doesn't doom AI control, by Buck, Claude+ (10mo; 62 karma, 25 comments)
  • Notes on control evaluations for safety cases, by ryan_greenblatt, Buck, Fabien Roger (2y; 49 karma, 0 comments)
(Showing 15 of 144 posts tagged AI Control.)