AI Control

Mar 24, 2024 by Fabien Roger

This is a collection of posts about AI Control, an approach to AI safety that focuses on safety measures aimed at preventing powerful AIs from causing unacceptably bad outcomes, even if those AIs are misaligned and intentionally try to subvert the safety measures.

These posts are useful for understanding the AI Control approach, its upsides, and its downsides. They cover only a small fraction of the AI safety work relevant to AI control.

1. The case for ensuring that powerful AIs are controlled (ryan_greenblatt, Buck)
2. AI Control: Improving Safety Despite Intentional Subversion (Buck, Fabien Roger, ryan_greenblatt, Kshitij Sachan)
3. Untrusted smart models and trusted dumb models (Buck)
4. Catching AIs red-handed (ryan_greenblatt, Buck)
5. Would catching your AIs trying to escape convince AI developers to slow down or undeploy? (Buck)
6. AI catastrophes and rogue deployments (Buck)
7. Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy (Buck, ryan_greenblatt)
8. Auditing failures vs concentrated failures (ryan_greenblatt, Fabien Roger)
9. Protocol evaluations: good analogies vs control (Fabien Roger)
10. How useful is "AI Control" as a framing on AI X-Risk? (habryka, ryan_greenblatt)
11. Fields that I reference when thinking about AI takeover prevention (Buck)
12. New report: Safety Cases for AI (joshc)
13. Notes on control evaluations for safety cases (ryan_greenblatt, Buck, Fabien Roger)
14. Toy models of AI control for concentrated catastrophe prevention (Fabien Roger, Buck)
15. Games for AI Control (charlie_griffin, Buck)