LESSWRONG

DeepMind Alignment Team on Threat Models and Plans

Nov 25, 2022 by Vika

A collection of posts presenting our understanding of and opinions on alignment threat models and plans.

- DeepMind is hiring for the Scalable Alignment and Alignment Teams, by Rohin Shah, Geoffrey Irving (10mo ago; 146 karma, Ω 35)
- DeepMind alignment team opinions on AGI ruin arguments, by Vika (6mo ago; 370 karma, Ω 36)
- Will Capabilities Generalise More?, by Ramana Kumar (9mo ago; 119 karma, Ω 38)
- Clarifying AI X-risk, by zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, Elliot Catt (5mo ago; 107 karma, Ω 23)
- Threat Model Literature Review, by zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, Elliot Catt (5mo ago; 66 karma, Ω 4)
- Refining the Sharp Left Turn threat model, part 1: claims and mechanisms, by Vika, Vikrant Varma, Ramana Kumar, Mary Phuong (7mo ago; 78 karma, Ω 3)
- Refining the Sharp Left Turn threat model, part 2: applying alignment techniques, by Vika, Vikrant Varma, Ramana Kumar, Rohin Shah (4mo ago; 39 karma, Ω 8)
- Categorizing failures as “outer” or “inner” misalignment is often confused, by Rohin Shah (3mo ago; 82 karma, Ω 21)
- Definitions of “objective” should be Probable and Predictive, by Rohin Shah (3mo ago; 35 karma, Ω 27)
- Power-seeking can be probable and predictive for trained agents, by Vika, janos (22d ago; 33 karma, Ω 1)
- Paradigms of AI alignment: components and enablers, by Vika (10mo ago; 53 karma, Ω 4)
- [Linkpost] Some high-level thoughts on the DeepMind alignment team's strategy, by Vika, Rohin Shah (16d ago; 107 karma, Ω 12)