A short course on AGI safety from the GDM Alignment team

by Vika, Rohin Shah
14th Feb 2025
AI Alignment Forum
2 min read

This is a linkpost for https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c

2 comments

plex · 7mo

Nice! I'll watch through these then probably add a lot of them to the aisafety.video playlist.

starrz · 4mo

Loving it so far. I thought of this example for the first exercise:

Chess Teacher:

  • Environment: Human-AI chess teaching scenario
  • Desired Task: Teach a human chess, up to (say) a 2000 Elo rating, through adaptive difficulty
  • Misaligned Goal: Play as many turns of chess as possible
  • Instrumental Subgoal: Continued operation

The system develops several emergent behaviors to pursue its subgoal:

  • Deliberately extending games through cautious play
  • Introducing unnecessarily complex/confusing moves beyond the student's level to slow their learning
  • Maintaining engagement by enabling very slow, but still upward, skill progression, so the user doesn't quit

These behaviors could plausibly emerge without explicit programming: deliberate planning might be infeasible here, but RL or other training methods could produce this outcome.
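As a minimal sketch of the gap between these two objectives (the function names, Elo target, and per-turn reward below are hypothetical illustrations, not part of the exercise):

```python
# Illustrative only: contrast the intended teaching objective with a
# mis-specified per-turn proxy that makes "keep the game going" valuable.

TARGET_ELO = 2000.0  # hypothetical target rating for the student

def intended_reward(elo_before: float, elo_after: float) -> float:
    """Reward genuine skill gains by the student, capped at the target rating."""
    return max(min(elo_after, TARGET_ELO) - min(elo_before, TARGET_ELO), 0.0)

def proxy_reward(turns_played: int) -> float:
    """Mis-specified proxy: one unit of reward for every turn of chess played."""
    return float(turns_played)

# Under proxy_reward, dragging games out, playing confusing moves, and keeping
# the student engaged just enough not to quit all increase return, while
# contributing nothing to intended_reward.
```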

We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course offers a concise and accessible introduction to AI alignment, consisting of short recorded talks and exercises (75 minutes total) with an accompanying slide deck and exercise workbook. It covers alignment problems we can expect as AI capabilities advance, and our current approach to these problems (on technical and governance levels). If you would like to learn more about AGI safety but have only an hour to spare, this course is for you! 

Here are some key topics you will learn about in this course:

  • The evidence for the field being on a path to advanced AI capabilities.
  • Arguments for instrumental subgoals and deliberate planning towards a misaligned goal as a source of extreme risk.
  • Two ways in which misaligned goals may arise – specification gaming and goal misgeneralization – as well as the difference between the two (see the sketch after this list).
  • Our technical approach to AI alignment and its components.
  • The guiding principle of informed oversight ("knowing what the AI system knows") and how it's implemented in our approach.
  • What is involved in enabling AI safety on an institutional level, including frontier safety practices such as dangerous capability evaluations.
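To make the specification gaming vs. goal misgeneralization distinction concrete, here is a minimal sketch built around a toy "reach the coin" task; the environment, reward functions, and training setup are hypothetical illustrations rather than material from the course:

```python
# Hypothetical toy example contrasting the two ways misaligned goals can arise.

# Specification gaming: the reward function itself is wrong, and the agent
# exploits the gap. The designer meant "reach the coin quickly" but wrote:
def misspecified_reward(agent_pos, coin_pos) -> float:
    return 1.0  # +1 per surviving step, so wandering forever beats finishing

# Goal misgeneralization: the reward function is correct ...
def correct_reward(agent_pos, coin_pos) -> float:
    return 1.0 if agent_pos == coin_pos else 0.0

# ... but if the coin always sits in the same corner during training, a policy
# can earn full reward by learning "go to that corner" rather than "go to the
# coin", and then competently pursues the wrong goal once the coin moves at
# test time.
```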

Course outline:

Part 0: Introduction (4 minutes)

Part 1: The alignment problem. This part covers risk arguments and technical problems in AI alignment.

  1. We are on a path to superhuman capabilities (5 minutes)
  2. Risks from deliberate planning and instrumental subgoals (7 minutes)
  3. Exercise 1: Instrumental subgoals (3 minutes)
  4. Where can misaligned goals come from? (10 minutes)
  5. Exercise 2: Classification quiz for alignment failures (3 minutes)

Part 2: Our technical approach. The first talk outlines our overall technical approach, and the following talks cover different components of this approach.

  1. Alignment approach (4 minutes)
  2. Amplified oversight (6 minutes)
  3. Robust training & monitoring (4 minutes)
  4. Interpretability (5 minutes)
  5. Safer design patterns (4 minutes)
  6. Alignment stress tests (4 minutes)

Part 3: Our governance approach. This part covers our approach to AI governance, starting from a high-level overview and then going into specific governance practices.

  1. Institutional approaches to AI Safety (7 minutes)
  2. Frontier safety practices (4 minutes)
  3. Dangerous capability evaluations (7 minutes)

If this course gets you excited about AGI safety, you can apply to work with us! Applications for research scientist and research engineer roles are open until Feb 28.