
Scheming AIs: Will AIs fake alignment during training in order to get power?

Nov 20, 2023 by Joe Carlsmith

This is a LessWrong sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv here: https://arxiv.org/pdf/2311.08379.pdf. It’s a long report, and I’m hoping that having shorter sections available as separate posts will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I'm hoping it will provide much of the context necessary to understand individual sections of the report on their own.

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Varieties of fake alignment (Section 1.1 of “Scheming AIs”)
A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)
Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)
On “slack” in training (Section 1.5 of “Scheming AIs”)
Situational awareness (Section 2.1 of “Scheming AIs”)
Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)
Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)
“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)
Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)
How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs")
The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs")
Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs")
Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs")
Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs")
The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs")
Simplicity arguments for scheming (Section 4.3 of "Scheming AIs")
Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)
Summing up "Scheming AIs" (Section 5)
Empirical work that might shed light on scheming (Section 6 of "Scheming AIs")