LESSWRONG
AlexMeinke
668100

Posts

Stress Testing Deliberative Alignment for Anti-Scheming Training | 125 karma | Ω | 1mo | 19 comments
Ablations for “Frontier Models are Capable of In-context Scheming” | 115 karma | 10mo | 1 comment
Frontier Models are Capable of In-context Scheming | 210 karma | Ω | 11mo | 24 comments
Training AI agents to solve hard problems could lead to Scheming | 72 karma | Ω | 1y | 12 comments
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs | 109 karma | 1y | 39 comments
Apollo Research 1-year update | 93 karma | Ω | 1y | 0 comments
A starter guide for evals | 58 karma | Ω | 2y | 2 comments
Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize | 45 karma | Ω | 2y | 4 comments