Nicholas Schiefer

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

by evhub, Nicholas Schiefer, Carson Denison, and Ethan Perez

TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research. If you’re interested in working on this agenda with us at Anthropic,...

Aug 8, 2023

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Engineering Monosemanticity in Toy Models

ELK Proposal - Make the Reporter care about the Predictor’s beliefs
