If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training
Current AI safety approaches have a structural problem: they ask the same system both to generate outputs and to judge whether those outputs are safe. RLHF trains a model to produce outputs that a reward model scores highly. Constitutional AI has the model critique its own outputs against a set of principles. In both cases,...