If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training
Current AI safety approaches have a structural problem: they ask the same system both to generate outputs and to judge whether those outputs are safe. RLHF trains a model to produce outputs that a reward model scores highly. Constitutional AI has the model critique its own outputs against a set of principles. In both cases,...