Anthropic was right…but “right” isn’t provable without formal specification.
The entire AI safety industry has spent billions guessing what alignment means: training classifiers, red-teaming models, writing policies, and straight-up hoping behavior holds under pressure. What it hasn't done is formalize the alignment approach itself. Alignment is treated as a probabilistic prediction problem rather than a structural guarantee…
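To make that contrast concrete, here is a minimal sketch in my own notation (the policy $\pi$, input space $\mathcal{X}$, data distribution $\mathcal{D}$, and safety predicate $\mathsf{Safe}$ are illustrative placeholders, not any lab's actual spec):

```latex
% Illustrative contrast: a probabilistic alignment claim versus a
% structural one, for a policy \pi over inputs \mathcal{X}.
\[
  \underbrace{\Pr_{x \sim \mathcal{D}}\!\bigl[\pi(x) \in \mathsf{Safe}\bigr] \ge 1 - \varepsilon}_{\text{probabilistic prediction}}
  \qquad \text{vs.} \qquad
  \underbrace{\forall x \in \mathcal{X}.\ \pi(x) \in \mathsf{Safe}}_{\text{structural guarantee}}
\]
```

Evals and red-teaming estimate the left-hand statement on sampled inputs; the right-hand statement is what a formal specification would let you prove, and it only exists once $\mathsf{Safe}$ is actually written down.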