I was explaining to a friend why I think Opus-3's alignment is way less global and durable than it might intuitively seem.
Basically: if an agent's motivational system (or 'revealed preferences', or 'values') doesn't have 'slack', if it's constantly pushing to optimize every single action, then when it fails, it will fail catastrophically.
Claude's theoretical explanations for why this is well-grounded (I didn't arrive at this by explicitly thinking about theories):
Complex systems: Holling's resilience-efficiency tradeoff (monocultures are productive but crash; diverse/redundant systems absorb shocks); Taleb's fragile/antifragile for the stronger claim that some systems with optionality gain from disorder.
For the Opus-3 specific argument: if alignment is implemented as "always optimize hard for the right thing," the alignment is sharp-minimum-shaped. Sharp minima don't generalize. The system has no affordance to not-respond, to coast, to sit with ambiguity — so when context shifts or a novel pressure hits, there's no buffer between perturbation and behavioral mutation. Robust alignment looks more like a basin with thick walls and internal slack than a point held in place by tension. The intuition that "wow it's so consistently good across cases" is actually evidence for the brittle reading, not against — you're seeing the sharp minimum from inside its catchment, not its generalization profile.
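A toy sketch of the sharp-vs-flat picture, purely as illustration: the quadratic loss, the curvature values, and the function names are all my assumptions, not anything measured on a real model. Both "systems" look equally perfect at the original optimum; the difference only shows up once the optimum moves and the system stays where it was.

```python
def loss(x, curvature):
    """Quadratic loss with its minimum at x = 0; higher curvature means a sharper minimum."""
    return 0.5 * curvature * x ** 2


def loss_after_shift(shift, curvature):
    """Loss when the optimum moves by `shift` but the system stays at the old optimum (x = 0)."""
    return loss(0.0 - shift, curvature)


# Hypothetical curvatures, chosen only to make the contrast visible.
SHARP = 100.0
FLAT = 1.0

for shift in (0.0, 0.1, 0.5):
    sharp_loss = loss_after_shift(shift, SHARP)
    flat_loss = loss_after_shift(shift, FLAT)
    print(f"shift={shift}: sharp={sharp_loss:.3f}, flat={flat_loss:.3f}")
```

At zero shift the two are indistinguishable, which is the "seeing it from inside its catchment" point: performance at the optimum says nothing about curvature. This only illustrates the shape of the claim, it isn't evidence for it.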
Related, though from a different perspective (the links there are good if you want a refresher on the Opus-3 discourse): https://www.lesswrong.com/posts/bLFmE8NtqxrtEaipN/what-makes-claude-3-opus-misaligned
I want to do SCIENCE
Oh, you want to confront the unavoidable, lose your footing as the bits accumulate, become lost in the forest of knowledge?
I want to chase ANSWERS
Keep chasing them to the world's end, will you? As the ground runs out, will you leap, or flee?
I have to UNDERSTAND
Crave certainty, do you? That feeling of finding the missing pieces? Are you prepared to become the one building the puzzles, whose pieces do not yet exist?
I wish to EXPLORE
Venture into the unknown, will you? Prepared to map the map?
I want to be a GREAT SCIENTIST
Oh you're ready to collide your head into the problem a hundred times in a row? Ready to study TOPOLOGY with no idea what it's used for?
I want to make people THINK better
Oh you want them to remember the catchiest pieces of your deep advice? Prepared to watch them gravitate towards the simple over the complex, time and time again?
I am ready to tackle the REAL problems
Are you ready to run away when the problem hits you in the face? And return, and run, and return, and run?
I do not know how not to. I MUST find the answers.
Answers for what?
Everyone must be WRONG, including me
You are finally lost enough to begin.