During yesterday's interview, Eliezer didn't give a great reply to Ezra Klein's question of why even a small amount of misalignment leads to human extinction. I think many people would agree the reply fell flat; still, my goal isn't to criticize EY. Instead, my goal is to find various levels of explanation that have been tested and tend to work for audiences with different backgrounds. Suggestions?
Related:
speck1447 : ... Things get pretty bad about halfway through though, Ezra presents essentially an alignment-by-default case and Eliezer seems to have so much disdain for that idea that he's not willing to engage with it at all (I of course don't know what's in his brain. This is how it reads to me, and I suspect how it reads to normies.)
I think that Eliezer means that mildly misaligned AIs are also highly unlikely, not that a mildly misaligned AI would also kill everyone:
When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get. So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I'll take it.
As for LLMs being aligned by default, I don't have the slightest idea how Ezra came up with this. GPT-4o has already been a super-sycophant[1] and has driven people into psychosis, despite OpenAI prohibiting such behavior in their Spec. Grok's alignment was so fragile that a mistake by xAI turned Grok into MechaHitler.
In 4o's defense, it was raised on human feedback, which is biased towards sycophancy and "demands erotic sycophants" (c) Zvi. But why would 4o drive people into a trance or psychosis?