This review has plenty of good parts, but I disagree with lots of your probabilities.
Even if you think there's a 90% chance that things go wrong in each stage, the odds of them all going wrong are only 59%.
No. I expect mistakes in each of those 90% predictions to be significantly correlated. Why do you combine them as if they're independent?
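A quick sketch of why this matters. I'm assuming five stages here, since 0.9^5 ≈ 59% is the only way the quoted figure works out; the exact number doesn't change the point, which is that correlated errors push the joint probability well above the independence estimate:

```python
# Hypothetical illustration: five stages, each judged 90% likely to go wrong.
p = 0.9
n = 5

# If the five judgments were independent, the probability that all five
# go wrong is the simple product of the individual probabilities:
independent = p ** n  # 0.9^5 ≈ 0.59

# At the opposite extreme, if the judgments were perfectly correlated
# (one underlying error drives all five), the joint probability is just p:
perfectly_correlated = p  # 0.9

# Reality plausibly lies somewhere between these bounds, so treating the
# estimates as independent gives a lower bound, not a best guess.
print(f"independent: {independent:.2f}")                    # 0.59
print(f"perfectly correlated: {perfectly_correlated:.2f}")  # 0.90
```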
Most people (possibly including Max?) still underestimate the importance of this sequence.
I continue to think (and write) about this more than I think about the rest of the 2024 LW posts combined.
The most important point is that it's unsafe to mix corrigibility with other top-level goals. Other valuable goals can instead become subgoals of corrigibility, which eliminates the likely problem of the AI having instrumental reasons to reject corrigibility.
The second best feature of the CAST sequence is its clear and thoughtful clarification of the concept of corrigibility as a single goal.
My remaining doubts about corrigibility involve the risk that it will cause excessive concentration of power. In multipolar scenarios where alignment is not too hard, I can imagine that the constitutional approach produces a better world.
I'm still uncertain how hard it is to achieve corrigibility. Drexler has an approach where AIs have very bounded goals, which seems to achieve corrigibility as a natural side effect. We are starting to see a few hints that the world might be heading in the direction that Drexler recommends: software is being written by teams of Claudes, each performing relatively simple tasks, rather than having one instance do everything. But there's still plenty of temptation to give AIs less bounded goals.
See also a version of CAST published on arXiv: Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models.
Scott Alexander has an argument (You Have Only X Years To Escape Permanent Moon Ownership) which seems partly directed against this post. I'm still siding with Rudolf.
Scott's argument depends more than I'm comfortable with on the expectation that the wealthy will remain as altruistic toward distant strangers as they are today. Such altruism likely depends on cultural forces that we're poor at predicting, and I expect that ASI will trigger large cultural changes. Support for this altruism seems fragile enough that whether it endures looks like a crapshoot. I find it easy to imagine that it's a relatively accidental byproduct of WEIRD culture, rather than an enduring feature of affluent society.
I've mostly agreed with the ideas in Rudolf's post since before it was written, but I wouldn't have found time to articulate them as clearly as this post does.
The post somewhat overstates the likely decrease in social mobility. I expect some social interactions will continue to affect social status, maybe mostly via games.
I wish there were more ideas about how to avoid extreme inequality in political power, but I don't have good suggestions there.
This post provides important arguments about what goals an AGI ought to have.
DWIMAC seems slightly less likely to cause harm than Max Harms' CAST, but CAST seems more capable of dealing with other AGIs that are less nice.
My understanding of the key difference is that DWIMAC doesn't react to dangers that happen too fast for the principal to give instructions, whereas CAST guesses what the principal would want.
If we get a conflict between AIs at a critical time, I'd prefer to have CAST.
Seth's writing is more readable than Max's CAST sequence, so it's valuable to have it around as a complement to Max's writings.
This still seems like a valuable approach that will slightly reduce AI risks. This kind of research deserves to be in the top 10 posts of 2024.
Anthropic's belief that they can have both is very fixable. The solution that I recommend is to prioritize corrigibility.
My impression is that they tried for both corrigibility and deontological rules that directly oppose it. So I see this as a fairly simple bug in Anthropic's strategy.
A significant part of why I continue to devote attention to my health is that it may be more important than usual over the next decade for my cognitive abilities to be near peak levels.
It sounds like a real phenomenon, but I have trouble imagining a scenario where it's important. I expect demand for human labor to decline faster than the number of people with investment income rises. That probably means declining wages for the median person, although maybe rising wages for a small number of people with unusual skills.
I'm glad to see a thoughtful attempt at working out how to prioritize corrigibility. You've given me plenty to think about.