I haven't paid much attention to the formalism. It's unclear why formalism would be important under current approaches to implementing AI.
The basin of attraction metaphor is an imperfect way of communicating an advantage of corrigibility. A more accurate metaphor would portray that advantage as somewhat weaker and less reliable than the basin imagery suggests, but the advantage is still important.
The feedback loop issue seems like a criticism of current approaches to training and verifying AI, not of CAST. This issue might mean that we need a radical change in architecture. I'm more optimistic than Max about the ability of some current approaches (e.g. constitutional AI) to generalize well enough that we can delegate the remaining problems to AIs that are more capable than us.
I'm glad to see a thoughtful attempt at how to prioritize corrigibility. You've given me plenty to think about.
This review has plenty of good parts, but I disagree with lots of your probabilities.
Even if you think there’s a 90% chance that things go wrong in each stage, the odds of them all going wrong is only 59%.
No. I expect mistakes in each of those 90% predictions to be significantly correlated. Why do you combine them as if they're independent?
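A minimal sketch of the arithmetic in question, under my own assumption that the quoted 59% comes from multiplying five 90% per-stage estimates as if they were independent (the stage count and names below are illustrative, not from the review):

```python
# Sketch: the 59% figure follows from multiplying per-stage estimates as if independent.
# Five stages is an assumption, inferred because 0.9**5 ≈ 0.59 matches the quoted number.
p_stage = 0.9    # claimed probability that things go wrong at each stage
n_stages = 5     # hypothetical stage count

p_independent = p_stage ** n_stages   # multiply as if the stages were independent: ≈ 0.59
p_fully_correlated = p_stage          # one shared failure mode drives every stage: 0.90

print(f"independent:      {p_independent:.2f}")        # ≈ 0.59
print(f"fully correlated: {p_fully_correlated:.2f}")   # 0.90
```

With strong positive correlation, whether between the stages themselves or between the errors in the 90% estimates, the joint probability sits much closer to 0.9 than to 0.59, which is why multiplying the point estimates is misleading.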
Most people (possibly including Max?) still underestimate the importance of this sequence.
I continue to think (and write) about this more than I think about the rest of the 2024 LW posts combined.
The most important point is that it's unsafe to mix corrigibility with other top-level goals. Other valuable goals can become subgoals of corrigibility. That eliminates the likely problem of the AI having instrumental reasons to reject corrigibility.
The second-best feature of the CAST sequence is its clear and thoughtful explication of the concept of corrigibility as a single goal.
My remaining doubts about corrigibility involve the risk that it will cause excessive concentration of power. In multipolar scenarios where alignment is not too hard, I can imagine that the constitutional approach produces a better world.
I'm still uncertain how hard it is to achieve corrigibility. Drexler has an approach where AIs have very bounded goals, which seems to achieve corrigibility as a natural side effect. We are starting to see a few hints that the world might be heading in the direction that Drexler recommends: software is being written by teams of Claudes, each performing relatively simple tasks, rather than having one instance do everything. But there's still plenty of temptation to give AIs less bounded goals.
See also a version of CAST published on arXiv: Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models.
Scott Alexander has an argument (You Have Only X Years To Escape Permanent Moon Ownership) which seems partly directed against this post. I'm still siding with Rudolf.
Scott's argument depends more than I'm comfortable with on the expectation that the wealthy will remain as altruistic toward distant strangers as they are today. I expect that such altruism depends strongly on cultural forces that we're poor at predicting. I expect that ASI will trigger large cultural changes. Support for such altruism looks fragile enough that whether it endures seems like a crapshoot. I find it easy to imagine that such altruism is a relatively accidental byproduct of WEIRD culture, rather than an enduring feature of affluent society.
I've mostly agreed with the ideas in Rudolf's post since before it was written, but I wouldn't have found time to articulate them as clearly as this post does.
The post somewhat overstates the likely decrease in social mobility. I expect some social interactions will continue to affect social status, maybe mostly via games.
I wish there were more ideas about how to avoid extreme inequality in political power, but I don't have good suggestions there.
This post provides important arguments about what goals an AGI ought to have.
DWIMAC seems slightly less likely to cause harm than Max Harms' CAST, but CAST seems more capable of dealing with other AGIs that are less nice.
My understanding of the key difference is that DWIMAC doesn't react to dangers that happen too fast for the principal to give instructions, whereas CAST guesses what the principal would want.
If we get a conflict between AIs at a critical time, I'd prefer to have CAST.
Seth's writing is more readable than Max's CAST sequence, so it's valuable to have it around as a complement to Max's writings.
This still seems like a valuable approach that will slightly reduce AI risks. This kind of research deserves to be in the top 10 posts of 2024.
The belief that they can do both is very fixable. The solution that I recommend is to prioritize corrigibility.
My impression is that they tried for both corrigibility and deontological rules that are directly opposed to corrigibility. So I see it as a fairly simple bug in Anthropic's strategy.
I agree with all your points except this:
I expect there's lots of room to disguise distributed AIs so that they're hard to detect.
Maybe there's some level of AI capability where the good AIs can do an adequate job of policing a slowdown. But I don't expect a slowdown that starts today to be stable for more than 5 to 10 years.