The “multiple stage fallacy fallacy” is the fallacious idea that equations like $P(A \wedge B \wedge C \wedge D) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \wedge B) \cdot P(D \mid A \wedge B \wedge C)$ are false, when in fact they are true. :-P
I think Nate here & Eliezer here are pointing to something real, but the problem is not multiple stages per se but rather (1) “treating stages as required when in fact they’re optional” and/or (2) “failing to properly condition on the conditions and as a result giving underconfident numbers”. For example, if A & B & C have all already come true in some possible universe, then that’s a universe where maybe you have learned something important and updated your beliefs, and you need to imagine yourself in that universe before you try to evaluate $P(D \mid A \wedge B \wedge C)$.
Of course, that paragraph is just parroting what Eliezer & Nate wrote, if you read what they wrote. But I think other people on LW have too often skipped over the text and just latched onto the name “multiple stages fallacy” instead of drilling down to the actual mistake.
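To make point (2) concrete, here is a toy simulation with made-up numbers (purely illustrative, not a model of any real AI project): when the stages share a common cause, conditioning on A & B & C having come true should push your probability for D well above its naive unconditional value.

```python
# Toy sketch of failure mode (2): sub-propositions that share a common cause.
# All numbers here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Shared latent factor, e.g. "this is a universe where the team is competent/lucky".
competent = rng.random(n) < 0.5

def stage(p_if_competent=0.9, p_otherwise=0.3):
    """Each stage succeeds with high probability if competent, low otherwise."""
    p = np.where(competent, p_if_competent, p_otherwise)
    return rng.random(n) < p

A, B, C, D = stage(), stage(), stage(), stage()

print(D.mean())             # unconditional P(D): about 0.6
print(D[A & B & C].mean())  # P(D | A & B & C): about 0.88 -- much higher
```

Quoting per-stage numbers that ignore that update is exactly how a decomposition ends up underconfident.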
In the case at hand, I don’t have much opinion in the absence of more details about the AI training approach etc., but here are a couple of general comments.
If an AI development team notices Problem A and fixes it, and then notices Problem B and fixes it, and then notices Problem C and fixes it, then its track record is one of fixing problems reactively rather than preemptively, so we should expect that it’s less likely, not more likely, that this same team will preempt Problem D before it actually occurs.
Conversely, if the team has a track record of preempting every problem before it arises (when the problems are low-stakes), then we can have incrementally more hope that they will also preempt high-stakes problems.
Likewise, if there simply are no low-stakes problems to preempt or respond to, because it’s a kind of system that just automatically by its nature has no problems in the first place, then we can feel generically incrementally better about there not being high-stakes problems.
Those comments are all generic, and readers are now free to argue with each other about how they apply to present and future AI. :)
The neural tangent kernel[1] provides an intuitive story for how neural networks generalize: a gradient update on one datapoint shifts the network's outputs on similar datapoints (similar as measured by the hidden activations of the NN) in a similar way.
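As a concrete sketch of that story (the tiny two-layer network and all numbers below are made up for illustration, not taken from any reference): to first order, one SGD step on datapoint x1 changes the output on datapoint x2 in proportion to the inner product of the two datapoints' parameter gradients, i.e. the empirical NTK.

```python
# Minimal empirical-NTK sketch: a gradient step on x1 moves f(x2) in proportion
# to the gradient inner product k(x1, x2).
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network: f(x) = v . tanh(W x)
d_in, d_hidden = 5, 64
W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_hidden, d_in))
v = rng.normal(scale=1.0 / np.sqrt(d_hidden), size=d_hidden)

def forward(x, W, v):
    return v @ np.tanh(W @ x)

def param_grad(x, W, v):
    """Gradient of f(x) with respect to (W, v), flattened into one vector."""
    h = np.tanh(W @ x)
    dv = h                            # df/dv is just the hidden activations
    dW = np.outer(v * (1 - h**2), x)  # df/dW
    return np.concatenate([dW.ravel(), dv])

x1, x2 = rng.normal(size=d_in), rng.normal(size=d_in)
g1, g2 = param_grad(x1, W, v), param_grad(x2, W, v)

# Empirical NTK: similarity of the two datapoints in "gradient space".
k12 = g1 @ g2

# One SGD step on the squared error at x1 ...
y1_target, lr = 1.0, 1e-3
err = forward(x1, W, v) - y1_target
theta_step = lr * err * g1            # gradient of 0.5 * err**2 w.r.t. params
W_new = W - theta_step[: W.size].reshape(W.shape)
v_new = v - theta_step[W.size:]

# ... and compare the actual change in f(x2) with the kernel prediction.
actual = forward(x2, W_new, v_new) - forward(x2, W, v)
predicted = -lr * err * k12           # first-order (NTK) approximation
print(actual, predicted)              # nearly equal for small lr
```

Two datapoints with a large gradient inner product (which, for the last layer, reduces to similarity of their hidden activations) get moved together; dissimilar ones are barely affected.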
The vast majority of LLM capabilities still arise from mimicking human choices in particular circumstances. This gives you a substantial amount of alignment "for free" (since you don't have to worry that the LLMs will grab excess power in situations where humans wouldn't), but it also limits you to ~human-level capabilities.
"Gradualism" can mean that fundamentally novel methods only make incremental progress on outcomes, but in most people's imagination I think it rather means that people will keep the human-mimicking capabilities generator as the source of progress, mainly focusing on scaling it up instead of on deriving capabilities by other means.
Maybe I should be cautious about invoking this without linking to a comprehensible explanation of what it means, since most resources on it are kind of involved...
In analogy, someone who wanted to argue that it's infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an "ought"/steering problem, not an "is"/prediction problem! One-shotting all of those separate problems isn't something that human beings can do in real life, the argument would go. But of course, the problems aren't independent, and text-to-image generators do exist.
Isn't part of the deal here that we didn't one-shot image generation, though?
The first image generators were crazy; we slowly iterated on them. And image generation is "easy" because, unlike superintelligence or even self-driving cars or regular ol' production code, nothing particularly bad happens if a given image is bad.
Doomimir: The possibility of AGI being developed gradually doesn't obviate the problem of the "first critical try": the vast hypermajority of AGIs that seem aligned in the "Before" regime when they're weaker than humans, will still want to kill the humans "After" they're stronger and the misalignment can no longer be "corrected". The speed of the transition between those regimes doesn't matter. The problem still exists and is still fatal whether it takes a day or a decade.
Simplicia: I agree that the risk you describe is real, but I don't understand why you're so sure the risk is high. As we've discussed previously, the surprising fact that deep learning works at all comes down to generalization. In principle, an astronomical number of functions are compatible with the training data, the astronomical supermajority of which do something crazy or useless on non-training inputs. But the network doesn't have a uniform prior on all possible functions compatible with the training data; there's a bias towards simple functions that "generalize well" in some sense.
There's definitely a risk of goal misgeneralization, where we were mistaken about how behavior in the Before regime generalizes to behavior in the After regime. But if we work hard to test and iterate on our AI's behavior in the settings where we can observe and correct it, isn't there hope of it generalizing to behave well once we can no longer correct it? In analogy, it's not inhumanly hard to design and build machines on land that successfully generalize to functioning on an airplane or in a submarine.
Doomimir: Or in space, or inside the sun? Suppose that two dozen things change between Before and After. Even if you anticipate and try to devise solutions for three-quarters of them, not all of your solutions are going to work on the first critical try, and then there are the problems you failed to anticipate. This isn't the kind of thing human beings can pull off in real life.
Simplicia: Sorry, this is probably a stupid question, but isn't that reasoning similar to the multiple stage fallacy that you've derided elsewhere?
That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people's reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.
As an illustrative example, suppose that the "correct" probability of some proposition $X$ is 0.9. Someone who wants to argue that $X$ is unlikely represents it as a conjunction of two dozen sub-propositions: $X$ is true if and only if $X_1$ is true, and $X_2$ is true given that $X_1$ is true, and $X_3$ is true given that $X_1$ and $X_2$ are true, and so on up to $X_{24}$.
Someone who assumed the sub-propositions were independent and assigned them each a probability of 0.95 would only assign $X$ a probability of $0.95^{24} \approx 0.29$. Indeed, with the assumption of independence of the sub-propositions $X_i$, one would need to assign each an intuitively "extreme"-looking probability of $0.9^{1/24} \approx 0.996$ in order to assign the correct probability of 0.9. Which should be a clue that the $X_i$ aren't really independent, that that choice of decomposition into sub-propositions was a poor one—with respect to the goal of getting the right answer, as contrasted to the goal of tricking respondents into assigning a low probability to $X$.
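(A quick standalone check of that arithmetic, as a snippet outside the dialogue itself:)

```python
# Verify the two numbers in the illustration above.
n_stages = 24
p_correct = 0.9

print(0.95 ** n_stages)             # ~0.29: what "independent, 0.95 per stage" implies
print(p_correct ** (1 / n_stages))  # ~0.996: per-stage probability needed to recover 0.9
```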
So when you posit that two dozen things change between a detectable/correctable-failures regime and a fatal regime, such that the conjunctive probability of not hitting a fatal misgeneralization is tiny, how do I know you're not committing a multiple stage fallacy? Why is that a "correct", non-misleading decomposition into sub-propositions?
In analogy, someone who wanted to argue that it's infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an "ought"/steering problem, not an "is"/prediction problem! One-shotting all of those separate problems isn't something that human beings can do in real life, the argument would go. But of course, the problems aren't independent, and text-to-image generators do exist.
Is there a version of your argument that doesn't depend on the equivalent of, "Suppose there are twenty-four independent things that can go wrong, surely you don't want to bet the world on them each succeeding with probability 0.996"?
Doomimir: You're right: that is a stupid question.
Simplicia: [head down in shame] I know. It's only ... [straightening up] I would like to know the answer, though. [turning to the audience] Do you know?