The “multiple stage fallacy fallacy” is the fallacious idea that equations like
$$P(A \wedge B \wedge C \wedge D) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \wedge B) \cdot P(D \mid A \wedge B \wedge C)$$
are false, when in fact they are true. :-P
I think Nate here & Eliezer here are pointing to something real, but the problem is not multiple stages per se but rather (1) “treating stages as required when in fact they’re optional” and/or (2) “failing to properly condition on the conditions and as a result giving underconfident numbers”. For example, if A & B & C have all already come true in some possible universe, then that’s a universe where maybe you have learned something important and updated your beliefs, and you need to imagine yourself in that universe before you try to evaluate $P(D \mid A \wedge B \wedge C)$.
Of course, that paragraph is just parroting what Eliezer & Nate wrote, if you read what they wrote. But I think other people on LW have too often skipped over the text and just latched onto the name “multiple stages fallacy” instead of drilling down to the actual mistake.
In the case at hand, I don’t have much of an opinion in the absence of more details about the AI training approach etc., but here are a couple of general comments.
If an AI development team notices Problem A and fixes it, and then notices Problem B and fixes it, and then notices Problem C and fixes it, we should expect that it’s less likely, not more likely, that this same team will preempt Problem D before Problem D actually occurs.
Conversely, if the team has a track record of preempting every problem before it arises (when the problems are low-stakes), then we can have incrementally more hope that they will also preempt high-stakes problems.
Likewise, if there simply are no low-stakes problems to preempt or respond to, because it’s a kind of system that just automatically by its nature has no problems in the first place, then we can feel generically incrementally better about there not being high-stakes problems.
Those comments are all generic, and readers are now free to argue with each other about how they apply to present and future AI. :)
It's kinda covered by (1) and (2) if you apply them right, but one view on how this plays out that I've found helpful is: having model uncertainty on many individual steps predictably makes the final output look low-confidence. If you can break something into 10 steps that you multiply together, and feel uncomfortable assigning more than 0.8 to any individual guess, you're always going to end up with a low final answer.
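As a quick numerical illustration (toy numbers of my own, not anyone's actual forecast): even if each step is really a near-certainty, capping every factor at 0.8 out of modesty forces the product down by itself.

```python
# Toy illustration: ten steps that are each "really" ~0.98 likely,
# multiplied out by a forecaster who won't assign more than 0.8 to any one step.
capped = 0.8 ** 10
honest = 0.98 ** 10
print(f"capped at 0.8 per step: {capped:.3f}")  # ~0.107
print(f"0.98 per step:          {honest:.3f}")  # ~0.817
```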
The neural tangent kernel[1] provides an intuitive story for how neural networks generalize: a gradient update on one datapoint will shift the network's outputs on similar datapoints (similar as measured by the hidden activations of the NN) in a similar way.
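To make that story concrete, here is a minimal numerical sketch of the empirical NTK on a toy one-hidden-layer network (the architecture, sizes, learning rate, and the "nearby vs. far" test points are all made up for illustration). To first order, one SGD step on a training point moves the output at any other point in proportion to the kernel value, i.e. the inner product of the parameter-gradients at the two points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-hidden-layer net: f(x) = w2 . tanh(W1 @ x + b1); sizes are arbitrary.
d_in, d_hid = 3, 256
W1 = rng.normal(size=(d_hid, d_in)) / np.sqrt(d_in)
b1 = 0.1 * rng.normal(size=d_hid)
w2 = rng.normal(size=d_hid) / np.sqrt(d_hid)

def forward(x):
    return w2 @ np.tanh(W1 @ x + b1)

def param_grad(x):
    """Gradient of the scalar output f(x) w.r.t. all parameters, flattened."""
    h = np.tanh(W1 @ x + b1)
    dh = (1.0 - h**2) * w2                                    # df/d(pre-activation)
    return np.concatenate([np.outer(dh, x).ravel(), dh, h])   # dW1, db1, dw2

def ntk(x1, x2):
    """Empirical NTK: inner product of the parameter-gradients at two inputs."""
    return param_grad(x1) @ param_grad(x2)

x_train = rng.normal(size=d_in)
x_near = x_train + 0.05 * rng.normal(size=d_in)   # a "similar" datapoint
x_far = rng.normal(size=d_in)                     # an unrelated datapoint

before_near, before_far = forward(x_near), forward(x_far)
k_near, k_far = ntk(x_near, x_train), ntk(x_far, x_train)

# One SGD step on the squared loss 0.5 * (f(x_train) - y)**2.
y, lr = 1.0, 1e-3
err = forward(x_train) - y
g = param_grad(x_train)
W1 -= lr * err * g[:W1.size].reshape(W1.shape)
b1 -= lr * err * g[W1.size:W1.size + d_hid]
w2 -= lr * err * g[W1.size + d_hid:]

# To first order the update moves f at any point x' by -lr * err * ntk(x', x_train),
# so the nearby point (whose kernel value is close to the training point's own)
# moves almost as much as the training point, while an unrelated point typically moves less.
print("near: actual", forward(x_near) - before_near, "predicted", -lr * err * k_near)
print("far:  actual", forward(x_far) - before_far, "predicted", -lr * err * k_far)
```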
The vast majority of LLM capabilities still arise from mimicking human choices in particular circumstances. This gives you a substantial amount of alignment "for free" (since you don't have to worry that the LLM will grab excess power in circumstances where a human wouldn't), but it also limits you to ~human-level capabilities.
"Gradualism" can mean that fundamentally novel methods only make incremental progress on outcomes, but I think in most people's imagination it instead means that people will keep the human-mimicking capabilities generator as the source of progress, mainly focusing on scaling it up rather than deriving capabilities by other means.
[1] Maybe I should be cautious about invoking this without linking to a comprehensible explanation of what it means, since most resources on it are kind of involved...
> In analogy, someone who wanted to argue that it's infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an "ought"/steering problem, not an "is"/prediction problem! One-shotting all of those separate problems isn't something that human beings can do in real life, the argument would go. But of course, the problems aren't independent, and text-to-image generators do exist.
Isn't part of the deal here that we didn't one-shot image generation, though?
The first image generators were crazy; we slowly iterated on them. And image generation is "easy" because, unlike superintelligence or even self-driving cars or regular ol' production code, nothing particularly bad happens if a given image is bad.
That said, FYI I was kind of enlightened by this phrasing:
> That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people's reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.
I'd been feeling sus about why the multiple stage fallacy was even a fallacy at all, apart from "somehow in practice people fuck it up." Multiplying probabilities together is... like, how else are you supposed to do any kind of sophisticated reasoning?
But, "because people are scared of (or bad at) assigning extreme probabilities" feels like it explains it to me.
I think the larger effect is treating the probabilities as independent when they're not.
Suppose I have a jar of jelly beans, which are either all red, all green, or all blue. You want to know what the probability of drawing 100 blue jelly beans is. Is it $(1/3)^{100}$? No, of course not. That's what you get if you multiply 1/3 by itself 100 times. But you should condition on your results as you go: $P(\text{jelly}_1 = \text{blue}) \cdot P(\text{jelly}_2 = \text{blue} \mid \text{jelly}_1 = \text{blue}) \cdot P(\text{jelly}_3 = \text{blue} \mid \text{jelly}_1 = \text{blue}, \text{jelly}_2 = \text{blue}) \cdots$
Every factor but the first is 1, so the probability is $1/3$.
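A quick sketch of the same arithmetic in code (same toy jar setup as above):

```python
# The jar is all red, all green, or all blue, each with probability 1/3.
# Naive version: treat the 100 draws as independent.
naive = (1/3) ** 100              # ~1.9e-48

# Conditioned version: P(blue_1) * P(blue_2 | blue_1) * ... * P(blue_100 | ...).
# Once the first bean is blue, the jar must be the all-blue one,
# so every later factor is 1.
conditioned = (1/3) * 1.0 ** 99   # = 1/3

print(naive, conditioned)
```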
There are two different issues with "the first critical try" (the After regime), where misalignment is lethal. First, maybe alignment is sufficiently solved, and so when you enter After, that's why it doesn't kill you. But second, maybe After never arrives.
Gradualist arguments press both issues, not just alignment in the After regime. Sufficient control makes increasingly capable AIs non-lethal if misaligned, which means that an AI that would bring about the After regime today wouldn't do so in a future where better countermeasures (that are not about alignment) are in place. Which is to say this particular AI won't enter the After regime yet, because the world is sufficiently different and this AI's capabilities are now insufficient for lethality; an even more capable AI would be necessary for that.
This is different from an ASI Pause delaying the After regime until ASI-grade alignment is solved, because the level of capabilities that counts as After keeps changing. Instead of delaying ASI at a fixed level of capabilities until alignment is solved, After is being pushed into the future by increasing levels of control that make increasingly capable AIs non-critical. As a result, After never happens at all, instead of only happening once alignment at a relevant level is sufficiently solved.
(Of course the feasibility of ASI-grade control is as flimsy as the feasibility of ASI-grade alignment, when working on a capabilities schedule without an AGI/ASI Pause, not to mention gradual disempowerment in the gradualist regime without an AGI Pause. But the argument is substantially different, and a proponent of gradualist development of ASI-grade control might feel that there is no fixed After, and maybe that After never actually arrives even as capabilities keep increasing. The arguments against feasibility of gradualist development of ASI-grade alignment on the other hand feel like they are positing a fixed After whose arrival remains inevitable at some point, which doesn't acknowledge the framing from gradualist arguments about development of ASI-grade control.)
Doomimir: The possibility of AGI being developed gradually doesn't obviate the problem of the "first critical try": the vast hypermajority of AGIs that seem aligned in the "Before" regime when they're weaker than humans, will still want to kill the humans "After" they're stronger and the misalignment can no longer be "corrected". The speed of the transition between those regimes doesn't matter. The problem still exists and is still fatal whether it takes a day or a decade.
Simplicia: I agree that the risk you describe is real, but I don't understand why you're so sure the risk is high. As we've discussed previously, the surprising fact that deep learning works at all comes down to generalization. In principle, an astronomical number of functions are compatible with the training data, the astronomical supermajority of which do something crazy or useless on non-training inputs. But the network doesn't have a uniform prior on all possible functions compatible with the training data; there's a bias towards simple functions that "generalize well" in some sense.
There's definitely a risk of goal misgeneralization, where we were mistaken about how behavior in the Before regime generalizes to behavior in the After regime. But if we work hard to test and iterate on our AI's behavior in the settings where we can observe and correct it, isn't there hope of it generalizing to behave well once we can no longer correct it? In analogy, it's not inhumanly hard to design and build machines on land that successfully generalize to functioning on an airplane or in a submarine.
Doomimir: Or in space, or inside the sun? Suppose that two dozen things change between Before and After. Even if you anticipate and try to devise solutions for three-quarters of them, not all of your solutions are going to work on the first critical try, and then there are the problems you failed to anticipate. This isn't the kind of thing human beings can pull off in real life.
Simplicia: Sorry, this is probably a stupid question, but isn't that reasoning similar to the multiple stage fallacy that you've derided elsewhere?
That is, in the multiple stage fallacy, someone who wishes to portray a proposition as unlikely can prey on people's reluctance to assign extreme probabilities by spuriously representing the proposition as a conjunction of sub-propositions that all need to be true.
As an illustrative example, suppose that the "correct" probability of some proposition $X$ is 0.9. Someone who wants to argue that $X$ is unlikely represents it as a conjunction of two dozen sub-propositions: $X$ is true if and only if $X_1$ is true, and $X_2$ is true given that $X_1$ is true, and $X_3$ is true given that $X_1$ and $X_2$ are true, and so on up to $X_{24}$.
Someone who assumed the sub-propositions were independent and assigned them each a probability of 0.95 would only assign $X$ a probability of $0.95^{24} \approx 0.29$. Indeed, with the assumption of independence of the sub-propositions $X_i$, one would need to assign each an intuitively "extreme"-looking probability of $0.9^{1/24} \approx 0.996$ in order to assign the correct probability of 0.9. Which should be a clue that the $X_i$ aren't really independent, that that choice of decomposition into sub-propositions was a poor one—with respect to the goal of getting the right answer, as contrasted to the goal of tricking respondents into assigning a low probability to $X$.
So when you posit that two dozen things change between a detectable/correctable-failures regime and a fatal regime, such that the conjunctive probability of not hitting a fatal misgeneralization is tiny, how do I know you're not committing a multiple stage fallacy? Why is that a "correct", non-misleading decomposition into sub-propositions?
In analogy, someone who wanted to argue that it's infeasible to build text-to-image generative models that make art that humans enjoy, could partition their prediction of failure into disjunctive failure modes: the model has to generalize what hands look like; it has to generalize what birds look like. It has to generalize which compositions and color combinations are pleasing—which is arguably an "ought"/steering problem, not an "is"/prediction problem! One-shotting all of those separate problems isn't something that human beings can do in real life, the argument would go. But of course, the problems aren't independent, and text-to-image generators do exist.
Is there a version of your argument that doesn't depend on the equivalent of, "Suppose there are twenty-four independent things that can go wrong, surely you don't want to bet the world on them each succeeding with probability 0.996"?
Doomimir: You're right: that is a stupid question.
Simplicia: [head down in shame] I know. It's only ... [straightening up] I would like to know the answer, though. [turning to the audience] Do you know?