A flaw in the A.G.I. Ruin Argument

Cole Wyeth

Eliezer's argument that A.G.I. will kill us all has generated a lot of controversy, and also perhaps a bit of despair (possibly exacerbated by his avowed "Death with dignity" strategy). I don't want to discuss whether his choice to frame the situation this way is good or bad psychologically or rhetorically, except to say that I basically agree with the credo "If the iron approaches your face, and you believe it is cool, and it is hot, the Way opposes your calm." However, I think that those of us who tend to plan for worst-case outcomes should also remember that "If the iron approaches your face, and you believe it is hot, and it is cool, the Way opposes your fear." Instead, I will focus on issues with the actual argument.

I believe there is a simple flaw in his reasoning about its epistemological grounding, as discussed here: "when you're fundamentally wrong about rocketry, this does not usually mean your rocket prototype goes exactly where you wanted on the first try while consuming half as much fuel as expected; it means the rocket explodes earlier yet, and not in a way you saw coming, being as wrong as you were."

Eliezer believes that A.I. alignment is likely to be at least as hard as he thinks, not surprisingly easy. This seems fair. However, A.I. itself is likely to be at least as hard as he thinks as well, maybe harder. This effect should tend to shift timelines later. Yet the A.G.I. ruin argument seems much weaker if A.G.I. is 100 years away (not that I think that is likely). That state of affairs seems to leave plenty of time for humanity to "come up with a plan." In particular, recursively self-improving A.I. may be harder than expected. In fact, the difficulty of recursively self-improving A.I. seems to be closely tied to the difficulty of alignment, as noted here days after I resolved to mention it myself (but done much better than I would have, and the idea was probably floating around the LessWrong mindspace long before that). For instance, many of the problems of embedded agency seem to be relevant to both. I am in fact still fairly pessimistic, in the sense that I think A.I. systems that recursively self-improve past human level are probably easier to design than the alignment problem is to solve. One reason is that I suspect only some of the problems of (e.g.) embedded agency need to be solved explicitly to design such systems, and the rest can probably be "black magicked away" by appropriately incentivized learners, a topic I will not discuss in detail because any such research sounds like the exact opposite of a good idea. However, I may be wrong, and it seems slightly easier to be wrong in the direction that building self-improving agents is harder than expected.

Personally, I think prolonged discussion of timelines is overrated by rationalists, but briefly, I view the situation like this:

That is, money buys computing hardware, and optimization algorithms convert compute to "intelligence" in the form of intelligent artifacts (which may themselves be optimizers). Then these intelligent artifacts are hooked up to actuators which let them interact with the world, and convert intelligence into power. Much of the debate over A.I. timelines seems to come down to different models of which conversions are (sub/super) linear. For instance, some people seem to think that compute can't efficiently buy intelligence beyond some point (perhaps Steven Pinker in his debate with Scott Aaronson). Others seem to think that there are limits to the extent to which intelligence buys power (Boaz Barak, the "reformists"). These discussions tend to entangle with the extent to which positive feedback loops are possible.
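To make the sub/super-linear question concrete, here is a toy sketch of the chain described above. It is not a forecast: the power-law form of each conversion, the specific exponents, the starting value, and the assumption that power converts linearly back into money are all hypothetical choices made purely for illustration.

```python
from typing import List, Sequence


def power_from_money(money: float, exponents: Sequence[float]) -> float:
    """Chain the conversions money -> compute -> intelligence -> power.

    Each conversion is modeled as x ** e (an assumption): e < 1 is
    sublinear, e == 1 is linear, e > 1 is superlinear.
    """
    x = money
    for e in exponents:
        x = x ** e
    return x


def simulate(exponents: Sequence[float], rounds: int, start: float = 2.0) -> List[float]:
    """Close the longest loop: all power gained is reinvested as money."""
    money = start
    history = [money]
    for _ in range(rounds):
        # Assumed linear power -> money conversion.
        money += power_from_money(money, exponents)
        history.append(money)
    return history


def growth_ratios(history: Sequence[float]) -> List[float]:
    """Round-over-round growth factors of the money series."""
    return [b / a for a, b in zip(history, history[1:])]


if __name__ == "__main__":
    weak = simulate([0.9, 0.9, 0.9], rounds=10)    # hypothetical exponents
    strong = simulate([1.1, 1.1, 1.1], rounds=10)  # hypothetical exponents
    print("weak growth factors:  ", [round(r, 2) for r in growth_ratios(weak)])
    print("strong growth factors:", [round(r, 2) for r in growth_ratios(strong)])
```

In this toy model the growth factor per round is 1 + x^(k-1), where k is the product of the exponents. When k is below 1 the factor shrinks as the system grows (something like a slow takeoff); when k is above 1 it keeps increasing (a positive feedback loop that accelerates).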

My personal impression is that if intelligence is useful for improving optimization algorithms, if A.I. systems at human level actually choose to improve their optimization algorithms without first needing to solve alignment, and if deep learning (DL) scales effectively with compute instead of hitting a data bottleneck, then we are in trouble (~5 years).

If this form of positive feedback is limited, because human-level A.I.s are not yet smart enough to reliably improve themselves without changing their values, they will still improve their own hardware (Bostrom's argument to the contrary is mostly unpersuasive to me). This may lead to a strange situation in which A.I.s are as smart as humans running at 100x speed but with terrible actuators. Depending on the efficiency of the intelligence/power conversion, the situation could take longer or shorter to spiral out of control, but spiral out of control it will as actuators catch up to the demand (large-scale robot manufacturing). After some years of weirdness and warnings, if left unchecked, 100x will become 10,000x and biological entities will be out of luck (~25 years). It is quite clear that power buys money, so the "longest" feedback loop in the diagram exists, even if its power is degraded by all previous conversions. It seems that this outcome corresponds roughly to all conversions being weak and only the longest feedback loop being relevant, and therefore also pretty weak. This may look like a slow takeoff.

Another plausible outcome is that the performance of deep learning systems tops out (possibly because of limited data) far below human level, and nothing takes off until many embedded agency/alignment-relevant problems are solved (~60 years, heavy-tailed). So, my timeline is trimodal, and it seems that the later outcomes are also more favorable.
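A trimodal timeline can be pictured as a mixture of three distributions, one per scenario. The sketch below is only an illustration: the ~5, ~25, and ~60 year medians come from the text, but the equal weights, the log-normal shape, and the spread parameters are placeholder assumptions, not estimates made anywhere in this post.

```python
import math
import random
import statistics
from typing import List, Tuple

# (label, median years, log-scale spread, weight).
# Medians follow the text; spreads and weights are placeholders.
SCENARIOS: List[Tuple[str, float, float, float]] = [
    ("fast software feedback", 5.0, 0.3, 1 / 3),
    ("hardware-only feedback", 25.0, 0.3, 1 / 3),
    ("deep learning plateaus", 60.0, 0.8, 1 / 3),  # wider spread: heavy tail
]


def sample_timelines(n: int, seed: int = 0) -> List[float]:
    """Draw n timelines (in years) from the three-component mixture."""
    rng = random.Random(seed)
    weights = [w for _, _, _, w in SCENARIOS]
    samples = []
    for _ in range(n):
        _, median, sigma, _ = rng.choices(SCENARIOS, weights=weights, k=1)[0]
        # A log-normal with mu = ln(median) has the given median.
        samples.append(rng.lognormvariate(math.log(median), sigma))
    return samples


if __name__ == "__main__":
    draws = sample_timelines(10_000)
    print("median years:", round(statistics.median(draws), 1))
    print("share beyond 40 years:", sum(d > 40 for d in draws) / len(draws))
```

Changing the weights shifts mass between the three modes, which is one way to see how much the overall picture depends on which scenario is judged most plausible.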

Are you positing that the argument "we only have one try to get it right" is incorrect? Or something else?

Not really. To be clear, I am criticizing the argument Eliezer tends to make. There can be flaws in that argument and we can still be doomed. I am saying his stated confidence is too high because even if alignment is as hard as he thinks, A.I. itself may be harder than he thinks, and this would give us more time to take alignment seriously.

In the second scenario I outlined (say, scenario B) where gains to intelligence feed back into hardware improvements but not drastic software improvements, multiple tries may be possible. On the whole I think that this is not very plausible (1/3 at most), and the other two scenarios look like they only give us one try.

Well, if we only have one try, extra time does not help, unless alignment is only an incremental extra on AI, and not a comparably hard extra effort. If we have multiple tries, yes, there is a chance. I don't think that at this point we have enough of a clue as to how it is likely to go. Certainly LLMs have been a big surprise.

Sure, if somehow you don't think foom is going to happen any year now, you also would think alignment is easy. At this point I'm not sure how you could believe that.

A couple of years later, do you still believe that foom will happen any year now?

"if somehow you don't think foom is going to happen any year now, you also would think alignment is easy"

Not necessarily. Given an expectation that previously somewhat effective alignment techniques stop working at AGI/ASI thresholds, and the no-second-chances issue with getting experimental feedback, alignment doesn't become much easier even if it's decades away. In that hypothetical there is time to build up theory for some chance of finding things that are helpful in advance, but it's not predictably crucial.

I believe you're saying that if foom is more than a few years away, it becomes easy to solve the alignment problem before then. I certainly think it becomes easier.

But the view that "foom more than a few years away -> the alignment problem is easy" is not the one I expressed, which contained, among other highly tentative assertions, "the alignment problem is hard -> foom more than a few years away". The two are opposed in the sense that they have different truth values when alignment is hard. The distinction here is that the chances we will solve the alignment problem depend on the time to takeoff, and are not equivalent to the difficulty of the alignment problem.
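The claim about truth values can be checked mechanically. The sketch below treats both statements as material implications over two booleans, which is a simplification of the tentative, probabilistic claims actually being made; the function and variable names are introduced here for illustration.

```python
from itertools import product
from typing import List, Tuple


def implies(p: bool, q: bool) -> bool:
    """Material implication: p -> q."""
    return (not p) or q


def truth_table() -> List[Tuple[bool, bool, bool, bool]]:
    """Rows of (hard, late, attributed_view, my_view)."""
    rows = []
    for hard, late in product([True, False], repeat=2):
        # "foom more than a few years away -> the alignment problem is easy"
        attributed_view = implies(late, not hard)
        # "the alignment problem is hard -> foom more than a few years away"
        my_view = implies(hard, late)
        rows.append((hard, late, attributed_view, my_view))
    return rows


if __name__ == "__main__":
    print("hard   late   attributed  mine")
    for hard, late, attributed_view, my_view in truth_table():
        print(f"{hard!s:<6} {late!s:<6} {attributed_view!s:<11} {my_view!s}")
```

Whenever `hard` is true, the two implications take opposite values; whenever `hard` is false, both hold trivially.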

So, you mentioned a causal link between time to foom and chances of solving alignment, which I agree on, but I am also asserting a "causal link" between difficulty of the alignment problem and time to foom (though the counterfactuals here may not be as well defined).

As for how you could possibly believe foom is not going to happen any year now: my opinion depends on precisely what you mean by foom and by "any year now", but I think I outlined scenarios where it takes ~25 years and ~60 years. Do you have a reason to think both of those are unlikely? It seems to me that hard takeoff within ~5 years relies on the assumptions I mentioned about recursive algorithmic improvement taking place near human level, and seems plausible, but I am not confident it will happen. How surprised will you be if foom doesn't happen within 5 years?

I do expect the next 10 years to be pretty strange, but under the assumptions of the ~60 year scenario the status quo may not be completely upset that soon.