Summary

Arguing about the conjunctivity vs disjunctivity of AI doom seems potentially unhelpful, as it may distract from crucial object-level questions about the probabilities of particular conjuncts/disjuncts. However, I argue that if we are to use this frame, then we should consider the risk from any particular AGI project to be conjunctive and relatively unlikely, but consider the risk from all AGI projects over time to be disjunctive and thus relatively more likely. If this framing is correct, then the sub-problems of scalable alignment (within technical alignment) and of limiting the proliferation of AGIs (within alignment strategy/policy) seem particularly important.

Epistemic Status: Speculative

Introduction

One common sub-topic in discussions of p(doom) is how "conjunctive" vs "disjunctive" AI risk is. To a large extent, I think that this is an unhelpful frame, because any event can be expressed as an arbitrary number of disjunctions and/or conjunctions. Thus, while "conjunctive things are unlikely" is sometimes a good heuristic, it can also distract from crucial object-level probability assessments. In particular, when it is unclear how an event should be logically decomposed such that the constituent events have "reasonable" probabilities, I don't think it makes much sense to have the meta-level debate about whether the event is conjunctive or disjunctive, as opposed to arguing about the probabilities of the particular conjuncts/disjuncts. For example, if you decompose an event into 10 conjuncts which each have p = .99, it is a mistake to infer that the event is unlikely from the fact that it is conjunctive, and likewise for disjunctive events composed of disjuncts with low probabilities.
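To make that last example concrete, here is a minimal sketch of the arithmetic, treating the conjuncts/disjuncts as independent purely for simplicity:

```python
# Illustrative only: independence is assumed for simplicity.
p_conjunction = 0.99 ** 10            # 10 conjuncts at p = .99 -> ~0.90: still likely
p_disjunction = 1 - (1 - 0.01) ** 10  # 10 disjuncts at p = .01 -> ~0.10: still unlikely
print(f"conjunction: {p_conjunction:.3f}, disjunction: {p_disjunction:.3f}")
```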

For example, in his post on this subject, Nate Soares argues that AI doom is disjunctive on the grounds that alignment success is conjunctive, and then provides a conjunctive breakdown of alignment success. I think many of the conjuncts he proposes for alignment success seem “reasonable” in the aforementioned sense (Nate definitely assigns lower probability to them than I do, but I agree that they are far from certain and somewhat uncorrelated with each other). However, others seem quite likely, particularly conditional on the other conjuncts resolving in an optimistic way. For example, “The world needs to admit of an AGI deployment strategy (compatible with realistic alignable-capabilities levels for early systems) that prevents the world from being destroyed if executed” seems >95% likely to me conditional on the other success conditions, and “There need to be people who think of themselves as working on technical alignment, whose work is integrated with AGI development and is a central input into how AGI is developed and deployed” seems to have basically already happened. As a result, inferring that alignment success is unlikely from this conjunctive breakdown seems unwarranted.

AI Doom is Both Conjunctive And Disjunctive

However, to the extent that this is a helpful frame, I think the best answer is that the risk from any given AGI project is conjunctive, but that, given that we will likely develop a series of increasingly powerful AGIs, the risk across all of those projects is disjunctive.

I think that the risk from any given AGI project is conjunctive because I think you can break down the event of AI doom from any given project into conjuncts whose probabilities are not all extremely high (>.8), such that their conjunction is relatively unlikely. It seems to me that, for any given AGI project, at least four things need to occur for the AI to directly cause doom:

  1. The AGI is agentic/consequentialist in such a way that instrumental convergence + misalignment leads it to have incentives to take over. 
  2. We fail to align the AGI. 
  3. We allow the AI to get access to real-world resources/allow it to escape its "box" despite it being misaligned (this may happen for some AIs by default, in the sense that we might never "box" them in the first place, in which case this condition is trivial). 
  4. The AI is powerful enough that given (3) it could take over/cause doom (or do so in conjunction with the already existing AI systems in some kind of multipolar failure).

Off the top of my head, for the first AGI project, I would assign a probability of .35 to the first claim, .25 to the second conditional on the first, .75 to the third conditional on the first two, and .8 to the fourth conditional on the first three, leading to a total probability of doom from the first AGI of 5.25%. Thus, under this model, doom due to the first AGI is "conjunctive"; none of the individual steps required for it are implausible, but their conjunction makes the event itself relatively unlikely. Moreover, I don't think many risk scenarios are missed by this breakdown. Unless non-goal-directed systems can directly cause doom, goal-directed but aligned systems will cause doom, or goal-directed and misaligned systems will cause doom without ever getting access to substantial real-world resources (including computers other than the one they are trained on, humans via manipulation, etc.; essentially, this would require them to be able to "directly" cause doom from the outputs they are intended to be able to access during training), it seems like all of these conjuncts are necessary.
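A minimal sketch of that calculation, using the probabilities just given (the variable names are mine):

```python
# Conditional probabilities for the four conditions above,
# each conditional on the previous ones holding.
p_incentive_to_take_over = 0.35   # (1) agentic/consequentialist + misaligned incentives
p_alignment_failure = 0.25        # (2) we fail to align it, given (1)
p_gets_real_world_access = 0.75   # (3) it escapes its "box"/gets resources, given (1)-(2)
p_powerful_enough = 0.80          # (4) it can actually take over, given (1)-(3)

p_doom_first_agi = (p_incentive_to_take_over
                    * p_alignment_failure
                    * p_gets_real_world_access
                    * p_powerful_enough)
print(f"{p_doom_first_agi:.2%}")  # 5.25%
```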

I also think it is reasonable to assign substantially different probabilities to these events than I do (particularly to (2), as I don't think we have much evidence on alignment difficulty yet). For example, I could imagine updating on a fairly modest amount of evidence to assign probabilities of .65, .85, .85, and .95 to each of the events respectively (for the first AGI). However, this still leads to a p(doom) of <50% for the first AGI project, meaning that even under this much more pessimistic model, we are still more likely than not to survive it. Moreover, I think very high confidence that any of the conjuncts will happen (particularly the first three) based on entirely speculative/theoretical arguments is unwarranted. Without responding to each of these arguments, it seems to me that predicting the future is too hard, when we don't have precise gears-level models or very good reference classes, to justify assigning super confident probabilities in cases like these.
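Plugging those more pessimistic numbers into the same conjunction:

```python
# The more pessimistic (hypothetical) probabilities for the same four conditions;
# the conjunction still comes out below 50%.
p_doom_first_agi_pessimistic = 0.65 * 0.85 * 0.85 * 0.95
print(f"{p_doom_first_agi_pessimistic:.1%}")  # ~44.6%
```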

However, even if no particular AGI project poses a >50% risk, we need to ensure that at least one of the aforementioned conditions fails for every AI that we ever develop in order to avoid doom. Over time, I expect the probabilities of the different conditions to change in a way which makes their conjunction substantially less likely: mostly large reductions in (2) as we get better at alignment through practice and the help of aligned AGIs, and in (3) as we get better at monitoring for the same reasons, though I expect slight increases in (1); I am uncertain about (4), though I would weakly bet on it decreasing due to our being able to fight off misaligned AGIs with aligned ones. However, there will likely be a very large number of AGIs built over time, and though their outcomes are highly correlated, there will be many chances to fuck up. Thus, my overall p(doom) is substantially higher than 5.25% (~25%).
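A sketch of how the risk aggregates across projects. The number of projects, the rate at which per-project risk declines, and the independence assumption are all made up for illustration (real project outcomes are correlated, as noted above), so this shows the shape of the calculation rather than deriving the ~25% figure:

```python
# Hypothetical per-project doom probabilities: starting at 5.25% for the first
# AGI project and declining by a made-up factor of 0.82 per subsequent project
# as alignment and monitoring improve.
per_project_risk = [0.0525 * 0.82 ** k for k in range(20)]

p_survive_all = 1.0
for p in per_project_risk:
    p_survive_all *= 1 - p  # treating project outcomes as independent for simplicity

p_doom_overall = 1 - p_survive_all
print(f"{p_doom_overall:.0%}")  # ~25% under these made-up assumptions
```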

Implications

In addition to deconfusing the conjunctive/disjunctive question, I also think that this breakdown reframes AI risk in a helpful way, similar to how Stephen Casper frames the problem here. It is sometimes assumed that there will be a strong capabilities discontinuity around AGI such that there will be one pivotal alignment project (the “first AGI”) which will lead either to doom or to utopia; indeed, until recently, this was my main model of AI risk. However, considering the broader reference class of threats from technology, as discussed in Stephen’s post, as well as recent evidence in favor of slower/smoother takeoffs, has made me think this model is less likely. Instead, I think we will need to succeed in aligning a series of AGIs, and that while each attempt is likely to go well, the probability of failure does not drop to 0 extremely rapidly.

This framing of AI risk has several high-level implications for AI alignment. Firstly, it indicates that the sub-problem of scalable alignment is particularly important. This is because if we are relatively unlikely to die from the first AGI but our overall odds of doom are substantially higher, then most doom comes from situations where our alignment techniques work at first but fail for future, more powerful AGIs. Secondly, it suggests that it will be crucial to limit the proliferation of the technology and resources needed to create AGIs. This is for similar reasons: if we are relatively unlikely to die from the first AGI, and if, conditional on succeeding at aligning the first AGI, the lab which aligned it has substantially increased odds of being able to align future AGIs (due to both being good at aligning AGI and having AGI assistance for future attempts), then most doom scenarios involve actors beyond that first lab failing to align their AI systems. Moreover, actors other than the research labs which specialize in building AGIs seem likely to be substantially less cautious, making it even more important to ensure that they do not get the opportunity.


 
