Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Subtitle: A partial defense of high-confidence AGI doom predictions.

Introduction

Consider these two kinds of accident scenarios:

  1. In a default-success scenario, accidents are rare. For example, modern aviation is very safe thanks to decades of engineering efforts and a safety culture (e.g. the widespread use of checklists). When something goes wrong, it is often due to multiple independent failures that combine to cause a disaster (e.g. bad weather + communication failures + pilot not following checklist correctly).
  2. In a default-failure scenario, accidents are the norm. For example, when I write a program to do something I haven’t done many times already, it usually fails the first time I try it. It then goes on to fail the second time and the third time as well. Here, failure on the first try is overdetermined―even if I fix the first bug, the second bug is still, independently, enough to cause the program to crash. This is typical in software engineering, and it can take many iterations and tests to move into the default-success regime.

See also: conjunctive vs disjunctive risk scenarios.
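The difference between the two regimes can be sketched numerically. This is a minimal illustration rather than a model of any real system: it assumes the individual failures are independent, and every probability below is a made-up number. In the default-success case an accident needs all the failures to coincide (conjunctive); in the default-failure case any single bug is enough (disjunctive).

```python
from math import prod


def p_accident_default_success(failure_probs):
    """Accident only if ALL independent failures coincide (conjunctive)."""
    return prod(failure_probs)


def p_accident_default_failure(bug_probs):
    """Accident if ANY one independent bug is present (disjunctive)."""
    return 1 - prod(1 - p for p in bug_probs)


# Hypothetical numbers, chosen only for illustration.
aviation = p_accident_default_success([0.01, 0.01, 0.01])
new_program = p_accident_default_failure([0.5, 0.5, 0.5])
after_one_fix = p_accident_default_failure([0.5, 0.5])

print(f"default-success accident probability: {aviation:.6f}")
print(f"default-failure accident probability: {new_program:.3f}")
print(f"default-failure after fixing one bug: {after_one_fix:.3f}")
```

Under these assumed numbers, fixing one of three bugs still leaves failure more likely than not, which is the sense in which failure is overdetermined.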

Default-success scenarios include most engineering tasks that we have lots of experience with and know how to do well: building bridges, building skyscrapers, etc. Default-failure scenarios, as far as I can tell, come in two kinds: scenarios in which we’re trying to do something for the first time (rocket test launches, prototypes, new technologies) and scenarios in which there is a competent adversary that is trying to break the system, as in computer security.[1]

Predictions on AGI risk

In the following, I use P(doom) to refer to the probability of an AGI takeover and / or human extinction due to the development of AGI.

I often encounter the following argument against predictions of AGI catastrophes:

Alice: We seem to be on track to build an AGI smarter than humans. We don’t know how to solve the technical problem of building an AGI we can control, or the political problem of convincing people to not build AGI. Every plausible scenario I’ve ever thought or heard of leads to AGI takeover. In my estimate, P(doom) is [high number].

Bob: I disagree. It’s overconfident to estimate high P(doom). Humans are usually bad at predicting the future, especially when it comes to novel technologies like AGI. When you account for how uncertain your predictions are, your estimate should be at most [low number].

I'm being vague about the numbers because I've seen Bob's argument made in many different situations. In one recent conversation I witnessed, the Bob-Alice split was P(doom) 0.5% vs. ~10%, and in another discussion it was 10% vs. 90%.

My main claim is that Alice and Bob don’t actually disagree about how uncertain or hard to predict the future is―instead, they disagree about to what degree AGI risk is default-success vs. default-failure. If AGI risk is (mostly) default-failure, then uncertainty is a reason for pessimism rather than optimism, and Alice is right to predict failure.

In this sense I think Bob is missing the point. Bob claims that Alice is not sufficiently uncertain about her AI predictions, or has not integrated her uncertainty into her estimate well enough. This is not necessarily true; it may just be that Alice’s uncertainty about her reasoning doesn't make her much more optimistic.

Instead of trying to refute Alice from general principles, I think Bob should instead point to concrete reasons for optimism (for example, Bob could say “for reasons A, B, and C it is likely that we can coordinate on not building AGI for the next 40 years and solve alignment in the meantime”).

Uncertainty does not (necessarily) mean you should be more optimistic

Many people are skeptical of the ‘default-failure’ frame, so I'll give a bit more color here by listing some reasons why I think Bob's argument is wrong / unproductive. I won’t go into detail about why AGI risk specifically might be a default-failure scenario; you can find a summary of those arguments in Nate Soares’ post on why AGI ruin is likely.

  1. It’s true that the future is often hard to predict; for example, experts often fail to predict technological developments. This is not a reason for optimism. It would be kind of weird if it was! Humans are generally bad at predicting the future, especially for technological progress, and this is bad news for AI safety.
    1. In particular: if all the AI researchers are uncertain about what will happen, that is a bad sign much in the same way that it would be a bad sign if none of your security engineers understood the system they are supposed to secure.
    2. Analogy: if I’m in charge of software security for a company, and my impression is that the system is almost certainly insecure, it is not a good argument to say “well you don’t completely understand the system, so you might be wrong!” ― I may be wrong, but being wrong does not bode well for our security.
  2. To believe P(doom) is high, all you really need to be convinced of is that the default outcome for messing up superhuman AGI is human extinction, and that we’re not prepared. Our understanding here is incomplete but still relatively good compared to details that are harder to predict, e.g. when exactly AGI will arrive or what early forms of AGI will look like.
  3. It is not always wrong to make high-confidence disaster predictions. For example, people saying “covid will be a disaster with high (~90%) probability” in February 2020 were predictably correct, even though covid was a very novel situation. There was a lot of uncertainty, and the people who predicted disaster usually got the details wrong like everyone else, but the overall picture was still correct because the details didn’t matter much.
  4. A confidence of 90% is not actually much harder to achieve than 10%, relative to the baseline extinction risk for a new technology which is close to 0%. An estimate of P(doom) = 30% already leans very heavily on your inside view of the risks involved; you don’t need to trust your reasoning all that much more to estimate 90% instead.
  5. Put differently: there’s no reason in particular why Bob's uncertainty argument should cap your confidence at ~80%, rather than 1% or 0.1%.
    1. (It seems totally reasonable to me for a first reaction to AI X-risk to be “Eh I don’t know, it’s an interesting idea and I’ll think more on it, but it does seem pretty crazy; if I had to estimate P(doom) right now I would say ~0.1%, though I would prefer not to give a number at all.” Followed, to be clear, by rapid updates in favor of high p(doom), though not necessarily 90%; I think 90% makes sense for people who have slammed their head against the difficulties involved, and noticed a pattern where the wall they’re slamming their heads against is pretty hard and doesn’t have visible weak spots; but otherwise you wouldn’t necessarily be that pessimistic.)
  6. More generally: estimates around 90% aren’t all that “confident”. If you’re well-calibrated, changing your mind about something that you estimate to be 90% likely is something that happens all the time. So P(X) = 90% means “I expect X to happen, though I’m happy to change my mind and in fact regularly do change my mind about claims like this”.
  7. It makes sense to be uncertain about your beliefs, and about whether you thought of all the relevant things (usually you didn’t). Rather than be generically uncertain about everything, it’s usually better to be uncertain about specific parts of your model.
    1. For example: I’m uncertain about the behavior and capability profile of the first AI that surpasses humans in scientific research. This makes me more pessimistic about alignment relative to a baseline where I was certain, because any strategy that depends on specific assumptions about the capabilities of this AI is unlikely to work.
    2. For a second example: I think there probably won’t be any international ban or regulation on large training runs that lengthens timelines by >10 years, but I’m pretty uncertain. This makes me more optimistic relative to a baseline where I was certain governments would do nothing.
  8. Put differently: most of your uncertainty about beliefs should be part of your model, not some external thing that magically pushes all your beliefs towards 50% or 0% or 100%.
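One way to make the claim about 10%, 30% and 90% concrete is to measure belief changes in bits of evidence, meaning the base-2 log of the odds ratio between two probabilities. This is a sketch under assumptions that are not part of the argument above: the choice of log-odds as the yardstick is mine, and the 0.1% baseline is a hypothetical stand-in for "close to 0%".

```python
from math import log2


def odds(p):
    """Convert a probability into odds in favour."""
    return p / (1 - p)


def bits_of_evidence(prior, posterior):
    """Base-2 log of the odds ratio needed to move from prior to posterior."""
    return log2(odds(posterior) / odds(prior))


baseline = 0.001  # hypothetical baseline risk for a new technology

to_10 = bits_of_evidence(baseline, 0.10)
to_30 = bits_of_evidence(baseline, 0.30)
to_90 = bits_of_evidence(baseline, 0.90)

print(f"baseline -> 10%: {to_10:.2f} bits")
print(f"baseline -> 30%: {to_30:.2f} bits")
print(f"baseline -> 90%: {to_90:.2f} bits")
print(f"30% -> 90%:      {to_90 - to_30:.2f} bits")
```

On this measure, and with this assumed baseline, the step from 30% to 90% takes about half as much evidence as it took to reach 30% in the first place. A lower baseline makes the last step look smaller still by comparison.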

Some things I’m not saying

This part is me hedging my claims. Feel free to skip if that seems like a boring thing to read.

I don’t personally estimate P(doom) above 90%.

I’m also not saying there are no reasons to be optimistic. I’m claiming that reasons for optimism should usually be concrete arguments about possible ways to avoid doom. For example, Paul Christiano argues for a somewhat lower than 90% P(doom) here, and I think the general shape of his argument makes sense, in contrast to Bob’s above.

I do think there is a correct version of the argument that, if your model says P(outcome) = 0.99, model uncertainty will generally be a reason to update downwards. I think people already take that into account when stating high P(doom) estimates. Here’s a sketch of a plausible line of reasoning (summarized and not my numbers, but I do have similar reasoning, and I don’t think the numbers are crazy):

  1. Almost every time I imagine a concrete scenario for how AGI development might go, that leads to an outcome where humans go extinct.
  2. I can imagine some ways in which things go well, but they seem pretty fanciful; for example a sudden international treaty that forbids large training runs and successfully enforces this. (I do expect there’ll be other government efforts, but I don’t expect those to change things much for the better). So my “within-model” prediction is p(doom) = 0.99.
  3. My model is almost certainly wrong. Sadly, for most scenarios I can imagine, being wrong would only make things worse. I’m literally a safety researcher; me being totally wrong about e.g. what the first AGI looks like is not a good sign for safety (and I don’t expect other safety researchers to have better models). Almost all surprises are bad.
    1. Analogy: if I’m in charge of software security for a company, and my impression is that the system is almost certainly insecure, it is not a good argument to say “well you don’t completely understand the system, so you might be wrong!” ― I may be wrong, but being wrong does not bode well for our security.
  4. That said: while technical surprises are probably bad, there are other kinds of positive surprises we could get, for example: more progress on AI safety than expected, better interpretability methods, more uptake of AI risk concerns by the broader ML community, more government action on regulating AI.
    1. In fact, there are some kinds of cumulative surprises that could add up to save us; as an example, enough regulation of AI could lead to ~10y longer timelines; more progress than expected in interpretability could lead to more compelling demonstrations of misalignment; more uptake of AI risk by the broader scientific community might lead to more safety progress and an overall more careful approach to AGI.
    2. Note that this is not an update made from pure uncertainty―there is a concrete story here about how exactly surprises might actually be helpful, rather than bad. It’s not a particularly great story either; it needs many things to go better than expected.
  5. Now, that particular story is not likely at all. But it seems like there are many stories in that general category, such that the total likelihood of a good surprise adds up to 10%.
    1. Note the basic expectation of ‘surprises are often bad’ still applies. Not knowing how governments or society will react to AI is hardly helpful for the people who are currently trying to get governments or society to react in a useful way.
  6. So my overall, all-things-considered p(doom) is 90%, mostly due to a kind of sketchy downwards-update due to model uncertainty, without which the estimate would be around 99%.
  7. It’s debatable how large the downwards update here should be―it could reasonably be more or less than 10%, and it’s plausible that we’re in the kind of domain where small quantified probability updates aren’t very useful at all.
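The arithmetic in the list above can be written out as a small mixture calculation. This is one possible reading of the sketch, not a definitive reconstruction: it assumes that a good surprise fully averts doom (`p_doom_given_good_surprise = 0.0`), which is a simplification, and the function and parameter names are hypothetical.

```python
def overall_p_doom(p_within_model, p_good_surprise, p_doom_given_good_surprise=0.0):
    """Mix the within-model estimate with a 'good surprise' branch."""
    return ((1 - p_good_surprise) * p_within_model
            + p_good_surprise * p_doom_given_good_surprise)


# Numbers taken from the list: within-model 0.99, good surprises totalling 10%.
estimate = overall_p_doom(p_within_model=0.99, p_good_surprise=0.10)

# Sensitivity to the size of the downwards update (hypothetical alternatives).
sweep = {p: overall_p_doom(0.99, p) for p in (0.05, 0.10, 0.20)}

print(f"all-things-considered estimate: {estimate:.3f}")
for p_surprise, p_doom in sweep.items():
    print(f"good surprise {p_surprise:.0%} -> P(doom) {p_doom:.3f}")
```

With the numbers from the list this gives 0.891, which rounds to the stated 90%. Halving or doubling the assumed chance of a good surprise moves the result to roughly 94% or 79%, so the conclusion is sensitive to the size of the update but stays high across that range.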

I don’t mean to say that the reasoning here is the only reasonable version out there. It depends a lot on how likely you think various definitely-useful surprises are, like long timelines to AGI and slow progress after proto-AGI. But I do think it is wrong to call high P(doom) estimates overconfident without further, more detailed criticism.

Finally, I haven’t given an explicit argument for AGI risk; there’s a lot of that elsewhere.

  1. ^

    Note how AGI somehow manages to satisfy both of these criteria at once.


11 comments

I don't see how you get default failure without a model. In fact, I don’t see how you get there without the standard model, where an accident means you get a super intelligence with a random goal from an unfriendly prior - but that’s precisely the model that is being contested!

I can kiiinda see default 50-50 as "model free", though I'm not sure if I buy it.

It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).

If I can make my point a bit more carefully: I don’t think this post successfully surfaces the bits of your model that hypothetical Bob doubts. The claim that “historical accidents are a good reference class for existential catastrophe” is the primary claim at issue. If they were a good reference class, very high risk would obviously be justified, in my view.

Given that your post misses this, I don’t think it succeeds as a defence of high P(doom).

I think a defence of high P(doom) that addresses the issue above would be quite valuable.

Also, for what it’s worth, I treat “I’ve gamed this out a lot and it seems likely to me” as very weak evidence except in domains where I have a track record of successful predictions or proving theorems that match my intuitions. Before I have learned to do either of these things, my intuitions are indeed pretty unreliable!

Yeah, I don't think the arguments in this post on their own should convince you that P(doom) is high if you're skeptical. There's lots to say here that doesn't fit into the post, e.g. an object-level argument for why AI alignment is "default-failure" / "disjunctive".

Here's what I think the "doomers vs accelerationists" crux can collapse to.

On real computers built by humans, using real noisy data accessible to humans, 

(1) how powerful, in utility terms, will an ASI be?

(2) what advantage will that ASI have over the carefully constrained, stateless ASIs that humans have on their side, which are unable to tell whether their inputs come from the training set or from the real world?

 

The crux in (1) comes from the current empirical observations of power laws, and from just thinking about what intelligence is. It's not magic: for an agent in the real world, intelligence is just a policy mapping inputs to outputs, with policy updates as part of the cycle.

Obviously the policy cannot operate on more bits of precision than the inputs provide. Obviously it can't emit more bits of precision than the actuator output resolution. This has real-world consequences (see https://www.lesswrong.com/posts/qpgkttrxkvGrH9BRr/superintelligence-is-not-omniscience). And possibly the policy quality improves only with the log of compute, and on an increasing number of problems there is zero benefit to a smarter policy.

For example, on many medical questions, current human knowledge is so noisy and unreliable that the best policy known is a decision tree. The game of tic-tac-toe can be solved by a trivial policy, and an ASI will have no advantage on it. Above some base level, intelligence gives no further benefit on a set of problems, and that set grows with the amount of intelligence an agent has.

This is the same principle as Amdahl's law, "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".

So if "improved part" means "above human intelligence", Amdahl's law applies.  
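To make the Amdahl's law analogy concrete, here is a minimal sketch of the standard formula. Reading `improved_fraction` as "the share of problems where above-human intelligence helps at all" is the analogy from this comment, not an established result, and the numbers are hypothetical.

```python
def amdahl_speedup(improved_fraction, improvement_factor):
    """Overall speedup when only part of the work benefits from the improvement."""
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / improvement_factor)


# Hypothetical: intelligence helps on half of the work.
modest = amdahl_speedup(improved_fraction=0.5, improvement_factor=10)
extreme = amdahl_speedup(improved_fraction=0.5, improvement_factor=1e9)
ceiling = 1.0 / (1.0 - 0.5)  # limit as improvement_factor grows without bound

print(f"10x better on half the work: {modest:.3f}x overall")
print(f"1e9x better on half the work: {extreme:.3f}x overall")
print(f"upper bound: {ceiling:.1f}x overall")
```

Under these assumptions, even an arbitrarily large improvement on the affected half cannot push the overall gain past 2x, which is the diminishing-returns shape the argument relies on.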

 

The crux in (2) follows from (1). If intelligence has diminishing returns, then you can gain a large fraction of the benefits of increased intelligence with a system substantially stupider than the smartest one you can possibly build.


More empirical data can answer who's right, and assuming the accelerationists are correct, they will know they were correct for years.  If the doomers were correct, well.  

Your model assumes a lot about the nature of AGI. Sure, if you jump directly to “we’ve created coherent, agential, strategic strong AGI, what happens now?” you end up with a lot of default failure modes. The cruxes of disagreement are about what AGI actually looks like in practice and what the circumstances around its creation are.

  • Is it Agential? Does it have strategic planning capabilities that it tries to act on in the real world? Current systems don’t look like this.

  • Is it coherent? Even if it has the capability to strategically plan, is it able to coherently pursue those goals over time? Current systems don’t even have the concept of time, and there is some reason to believe that coherence and intelligence may be inversely correlated.

  • Do we get successive chances to work on aligning a system? If “AGI” is derived from scaling LLMs and adding cognitive scaffolding, doesn’t it seem highly likely that such systems will be both interpretable and steerable, given their use of natural language and our ability to iterate on failures?

  • Is “kindness” truly completely orthogonal to intelligence? If there is even a slight positive correlation the future could look very different. Paul Christiano made an argument about this on a thread recently.

I think part of the challenge is that AGI is a very nebulous term, and presupposing an agential, strategic, coherent AGI involves assuming a lot of steps in between. I think a lot of the disagreements turn on what the properties of the AGI are rather than on specific claims about the likelihood of successful alignment. And there seems to be a lot of uncertainty about how this technology actually ends up developing that’s not accounted for in many of the standard AI X-risk models.

One of the takehome lessons from ChaosGPT and AutoGPT is that there'll likely end up being agential AIs, even if the original AI wasn't particularly agentic.

AutoGPT is an excellent demonstration of the point. Ask someone on this forum 5 years ago whether they think AGI might be a series of next token predictors strung together with  modular cognition occurring in English and they would have called you insane. 

Yet if that is how we get something close to AGI, it seems like a best-case scenario, since interpretability is solved by default and you can measure alignment progress very easily.

Reality is weird in very unexpected ways. 

To restate what other people have said: the uncertainty is with the assumptions, not the nature of the world that would result if the assumptions were true.

To analogize: it's like we're imagining that a massive, complex bomb could exist in the future, made out of a hypothesized highly reactive chemical.

The uncertainty that influences p(DOOM) isn't 'maybe the bomb will actually be very easy to defuse,' or 'maybe nobody will touch the bomb and we can just leave it there,' it's 'maybe the chemical isn't manufacturable,' 'maybe the chemical couldn't be stored in the first place,' or 'maybe the chemical just wouldn't be reactive at all.'

So to transfer back from the analogy, you are saying the uncertainty is in "maybe it's not possible to create a God-like AI" and "maybe people won't create a God-like AI" and "maybe a God-like AI won't do anything"?

Another one, corresponding in the analogy to the chemical not being reactive at all, is the possibility that even very strong AIs are fundamentally very easy to align by default, for any number of reasons.