Summary

From the piece:

Earlier this year I decided to take a few weeks to figure out what I think about the existential risk from Artificial Superintelligence (ASI xrisk). It turned out to be much more difficult than I thought. After several months of reading, thinking, and talking with people, what follows is a discussion of a few observations arising during this exploration, including:

  • Three ASI xrisk persuasion paradoxes, which make it intrinsically difficult to present strong evidence either for or against ASI xrisk. The lack of such compelling evidence is part of the reason there is such strong disagreement about ASI xrisk, with people often (understandably) relying instead on prior beliefs, self-interest, and tribal reasoning to decide their opinions.
  • The alignment dilemma: should someone concerned with xrisk contribute to concrete alignment work, since it's the only way we can hope to build safe systems, or should they refuse to do such work, as contributing to accelerating a bad outcome? This is part of a broader discussion of the accelerationist character of much AI alignment work, which makes "capabilities versus alignment" a false dichotomy.
  • The doomsday question: are there recipes for ruin -- simple, easily executed, immensely destructive recipes that could end humanity, or wreak catastrophic world-changing damage?
  • What bottlenecks are there on ASI speeding up scientific discovery? And, in particular: is it possible for ASI to discover new levels of emergent phenomena, latent in existing theories?

Excerpts

Here are the passages I thought were interesting enough to tweet about:

"So, what's your probability of doom?" I think the concept is badly misleading. The outcomes humanity gets depend on choices we can make. We can make choices that make doom almost inevitable, on a timescale of decades – indeed, we don't need ASI for that, we can likely4 arrange it in other ways (nukes, engineered viruses, …). We can also make choices that make doom extremely unlikely. The trick is to figure out what's likely to lead to flourishing, and to do those things. The term "probability of doom" began frustrating me after starting to routinely hear people at AI companies use it fatalistically, ignoring the fact that their choices can change the outcomes. "Probability of doom" is an example of a conceptual hazard5 – a case where merely using the concept may lead to mistakes in your thinking. Its main use seems to be as marketing: if widely-respected people say forcefully that they have a high or low probability of doom, that may cause other people to stop and consider why. But I dislike concepts which are good for marketing, but bad for understanding; they foster collective misunderstanding, and are likely to eventually lead to collective errors in action.

 

With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today's AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There's an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he's only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.

Over the past decade I've met many AI safety people who speak as though "AI capabilities" and "AI safety/alignment" work is a dichotomy. They talk in terms of wanting to "move" capabilities researchers into alignment. But most concrete alignment work is capabilities work. It's a false dichotomy, and another example of how a conceptual error can lead a field astray. Fortunately, many safety people now understand this, but I still sometimes see the false dichotomy misleading people, sometimes even causing systematic effects through bad funding decisions.

 

"Does this mean you oppose such practical work on alignment?" No! Not exactly. Rather, I'm pointing out an alignment dilemma: do you participate in practical, concrete alignment work, on the grounds that it's only by doing such work that humanity has a chance to build safe systems? Or do you avoid participating in such work, viewing it as accelerating an almost certainly bad outcome, for a very small (or non-existent) improvement in chances the outcome will be good? Note that this dilemma isn't the same as the by-now common assertion that alignment work is intrinsically accelerationist. Rather, it's making a different-albeit-related point, which is that if you take ASI xrisk seriously, then alignment work is a damned-if-you-do-damned-if-you-don't proposition.

 

"What are those intrinsic reasons it's hard to make a case either for or against xrisk?" There are three xrisk persuasion paradoxes that make it difficult. Very briefly, these are:

  1. The most direct way to make a strong argument for xrisk is to convincingly describe a detailed concrete pathway to extinction. The more concretely you describe the steps, the better the case for xrisk. But of course, any "progress" in improving such an argument actually creates xrisk. "Here's detailed instructions for how almost anyone can easily and inexpensively create an antimatter bomb: [convincing, verifiable instructions]" makes a more compelling argument for xrisk than speculating that: "An ASI might come up with a cheap and easy recipe by which almost anyone can easily create antimatter bombs." Or perhaps you make "progress" by filling in a few of the intermediate steps an ASI might have to do. Maybe you show that antimatter is a little easier to make than one might a priori have thought. Of course, you should avoid making such arguments entirely, or working on them. It's a much more extreme version of your boss asking you to make a case for why you should be fired: that's a very good time to exhibit strategic incompetence. The case for ASI xrisk is often made in very handwavy and incomplete ways; critics of xrisk then dismiss those vague arguments. I recently heard an AI investor complain: "The doomers never describe in convincing detail how things will go bad". I certainly understand their frustration; at the same time, that vagueness is something to celebrate and preserve.
  2. Any sufficiently strong argument for xrisk will likely alter human actions in ways that avert xrisk. The stronger the argument, paradoxically, the more likely it is to avert xrisk. Suppose that in the late 1930s or early 1940s someone had found and publicized a truly convincing argument that nuclear weapons would set the world's atmosphere on fire. If that was the case then the bombs would have been much less likely to have been developed and used. Similarly, as our understanding of human-caused climate change has improved it has gradually caused enormous changes in human action. As one of many examples, in recent years our use of renewable energy has typically grown at a rate of about 15% per year, whereas the rate of fossil fuel growth is no more than a few percent per year, and sometimes much less. That relative increase is not due entirely to climate fears, but those fears have certainly helped drive investment and progress in renewables.
  3. By definition, any pathway to xrisk which we can describe in detail doesn't require superhuman intelligence. A variation of this problem shows up often in fiction: it is difficult for authors to convincingly write characters who are far smarter than the author. I love Vernor Vinge's description of this as a general barrier making it hard to write science fiction, "an opaque wall across the future" in Vinge's terms. Of course, it's not an entirely opaque wall, in that any future ASI will be subject to many constraints we can anticipate today. They won't be able to prove that 2+2=5; they won't be able to violate the laws of physics; they likely (absent very unexpected changes) won't be able to solve the halting problem or to solve NP-complete problems in polynomial time.
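To make the growth-rate arithmetic in point 2 concrete, here is a minimal Python sketch; the ~15%/year and "a few percent"/year figures are the essay's rough characterization, used purely for illustration:

```python
# Rough compounding comparison for the growth rates quoted in point 2.
# The 15%/year and ~3%/year figures are the essay's rough characterization,
# not data from any particular source.

def grow(initial: float, annual_rate: float, years: int) -> float:
    """Level after compounding `annual_rate` for `years` years."""
    return initial * (1 + annual_rate) ** years

for years in (10, 20):
    renewables = grow(1.0, 0.15, years)  # ~15%/year growth
    fossil = grow(1.0, 0.03, years)      # "a few percent"/year growth
    print(f"{years} years: renewables x{renewables:.1f}, fossil x{fossil:.1f}")
# -> 10 years: renewables x4.0, fossil x1.3
# -> 20 years: renewables x16.4, fossil x1.8
```

Even modest-sounding differences in annual growth compound into an order-of-magnitude gap within a couple of decades, which is why the relative shift matters.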

 

"So, when will ASI be able to think its way to new discoveries?" There's a flipside to the above, which is that ASI can be expected to excel in situations where we already have extremely accurate predictive theories; the contingencies are already known and incorporated into the theory, in detail. Indeed, there are already cases where humanity has used such theories to great advantage to make substantial further "progress"32, mostly through thinking ("theory and/or simulation") alone, perhaps augmented with a little experiment:

  • The first atomic bombs were designed in considerable part using theory and simulation, built on models of neutrons and the nucleus which had been developed in the 1930s. Experiment was still required, but it's remarkable the extent to which theorists' thinking drove the design of the bomb.
  • The first hydrogen bombs relied even more heavily on theory, in part because the conditions inside an atomic bomb – used to trigger the hydrogen bomb – were so unusual that they were difficult to study experimentally. Indeed, the very first calculation carried out on the ENIAC computer was a theoretical simulation of the hydrogen bomb.
  • The first stealth fighter was designed principally through theory and simulation, though it was done in conjunction with some physical experimentation.
  • Bose-Einstein condensation was predicted in advance through theory alone. I am told that in a seminar Phil Anderson, originator of the term "emergence", once said that emergent phenomena were so surprising as to be impossible to predict in advance. Someone in the audience at the seminar said "what about Bose-Einstein condensation?" Anderson shot back that that didn't count, since Einstein was a genius.
  • LIGO was driven by and relied upon theory and simulation to a staggering extent; a lot of experimental input was still necessary to characterize and suppress the noise sources, which were not fully understood or characterized in advance.
  • Many remarkable phenomena have been predicted in advance based on extant theories, including lasing, the quantum Hall (but not fractional quantum Hall) effect, quantum teleportation, topological quantum computing, and others.
  • It's intriguing to think of striking new phenomena within computers as examples of this general pattern. Something like, for example, public-key cryptography seems to me an instance of a scientific discovery made entirely through theory, grounded in some extant theoretical system (in that case, the Turing machine theory of computation, as well as some broad notions of cryptography).
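As a toy illustration of the public-key point in the last bullet, here is a sketch of textbook RSA with tiny primes (one instance of public-key cryptography; the numbers are the standard textbook toy example, and the code is emphatically not real cryptography):

```python
# Toy "textbook RSA" with tiny primes, to illustrate the public-key idea as a
# purely theoretical construction. Not secure; for illustration only.
# (Standard textbook numbers; requires Python 3.8+ for pow(e, -1, phi).)
p, q = 61, 53                      # two small primes (kept secret)
n = p * q                          # public modulus: 3233
phi = (p - 1) * (q - 1)            # Euler's totient: 3120
e = 17                             # public exponent, coprime to phi
d = pow(e, -1, phi)                # private exponent: 2753

message = 65                       # any integer smaller than n
ciphertext = pow(message, e, n)    # encrypt with the *public* key (e, n)
recovered = pow(ciphertext, d, n)  # decrypt with the *private* key (d, n)

assert recovered == message
print(n, e, d, ciphertext)         # 3233 17 2753 2790
```

Everything above follows from number theory alone, with no experimental input, which is what makes it a fit for the pattern described in this list.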
Comments
ryan_b

"So, what's your probability of doom?" I think the concept is badly misleading.

Thank heavens, I am not alone

I'm curious why you think that.

My probability that there will be any human alive 100 years from now is .07. If MIRI were magically given effective control over all AI research starting now or if all AI research were magically stopped somehow, my probability would change to .98.

Is what I just wrote also badly misleading? Surely your objection is not to ambiguity in the definition of "alive", but I'm failing to imagine what your objection might be unless we think very differently about how probability works.

(I'm interested in answers from anyone.)

ryan_b

What you wrote is not misleading at all. For example, I was able to glean that you are thinking over a timeline of a century, and generally agree with MIRI's model of the AI problem. My objection to p(doom) is that none of those crucial details are included in the numbers themselves. In practice, the only thing I see people actually using it for is as a signpost for whether someone is a doomer or e/acc.

More specifically, p(doom) fails to communicate anything useful because there is so much uncertainty in the models. The differences in the probabilities people assign have a lot less to do with different assessments of the same thing than they do with assessments of wildly different things.

Consider for a moment an alternative system, where we don't talk about probability at all. Instead, we take a couple of important dimensions of the alignment problem. I propose these two dimensions be timelines, i.e. whether you expect a dangerous AGI to arrive in a short time or a long time, and tractability of alignment, i.e. whether you expect aligning an AGI to be hard or easy. If all we asked people was which side of the origin they are on in each dimension, we could symbolize this as + (meaning we have a lot of time, or alignment is an easy problem) and - (meaning we don't have a lot of time, or alignment is a hard problem). This gives us four available answers:

  • doom++
  • doom+-
  • doom-+
  • doom--

I claim this gives us much more information than p(doom)=.97 because a scalar number tells you nothing about why. The positive and negative on dimensions give us a two-dimensional why: doom-- means the alignment problem is hard and we have very little time in which to solve it. It neatly breaks down into a four quadrant graph for visually representing fundamental areas of agreement/disagreement.
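A minimal sketch of the proposed encoding (my illustration of the comment's scheme; the class and field names are hypothetical):

```python
# Sketch of the two-dimensional "doom quadrant" labels proposed above.
# '+' means we have a lot of time / alignment is easy; '-' means the opposite.
# Sign order (timelines first, then tractability) is an arbitrary choice here.
from dataclasses import dataclass

@dataclass(frozen=True)
class DoomStance:
    long_timelines: bool        # True: dangerous AGI is a long way off
    tractable_alignment: bool   # True: aligning an AGI is easy

    def label(self) -> str:
        sign = lambda b: "+" if b else "-"
        return f"doom{sign(self.long_timelines)}{sign(self.tractable_alignment)}"

# Short timelines AND hard alignment -> the most pessimistic quadrant.
print(DoomStance(long_timelines=False, tractable_alignment=False).label())  # doom--
```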

In short, the probability calculations have no useful meaning unless they are being run on at least similar models; what we should be doing instead is finding ways to expose our personal models clearly and quickly. I think this will lead to better conversations, even in such limited domains as twitter.

With all that said: practical alignment work is extremely accelerationist. If ChatGPT had behaved like Tay, AI would still be getting minor mentions on page 19 of The New York Times. These alignment techniques play a role in AI somewhat like the systems used to control when a nuclear bomb goes off. If such bombs just went off at random, no-one would build nuclear bombs, and there would be no nuclear threat to humanity. Practical alignment work makes today's AI systems far more attractive to customers, far more usable as a platform for building other systems, far more profitable as a target for investors, and far more palatable to governments. The net result is that practical alignment work is accelerationist. There's an extremely thoughtful essay by Paul Christiano, one of the pioneers of both RLHF and AI safety, where he addresses the question of whether he regrets working on RLHF, given the acceleration it has caused. I admire the self-reflection and integrity of the essay, but ultimately I think, like many of the commenters on the essay, that he's only partially facing up to the fact that his work will considerably hasten ASI, including extremely dangerous systems.


I think this is sort of a flipside to the following point: alignment work is incentivized as a side effect of capabilities, and there is reason to believe that alignment and capabilities can live together without either being destroyed. The best illustration is the jailbreak case, where the jailbreaker has aligned the AI to themselves and controls it. The AI doesn't jailbreak itself and become unaligned; instead the alignment/control is transferred to the jailbreaker. We truly do live in a regime where alignment is pretty easy, at least for LLMs. And that's good news compared to AI pessimist views.

The tweet is below:

https://twitter.com/QuintinPope5/status/1702554175526084767

This is also important in the sense that alignment progress will naturally raise misuse risk, and solutions to the control problem look very different from solutions to the misuse problems of AI. One implication is that accelerating is far less bad if misuse is the main concern, and can actually look very positive.

This is a point Simeon raised in the post below, where he describes a tradeoff between misuse and misalignment concerns:

https://www.lesswrong.com/posts/oadiC5jmptAbJi6mS/the-cruel-trade-off-between-ai-misuse-and-ai-x-risk-concerns

So it is very plausible that as the control/misalignment problem is solved, misuse risk increases, which is a different tradeoff than the one pictured here.

There isn't a way to comment there, so I'm commenting here:

--The author, like many others, misunderstands what people mean when they talk about capabilities vs. alignment. They do not mean that everything is either one or the other, and that nothing is both.
--Just because humanity can affect something doesn't mean it is bad to have a credence in it. I think P(doom) is a very important concept, and I am glad people are going around discussing it. It's analogous to asking "Does it look like we are going to win the war?" and "Is that new coronavirus we hear about in the news going to overwhelm the hospital system in this country?" In both cases, whether or not it happens depends hugely on choices humanity makes.
--It looks like this is a thoughtful and insightful piece with lots of good bits which will stick in my mind. E.g. the discussion of how RLHF accelerates ASI timelines, the discussion of situations in which we have accurate predictive theories already, and the persuasion paradox about how people worried about AI x-risk can't be too specific or else they make it (slightly) more likely.
 

The author, like many others, misunderstands what people mean when they talk about capabilities vs. alignment. They do not mean that everything is either one or the other, and that nothing is both

I had a similar reaction, which made me want to go looking for the source of disagreement. Do you have a post or thread that comes to mind which makes this distinction well? Most of what I am able to find just sort of gestures at some tradeoffs, which seems like a situation where we would expect the kind of misunderstanding you describe.

Yep, it’s nicely packaged right here:

To all doing that (directly and purposefully for its own sake, rather than as a mournful negative externality to alignment research): I request you stop.

Ehh, it's not long enough, and doesn't explain things as well as it could.

Winning or losing a war is kinda binary.

Whether a pandemic gets to my country is a matter of degree, since in principle a pandemic that destroyed 90% of counterfactual economic activity in one country could break containment but destroy only 10% in your country.

"Alignment" or "transition to TAI" of any kind is way further from "coinflip" than either of these, so if you think doomcoin is salvageable or want to defend its virtues you need way different reference classes.

Think about the ways in which winning or losing a war isn't binary: there are lots of ways for the implementation details of an agreement to affect your life as a citizen of one of the countries. AI is like this but even further: all the different kinds of outcomes, how central or unilateral the important moments are, which values end up being imposed on the future and at what resolution, etc. People who think "we have a doomcoin toss coming up, now argue about the p(heads)" are not gonna think about this stuff!

To me, "p(doom)" is a memetic PITA as bad as "the real unaligned AI was the corporations/calitalism", so I'm excited that you're defending it! Usually people tell me "yeah you're right it's not a defensible frame haha"

Interesting, thanks. Yeah, I currently think the range of possible outcomes in warfare seems to be more smeared out, across a variety of different results, than the range of possible outcomes for humanity with respect to AGI. The bulk of the probability mass in the AGI case, IMO, is concentrated in "Total victory of unaligned, not-near-miss AGIs" and then there are smaller chunks concentrated in "Total victory of unaligned, near-miss AGIs" (near-miss means what they care about is similar enough to what we care about that it is either noticeably better, or noticeably worse, than human extinction.) and of course "human victory," which can itself be subdivided depending on the details of how that goes.

Whereas with warfare, there's almost a continuous range of outcomes ranging from "total annihilation and/or enslavement of our people" to "total victory" with pretty much everything in between a live possibility, and indeed some sort of negotiated settlement more likely than not.

I do agree that there are a variety of different outcomes with AGI, but I think if people think seriously about the spread of outcomes (instead of being daunted and deciding not to think about it because it's so speculative) they'll conclude that they fall into the buckets I described.

Separately, I think that even if it was less binary than warfare, it would still be good to talk about p(doom). I think it's pretty helpful for orienting people & also I think a lot of harm comes from people having insufficiently high p(doom). Like, a lot of people are basically feeling/thinking "yeah it looks like things could go wrong but probably things will be fine probably we'll figure it out, so I'm going to keep working on capabilities at the AGI lab and/or keep building status and prestige and influence and not rock the boat too much because who knows what the future might bring but anyhow we don't want to do anything drastic that would get us ridiculed and excluded now." If they are actually correct that there's, say, a 5% chance of AI doom, coming from worlds in which things are harder than we expect and an unfortunate chain of events occurs and people make a bunch of mistakes or bad people seize power, maybe something in this vicinity is justified. But if instead we are in a situation where doom is the default and we need a bunch of unlikely things to happen and/or a bunch of people to wake up and work very hard and very smart and coordinate well, in order to NOT suffer unaligned AI takeover...

TAG

The most direct way to make a strong argument for xrisk is to convincingly describe a detailed concrete pathway to extinction. The more concretely you describe the steps, the better the case for xrisk. But of course, any “progress” in improving such an argument actually creates xrisk.

I think that's a false dichotomy: a good argument is nowhere near as detailed as a basic blueprint.