Introduction

In this post, I will discuss the cruel trade-off between misuse concerns and X-risks (accidental risks) regarding racing. It is important to note that this post does not advocate for one risk being more plausible than the other, nor does it make any normative statements. The purpose is to analyze the outcome of a particular worldview.

In one sentence, the key claim is: “If an entity with the power to develop AGI is mostly worried about misuse AND thinks that it is the best entity (morally speaking) in its reference class, then from its point of view it is good to race. This is bad for accidental risks.”

Epistemic status: I’ve written up this post pretty quickly, after a conversation where it didn’t seem clear to someone. I'm confident about the general claim, less about specific claims. 

Why Should Those Worried about Misuse Race?

  1. a. AGI is the most powerful thing ever created. Whoever controls it will have an unprecedented level of control over everyone else. So if someone with bad intentions ends up controlling it, that’s probably the end for everyone else. It also leaves plenty of room for scenarios like stable authoritarianism.
    b. Thus, if you’re at the head of an AGI lab and you’re genuinely worried about the ethics of another AGI lab’s CEO, you would want to ensure that they don't develop AGI before you do. You could be worried about individuals getting power, or about certain cultures or governments getting power (the US fearing China, or one lab fearing another).
  2. a. In a world where weak AGIs accessible to a wide range of people enable the creation of bioweapons or facilitate massive cyberattacks, there is a danger of reaching a point where everyone can kill everyone else but there is not yet a powerful "defensive AGI" to prevent this via global deterrence & surveillance.
    b. If you view your lab as responsible and you’re primarily concerned about misuse, it makes sense to race towards the development of a defensive AGI powerful enough to avoid the dangerous scenario mentioned above.
  3. a. Preventing jailbreaks is hard. So you may want to reach AGI as early as possible with as few deployments as possible. So the earlier you get to AGI, the better the world is. 
    b. Hence you should race as fast as possible[1] internally and deploy as little as possible externally to get the necessary capital to be able to reach the goalpost as early as possible. 

And to be clear, the cause of the racing here is beliefs about the world, not secretly evil intentions.

Some Reasons Why Racing is Bad for Accidents

On the other hand, for those who are worried about accidental risks and who are pessimistic about our chances of solving those issues in a short amount of time (“alignment is hard”), racing is one of the worst things that could happen, if not the worst:

  1. Racing leaves very little time to address problems as they arise. It incentivizes labs to patch problems in the cheapest way that works, e.g. fine-tuning. This is differentially more worrying for accidents than for misuse because while you can control misuse with access restrictions, you can’t control accidents.
  2. Racing leaves no time to understand what’s going on inside the models. It also incentivizes building the simplest AGI that works and is not easily misused, rather than something with strong foundations for making sure it doesn’t cause accidents. Not understanding what’s going on is most worrying for those concerned about deception scenarios.
  3. Racing leads labs to cut corners internally on red teaming for accidents, on making sure the model is not deceptive, etc.

 

It’s important to note that apart from this one point (whether we should race really fast), most other measures are the same for misuse problems and accidental risk problems, e.g. auditing, licensing, developing models closed source, getting compute governance right, working to make models robust to jailbreaks, etc.

 

  1. ^

    "As fast as possible" includes constraints like "make your model nontrivial to break, to prevent misuse". The main problem is just that preventing misuse requires, a priori, much less engineering and intervention on the model itself than preventing accidents.

1 comment

To repurpose a quote from The Cincinnati Enquirer: The saying "AI X-risk is just one damn cruelty after another," is a gross overstatement. The damn cruelties overlap.

When I saw the title, I thought, "Oh no. Of course there would be a tradeoff between those two things, if for no other reason than precisely because I hadn't even thought about it and I would have hoped there wasn't one." Then as soon as I saw the question in the first header, the rest became obvious.

Thank you so much for writing this post. I'm glad I found it, even if months later. This tradeoff has a lot of implications for policy and outreach/messaging, as well as how I sort and internalize news in those domains.

Without having thought about it enough for an example: It sounds correct to me that in some contexts, appreciating both kinds of risk drives response in the same direction (toward more safety overall). But I have to agree now that in at least some important contexts, they drive in opposite directions.