Society really cares about safety. Practically speaking, the binding constraint on deploying your AGI could well be your ability to align your AGI. Solving (scalable) alignment might be worth lots of $$$ and key to beating China.

Look, I really don't want Xi Jinping Thought to rule the world. If China gets AGI first, the ensuing rapid AI-powered scientific and technological progress could well give it a decisive advantage (cf potential for >30%/year economic growth with AGI). I think there's a very real specter of global authoritarianism here.[1] 

Or hey, maybe you just think AGI is cool. You want to go build amazing products and enable breakthrough science and solve the world’s problems.

So, race to AGI with reckless abandon then? At this point, people get into agonizing discussions about safety tradeoffs.[2] And many people just mood affiliate their way to an answer: "accelerate, progress go brrrr," or "AI scary, slow it down."

I see this much more practically. And, practically, society cares about safety, a lot. Do you actually think that you’ll be able, and allowed, to deploy an AI system that has, say, a 10% chance of destroying all of humanity?[3]

Society has started waking up to AGI; like covid, the societal response will probably be a dumpster-fire, but it’ll also probably be quite intense. In many worlds, to deploy your AGI systems, people will need to be quite confident that your AGI won’t destroy the world.

Right now, we’re very much not on track to solve the alignment problem for superhuman AGI systems (“scalable alignment”)—but it’s a solvable problem, if we get our act together. I discuss this in my main post today (“Nobody’s on the ball on AGI alignment”). On the current trajectory, the binding constraint on deploying your AGI could well be your ability to align your AGI—and this alignment solution being unambiguous enough that there is consensus that it works.

Even if you just want to win the AGI race, you should probably want to invest much more heavily in solving this problem.

Things are going to get crazy, and people will pay attention

A mistake many people make when thinking about AGI is imagining a world that looks much like today, except for adding in a lab with a super powerful model. They ignore the endogenous societal response.

I and many others made this mistake with covid—we were freaking out in February 2020, and despairing that society didn’t seem to be even paying attention, let alone doing anything. But just a few weeks later, all of America went into an unprecedented lockdown. If we're actually on our way to AGI, things are going to get crazy. People are going to pay attention.

The wheels for this are already in motion. Remember how nobody paid any attention to AI 6 months ago, and now Bing chat/Sydney going awry is on the front page of the NYT, US senators are getting scared, and Yale econ professors are advocating $100B/year for AI safety? Well, imagine that, but 100x as we approach AGI.

AI safety is going mainstream. Everyone has been primed to be scared about rogue AI by science fiction; all the CEOs have secretly believed in AI risk for years but thought it was too weird to talk about it[4]; and the mainstream media loves to hate on tech companies. Probably there will be further, much scarier wakeup calls (not just misalignment, but also misuse and scary demos in evals). People already freaked out about GPT-4 using a TaskRabbit to solve a captcha—now imagine a demo of AI systems designing a new bioweapon or autonomously self-replicating on the internet, or people using AI coders to hack major institutions like the government or big banks. Already, a majority of the population says they fear AI risk and want FDA-style regulation in polls.

The discourse on it will be incredibly dumb—I can't wait for Ron DeSantis and Kamala Harris's 2028 presidential debate on AI safety—but you won't be able to escape it. (And as stupid as all of this will be, this sort of endogenous societal response is a big reason why I'm more optimistic on AI risk in general.)

The level of media scrutiny, public attention, internal employee pressure, self-regulation, government monitoring, etc. will be way too intense to ignore alignment concerns. We're seeing very early versions of self-regulation with initial AI risk evals efforts. But price in how intense it's all going to get. Do you think the US national security establishment won't get involved once they realize they have a technology more powerful than nukes on their hands? Do you think your board is going to let you release a model if the NYT is reporting in all caps that a large fraction of serious AI experts, prominent CEOs, and politicians think this could go haywire and start actually hurting people? 

Imagine if you tried submitting a drug application to the FDA with a similar risk profile.

A reasonable objection here is: “yes, we did lock down in response to covid, and that was pretty crazy, but also our response to covid was pretty incompetent across the board; it was more like random flailing than actually doing the most effective things; and it’s not even clear if the lockdowns were net-positive.”

I agree! The societal response to AGI will probably be a dumpster-fire.

But there will be a really intense response. I think it’s fairly likely that the kludgy response we do get is enough to throw serious sand into the gears on deployment, unless you have a convincing solution to (scalable) alignment. If anything, the example of lockdowns could point towards society responding in excessively cautious ways, heightening the returns to a convincing alignment solution even further. Yes, our response might also be totally ineffectual; this very much isn’t sufficient to make me sleep soundly at night. But in a large fraction of worlds, if you want to deploy your AGI, people are going to demand of you that we can be confident it’s safe.[5]

The binding constraint on making AGI could be aligning it. You want an unambiguous solution, for which there is consensus that it’s safe.

You don’t even need xrisk concerns for alignment to become the binding constraint on your ability to deploy models. With current techniques, we’re very much not on track for being able to put basic guardrails on models as they become superhuman. Do you really think you’ll be able to deploy GPT-7 all across the economy if you can’t reliably ensure GPT-7 won’t break the law?

The thing is, aligning superhuman AGIs is a much harder problem than near-term alignment. Current alignment techniques rely on human supervision. But as models get superhuman, it will become impossible for humans to reliably supervise these models (e.g., imagine a model proposing a series of actions or 100,000 lines of code too complicated for humans to understand). If you can’t detect bad behavior, you can’t prevent it. (And rather than the “bad behavior” in question being “prevent the models from saying bad words,” as with near-term alignment, the bad behavior for superhuman models looks more like “prevent the models from trying a coup of the US government.”)

I think that aligning superhuman AGIs is a) doable, but b) nobody is on the ball right now—as discussed in my other post. The scalable alignment plans labs currently have (example) might work, but they sort of rely on “improvise in the moment, let’s cross our fingers and hope it works out.”

Even if that bet works out, the safety of your systems will probably be fairly ambiguous until very late, ambiguous enough that you won’t be able to deploy. When asked, “will your superhuman AGI go haywire?”, do you think people will accept “probably not” for an answer?

If you want to win the AGI race, if you want to beat China, you’re probably going to need a better alignment plan. You want an alignment solution good enough to achieve a broad consensus that your superhuman AGI is safe. Ambiguity could be fatal to your ability to press ahead.

You might not like it, you might rage at everyone's excessive safetyism and wish it were different. But, practically speaking, you should be pretty interested in much more serious efforts to solve scalable alignment. Let’s not lose to China because in our fervor to race to AGI, we fail to invest in the alignment research practically necessary to actually deploy AGI.[6]

Thanks to Collin Burns, Holden Karnofsky and Dwarkesh Patel for comments on a draft.


  1. ^

     Though, for now, it seems that China is a few years behind, and the US AI chip export controls might considerably hamper them (great CSIS explainer on the export controls, CSET report on why China might have a hard time catching up). So especially if timelines are short, we have a healthy lead for now.

  2. ^

     Which risk is bigger, AI misalignment or "bad guys getting AGI first"? cf Holden Karnofsky on the "caution vs. competition" frame

  3. ^

     Or at least, it’s widely believed it has such a 10% chance.

  4. ^
  5. ^

     If this ends up being a big barrier to deploying your model in 50% of worlds, that 50% is enough to make alignment incredibly commercially valuable for you.

  6. ^

     An interesting potential implication not discussed in the main post: if alignment techniques become incredibly commercially valuable/key competitive advantages, will these become trade secrets not shared publicly or with other labs?

3 comments

I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn's recent "If your model is going to sell, it has to be safe" piece. Let me try to unpack this:

On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.

My biggest reservations can be boiled down into two points: 

  1. I don't think that commercial incentives will be enough to motivate people to solve the hardest parts of alignment. Commercial incentives will drive people to make sure their system appears to do what users want, which is very different than having systems that actually do what users want or robustly do what users want even as they become more powerful. Or to put it another way: near-term commercial incentives don't really cause me to put appropriate amounts of attention on things like situational awareness or deceptive alignment. I think commercial incentives will be sufficient to reduce the odds of Bingchat fiascos, but I don't think they'll motivate the kind of alignment research that's trying to handle deception, sharp left turns, or even the most ambitious types of scalable oversight work.
  2. The research that is directly incentivized by commercial interests is least likely to be neglected. I expect the most neglected research to be research that doesn't have any direct commercial benefit. I expect AGI labs will invest a substantial amount of resources to prevent future Bingchat scenarios and other instances of egregious deployment harms. The problem is that I expect many of these approaches (e.g., getting really good at RLHFing your model such that it no longer displays undesirable behaviors) will not generalize to more powerful systems. I think you (and many others) agree with this, but I think the important point here is that the economic incentives will favor RLHFy stuff over stuff that tackles problems that are not as directly commercially incentivized.

As a result, even though I agree with many of your subclaims, I'm still left thinking, "huh, the message I want to spread is not something like 'hey, in order to win the race or sell your product, you need to solve alignment.'"

But rather something more like "hey, there are some safety problems you'll need to figure out to sell/deploy your product. Cool that you're interested in that stuff. There are other safety problems, often more speculative ones, that the market is not incentivizing companies to solve. On the margin, I want more attention paid to those problems. And if we just focus on solving the problems that are required for profit/deployment, we will likely fool ourselves into thinking that our systems are safe when they merely appear to be safe, and we may underinvest in understanding/detecting/solving some of the problems that seem most concerning from an x-risk perspective."

I agree with aspects of this critique. However, to steelman Leopold, I think he is not just arguing that demand-driven incentives will drive companies to solve alignment due to consumers wanting safe systems, but rather that, over and above ordinary market forces, constraints imposed by governments, media/public advocacy, and perhaps industry-side standards will make it ~impossible to release a very powerful, unaligned model. I think this points to a substantial underlying disagreement in your models: Leopold thinks that governments and the public will "wake up" to catastrophic risk from AI quickly enough that regulatory and PR forces will effectively prevent the release of misaligned models, including through evals/ways of detecting misalignment that are more robust than those that might be used by ordinary consumers (who, as you point out, could likely be fooled by surface-level alignment due to RLHF).

I roughly agree with Akash's comment.

But also some additional points:

  • It's decently likely that it will be pretty easy to get GPT-7 to avoid breaking the law or other egregious issues. As systems get more capable, basic alignment approaches get better at preventing stuff we can measure well. It's plausible that scalable approaches will be needed to avoid egregious and obvious failures which aren't takeover, but it currently seems unlikely (see also list of lethalities and here: "The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment."). I think it's possible that reality will force you to solve the scalable problem directly because otherwise you'll see egregious failures, but that seems more like 25% probability.
  • I agree that it's plausible that scalable/robust alignment will be the binding constraint on deploying AGI (if there isn't otherwise a robust solution by then), and that this generally becomes more likely as AGI becomes more salient. But the chance seems more like 50/50 to me than a sure bet. Consider gain of function research and lab leaks. These issues became somewhat more salient during covid, but there's still a ton of extremely negative EV gain of function research (IMO).