Three of the big AI labs say that they care about alignment and that they think misaligned AI poses a potentially existential threat to humanity. These labs continue to try to build AGI. I think this is a very bad idea.

The leaders of the big labs are clear that they do not know how to build safe, aligned AGI. The current best plan is to punt the problem to a (different) AI,[1] and hope that it can solve it. It seems like a clearly bad idea to try to build AGI when you don’t know how to control it, especially if you readily admit that misaligned AGI could cause extinction.

But there are certain reasons that make trying to build AGI a more reasonable thing to do, for example:

  • They want to build AGI first because they think this is better than if a less safety-focused lab builds it
  • They are worried about multi-polar scenarios
  • They are worried about competition from other nations, specifically from China
  • They think one needs to be able to play with the big models in order to align the bigger models, and there is some other factor which means we will soon have bigger models we need to align

I think the labs should be explicit that they are attempting to build AGI[2] and that this is not safe, but that there are specific reasons why they think this is the best course of action, and that if these reasons no longer hold they will stop scaling or attempting to build AGI. They should be clear about what these reasons are, and they should say all of this explicitly to the public and to policymakers.

I want a statement like:

We are attempting to build AGI, which is very dangerous and could cause human extinction. We are doing this because of the specific situation we are in.[3] We wish we didn’t have to do this, but given the state of the world, we feel like we have to do this, and that doing this reduces the chance of human extinction. If we were not in this specific situation, then we would stop attempting to build AGI. If we noticed [specific, verifiable observations about the world], then we would strongly consider stopping our attempt to build AGI.

Without statements like this, I think labs should not be surprised if others think they are recklessly trying to build AGI. 

  1. Either an automated alignment researcher, or something to do with scalable oversight.

  2. Or scale AI systems to levels that are not known to be safe.

  3. It is important that they actually specify what the situation is that forces them to build AGI.


This seems like a good thing for labs to do[1]. I'd go one step earlier and propose that labs make a clear and explicit page (on their website or similar) stating their views on the risk from powerful AI systems. The proposal given in this post seems somewhat more ambitious and costly than the thing I'm proposing in this comment, though the proposal in the post is correspondingly somewhat better.

As far as what a "page stating their views on risk" looks like, I'm imagining something like (numbers are made up):

Views within [AI LAB] vary, but leadership thinks that the risk of AI killing over 1 billion people is around 10%. We think the probability of an AI originating from [AI LAB] killing over 1 billion people in the next 5 years is around 3%. We specifically think that the probability of an AI system mostly autonomously taking over a large chunk of the world is around 10%. The risk of an AI system assisting with terrorism which kills over 1 billion people is around 3%. ...

AI labs often use terms like "AI safety" and "catastrophe". It's probably unclear what problem these terms are pointing at. I'd like it if, whenever they said "catastrophe", they said something like:

We do XYZ to reduce the probability of AI caused catastrophe (e.g. the deaths of over 1 billion people, see here for our views on AI risk)

Where here links to the page discussed above.

And similarly when using the term "AI safety":

By AI safety, we primarily mean the problem of avoiding AI caused catastrophe (e.g. the deaths of over 1 billion people, see here for our views on AI risk)

I'd consider this ask fulfilled even if this page stated quite optimistic views. At that point, there would be a clear disagreement to highlight.

I'm not sure about how costly these sorts of proposals are (e.g. because it makes customers think you're crazy). Possibly, labs could coordinate to release things like this simultaneously to avoid tragedy of the commons (there might be anti-trust issues with this).


  1. Though I maybe disagree with various specific statements in this post.

I'm not sure about how costly these sorts of proposals are (e.g. because it makes customers think you're crazy). Possibly, labs could coordinate to release things like this simultaneously to avoid tragedy of the commons (there might be anti-trust issues with this).

Yep, buy-in from the majority of frontier labs seems pretty important here. If OpenAI went out and said ‘We think that there’s a 10% chance that AGI we develop kills over 1 billion people’, but Meta kept their current stance (along the lines of ‘we think that the AI x-risk discussion is fearmongering and the systems we’re building will be broadly beneficial for humanity’) then I’d guess that OpenAI would lose a ton of business. From the point of view of an enterprise using OpenAI’s products, it can’t help your public image to be using the products of a lab that thinks it has a 10% chance of ending the world - especially if there are other labs offering similar products that don’t carry this burden. In a worst-case scenario, I can imagine that this puts OpenAI directly in the firing line of regulators, whilst Meta gets off far more lightly.

I'm not sure this effect is as strong as one might think. For one, Dario Amodei (CEO of Anthropic) claimed his P(doom) was around 25% (specifically, "the chance that his technology could end human civilisation"). I remember Sam Altman saying something similar, but can't find an exact figure right now. Meanwhile, Yann LeCun (Chief AI Scientist at Meta) maintains approximately the stance you describe. None of this, as far as I'm aware, has led to significant losses for OpenAI or Anthropic.

Is it really the case that making these claims at an institutional level, on a little corner of one's website, is so much stronger than the CEO of one's company espousing these views very publicly in interviews? Intuitively, this seems like it wouldn't make a massive difference.

In a worst-case scenario, I can imagine that this puts OpenAI directly in the firing line of regulators, whilst Meta gets off far more lightly.

I'm interested to know if there's any precedent for this, i.e. a company being regulated further because they claimed their industry needed it, while those restrictions weren't applied universally.

OpenAI's "Planning for AGI and beyond" post already explains why they are building AGI:

Our mission is to ensure that artificial general intelligence—AI systems that are generally smarter than humans—benefits all of humanity.

If AGI is successfully created, this technology could help us elevate humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge that changes the limits of possibility.

AGI has the potential to give everyone incredible new capabilities; we can imagine a world where all of us have access to help with almost any cognitive task, providing a great force multiplier for human ingenuity and creativity.

On the other hand, AGI would also come with serious risk of misuse, drastic accidents, and societal disruption. Because the upside of AGI is so great, we do not believe it is possible or desirable for society to stop its development forever; instead, society and the developers of AGI have to figure out how to get it right. ^A

^A We seem to have been given lots of gifts relative to what we expected earlier: for example, it seems like creating AGI will require huge amounts of compute and thus the world will know who is working on it, it seems like the original conception of hyper-evolved RL agents competing with each other and evolving intelligence in a way we can’t really observe is less likely than it originally seemed, almost no one predicted we’d make this much progress on pre-trained language models that can learn from the collective preferences and output of humanity, etc.

AGI could happen soon or far in the future; the takeoff speed from the initial AGI to more powerful successor systems could be slow or fast. Many of us think the safest quadrant in this two-by-two matrix is short timelines and slow takeoff speeds; shorter timelines seem more amenable to coordination and more likely to lead to a slower takeoff due to less of a compute overhang, and a slower takeoff gives us more time to figure out empirically how to solve the safety problem and how to adapt.

This doesn't include discussion of what would make them decide to stop building AGI, but would you be happy if other labs wrote a similar statement? I'm not sure that AI labs actually have an attitude of "we wish we didn't have to build AGI."

Slightly superhuman AGIs (that are very productive and have serial speed advantage, but not qualitatively more intelligent) are not directly uncontainably dangerous, if they don't get an opportunity to build improved versions. The current mode of scaling seems unlikely to suddenly jump from weak systems to superintelligent systems without visibly passing through this stage, if AGIs at this stage don't immediately escape and improve themselves outside the standard process.

An AGI is an enormous risk, primarily because it can trigger building of misaligned superintelligence, either accidentally or through misuse. Any use of first AGIs that isn't targeting improvement of alignment and security is a kind of misuse in the sense of opportunity cost, since it's burning the timeline to misaligned superintelligence to do something other than addressing its risks.

So one problem with potential regulation (which in its origin and application gets to interact with public statements that companies make about their plans and motives) is that it might be permissive enough to build weak AGIs, without being permissive enough to apply them for developing safety measures against emergence of misaligned superintelligence.

Slightly superhuman AGIs (that are very productive and have serial speed advantage, but not qualitatively more intelligent) are not directly uncontainably dangerous, if they don't get an opportunity to build improved versions. The current mode of scaling seems unlikely to suddenly jump from weak systems to superintelligent systems without visibly passing through this stage, if AGIs at this stage don't immediately escape and improve themselves outside the standard process.

The "if"s seem like possibly big "if"s.  If it's slightly superhuman at computer security, and it has an opportunity to run arbitrary code on some internet-connected machine, then it seems likely that some instance of it, at some point, will find a vulnerability in OpenSSL or something that lets it take over millions of machines on the internet, which it can use to run thousands of instances of itself to brute-force finding more vulnerabilities, and thus bootstrap to taking over as much of the internet as it wants.  Which would then let it work on self-improvement and other goals.

A slight advantage in doing computer security research won't give an entity the ability to take over the internet, by a long shot, especially if it does not have backing by nation state actors. The NSA for instance, as an organisation, has been good at hacking for a long time, and while certainly they can and have done lots of interesting things, they wouldn't be able to take over the world, probably even if they tried and did it with the backing of the full force of the US military.

Indeed, for some computer security problems, even superintelligence might not confer any advantage at all! It's perfectly possible, say, that a superintelligence running on a Matrioshka brain a million years hence will find only modest improvements upon current best attacks against the full AES-128. Intelligence allows one to do math better and, occasionally, to find ways and means that side-step mathematical guarantees, but it does not render the adversary omnipotent; an ASI still has to accept (or negotiate around) physical, mathematical and organizational limits to what it can do. In that sense, a lot of the ASI safety debate I think runs on overpowered adversaries, which will in the long run be bad both in terms of achieving ASI safety (because in an overpowered adversary model, real dangers risk remaining unidentified and unfixed) and in terms of realizing the potential benefits of creating AGI/ASI.

I'm not sure how much of this you already know, but the majority of security vulnerabilities are things like "failure to check the bounds of a buffer" or "failure to sanitize or escape user input before plugging it into a command parser"—dumb mistakes in implementation, in other words.  It's much rarer to find a problem in the cryptographic algorithms (although that happens occasionally, like MD5).  If we look through OpenSSL's already-fixed vulnerabilities list:

https://www.openssl.org/news/vulnerabilities.html

Browser page search says that "buffer over" (as in buffer overflow, overrun, or over-read) appears on the page 24 times (although there's double-counting there, as it tends to appear in the title and once or twice in the description).  You don't need to be a world-class security researcher to find these security holes; it's more a matter of (a) happening to look in the right place, (b) having some knowledge and creativity and intelligence in figuring out how it could be exploited.
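To make the "failure to sanitize or escape user input" kind of mistake concrete, here is a minimal Python sketch. It is not taken from OpenSSL or any real codebase; the `cat` command and the input string are made up for illustration, and nothing is executed, the code only builds and tokenizes strings.

```python
import shlex

# Hypothetical attacker-controlled input: looks like a filename, carries a second command.
user_input = "notes.txt; rm -rf /"

# Mistake: pasting the input straight into a command line.
unsafe_cmd = f"cat {user_input}"

# Fix: escape the input so it stays a single argument.
safe_cmd = f"cat {shlex.quote(user_input)}"

# shlex.split only tokenizes on whitespace and quotes (a real shell would also
# treat ';' as a command separator), but that is enough to show the difference:
# the unescaped input spills into several tokens, the escaped one stays whole.
unsafe_tokens = shlex.split(unsafe_cmd)
safe_tokens = shlex.split(safe_cmd)

print(unsafe_tokens)
print(safe_tokens)
```

The bug is a one-line omission rather than anything deep, which is the point: spotting it is mostly a matter of looking in the right place.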

[the NSA] wouldn't be able to take over the world

The NSA isn't able to run thousands of copies of itself on machines it hacks into, nor use that to quickly create more powerful instances of itself.  So that part of world domination is clearly out.

But if you mean "the NSA wouldn't be able to do that first step of taking over millions of computers" (which the hypothetical AI would then use to bootstrap)... I disagree.  Do you know about Stuxnet?  Believed to be written by some combination of the NSA and Mossad.  Observe:

Stuxnet attacked Windows systems using an unprecedented four zero-day attacks (plus the CPLINK vulnerability and a vulnerability used by the Conficker worm[63]). It is initially spread using infected removable drives such as USB flash drives,[21][45] which contain Windows shortcut files to initiate executable code.[64] The worm then uses other exploits and techniques such as peer-to-peer remote procedure call (RPC) to infect and update other computers inside private networks that are not directly connected to the Internet.[65][66][67] The number of zero-day exploits used is unusual, as they are highly valued and malware creators do not typically make use of (and thus simultaneously make visible) four different zero-day exploits in the same worm.[23] Amongst these exploits were remote code execution on a computer with Printer Sharing enabled,[68] and the LNK/PIF vulnerability,[69] in which file execution is accomplished when an icon is viewed in Windows Explorer, negating the need for user interaction.[70]

So the NSA and/or Mossad had all these exploits and were sitting on them.  Makes it plausible that, today, they have more that they're sitting on.

The Conficker worm itself apparently did take over an estimated "9 million to 15 million" machines.  So clearly that is doable.

in this talk, @Connor Leahy says that he's talked to the head of "all of these labs" and they all say approximately this.

(i agree it should be more public and clearly legible than being-reported-about-through-connor)

Which are those labs? OpenAI, Anthropic, DeepMind maybe?, what else?

Agree, and in addition to this, I think labs should be really explicit about which parts of their motives are supposedly altruistic and which parts are explicitly about avoiding the scenario in which they, personally, don't benefit from AGI. Because "we need to build AGI because it's the only way to advance humanity's technology from now on" is one thing (debatable, IMO, but at least an argument one can make) and "we need to build AGI because if we don't there's a risk that immortality tech won't be here soon enough to apply to our CEO" is another. Never mind the talk about "capturing all value" that Sam Altman has done; "I need to ensure absolute power for myself because I don't feel like trusting anyone else with it" is supervillain talk. And these arguments should be substantiated: if they think there are risks, we want to see real numbers, estimates, and the processes and data by which those numbers were calculated. It's one thing to argue something is worth risking the Earth for, which can in some extreme cases be true, but to do so unilaterally without even being transparent about the precise magnitude and nature of the risks is indefensible.

"we need to build AGI because it's the only way to advance humanity's technology from now on" is one thing (debatable, IMO, but at least an argument one can make)

It's not a sane argument in favor of advancing now vs. later when it's less likely to kill everyone (because there was more time to figure out how to advance safely). The same holds for any argument in the "think of the enormous upside" reference class, the upside isn't going anywhere, it's still there in 20 years.

Instead, there is talk about scaling to 10 times GPT-4 compute in 2024 and many dozens of times GPT-4 compute in 2025 (billions of dollars in compute). Nobody knows what amount of compute is sufficient for AGI, in the sense of capability for mostly autonomous research, especially with some algorithmic improvements. Any significant scaling poses a significant risk of reaching AGI. And once there is AGI, pausing before superintelligence becomes much less plausible than it is now.

It's not a sane argument in favor of advancing now vs. later when it's less likely to kill everyone (because there was more time to figure out how to advance safely). The same holds for any argument in the "think of the enormous upside" reference class, the upside isn't going anywhere, it's still there in 20 years.

Oh, I mean, I do agree. Unless you apply some really severe discount rate to those upsides, there's no way they can outweigh a major risk of extinction (and if you are applying a really severe discount rate because you think you, personally, will die before seeing them, then that's again just being really selfish). But I'm saying it is at least an argument we should try to reckon with at the societal level. Petty private desire for immortality, by contrast, should not even be entertained. If you want to risk humanity for the sake of your own life, you're literally taking the sort of insane bet you'd expect a villainous fantasy video game necromancer to take. Not only is it evil, it's not even particularly well-written evil.

Nobody knows what amount of compute is sufficient for AGI, in the sense of capability for mostly autonomous research, especially with some algorithmic improvements.

This is what I find really puzzling. The human brain, which only crossed the sapience threshold about a quarter of a million years ago, has  synapses, and presumably a lot of evolved, genetically determined inductive biases. Synapses have very sparse connectivity, so synapse counts should presumably be compared to parameter counts after sparsification, which tends to reduce them by 1-2 orders of magnitude. GPT-4 is believed to have  parameters: it's an MoE model so has some sparsity and some duplication, so call that  for a comparable number. So GPT-4 is showing "sparks of AGI" something like 3 or 4 orders of magnitude before we would expect AGI from a biological parallel. I find that astonishingly low. Bear in mind also that a human brain only needs to implement one human mind, whereas an LLM is trying to learn to simulate every human who's ever written material on the Internet in any high/medium-resource language, a clearly harder problem.

I don't know if this is evidence that AGI is a lot easier than humans make it look, or a lot harder than GPT-4 makes it look? Maybe controlling a real human body is an incredibly compute-intensive task (but then I'm pretty sure that < 90% of the human brain's synapses are devoted to motor control and controlling the internal organs, and more than 10% are used for language/visual processing, reasoning, memory, and executive function). Possibly we're mostly still fine-tuned for something other than being an AGI? Given the implications for timelines, I'd really like to know.

I had a thought. When comparing parameter counts of LLMs to synapse counts, for parity the parameter count of each attention head should be multiplied by the number of locations that it can attend to, or at least its logarithm. That would account for about an order of magnitude of the disparity. So make that 2-3 orders of magnitude. That sounds rather more plausible for sparks of AGI to full AGI.
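As a rough check on the "about an order of magnitude" figure, here is a small Python sketch of the arithmetic. The context length of 8192 tokens is an assumption chosen for illustration, not a figure from this thread, and the sketch applies the factor uniformly even though the argument only concerns attention-head parameters.

```python
import math

def orders_of_magnitude(factor: float) -> float:
    """Express a multiplicative factor as a number of powers of ten."""
    return math.log10(factor)

# Assumption for illustration: each head can attend to 8192 positions.
context_length = 8192

# Multiplying by the number of attendable locations itself.
full_factor_oom = orders_of_magnitude(context_length)

# Multiplying only by the logarithm (base 2) of the number of locations.
log_factor_oom = orders_of_magnitude(math.log2(context_length))

print(f"full multiplication: {full_factor_oom:.2f} orders of magnitude")
print(f"log multiplication:  {log_factor_oom:.2f} orders of magnitude")
```

Under this assumption the logarithmic correction comes to a bit over one order of magnitude, while the full multiplication would be closer to four, so the "about an order of magnitude" estimate corresponds to the more conservative logarithmic version.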

They are worried about competition from other nations, specifically from China

The People's Republic of China are probably closer than the United Arab Emirates, and definitely closer than the Russians, or the North Koreans, or random Somali warlords. However, as the cost of compute comes down and algorithmic efficiency improvements continue, the number of players with the needed resources will increase (especially if significant parts of the necessary technology are open-sourced). The PRC are also arguably less likely than some of the others on that list to unintentionally kill us all (they are pretty familiar with using technological means to control the thinking of intelligent agents, after all), and thus more likely to actually succeed in enshrining the permanent global rule of the Chinese Communist Party into the terminal goal of a sovereign superintelligence. However, judging by their actions (as opposed to their public ideology), I think they are fairly unlikely to have an AI implement anything resembling actual CEV.

If we (the West, the G20, or the UN) decided to try to enforce an actual extended pause, then access to sufficient compute, talent, and technical know-how would need to be strictly prevented for everyone, I would imagine using the intelligence and military resources of most of the leading nations in the world. Moore's Law, academic publishing, and the open-source community would each make this increasingly difficult. (Bear in mind that the Chinese are currently buying high-end gaming graphics cards and doing some resoldering work on them to turn them into usable AI training compute.) We would thus be building up a large overhang.

I'd really love to hear some serious, practical proposals for how we, as a civilization or international community, could implement an extended pause (preferably without requiring anything in "pivotal act" territory.)