OpenAI has a new blog post out titled "Governance of superintelligence" (subtitle: "Now is a good time to start thinking about the governance of superintelligence—future AI systems dramatically more capable than even AGI"), by Sam Altman, Greg Brockman, and Ilya Sutskever.

The piece is short (~800 words), so I recommend most people just read it in full.

Here's the introduction/summary (bold added for emphasis):

Given the picture as we see it now, it’s conceivable that within the next ten years, AI systems will exceed expert skill level in most domains, and carry out as much productive activity as one of today’s largest corporations.

In terms of both potential upsides and downsides, superintelligence will be more powerful than other technologies humanity has had to contend with in the past. We can have a dramatically more prosperous future; but we have to manage risk to get there. Given the possibility of existential risk, we can’t just be reactive. Nuclear energy is a commonly used historical example of a technology with this property; synthetic biology is another example.

We must mitigate the risks of today’s AI technology too, but superintelligence will require special treatment and coordination.


And below are a few more quotes that stood out:


"First, we need some degree of coordination among the leading development efforts to ensure that the development of superintelligence occurs in a manner that allows us to both maintain safety and help smooth integration of these systems with society."


"Second, we are likely to eventually need something like an IAEA for superintelligence efforts; any effort above a certain capability (or resources like compute) threshold will need to be subject to an international authority that can inspect systems, require audits, test for compliance with safety standards, place restrictions on degrees of deployment and levels of security, etc."


"It would be important that such an agency focus on reducing existential risk and not issues that should be left to individual countries, such as defining what an AI should be allowed to say."


"Third, we need the technical capability to make a superintelligence safe. This is an open research question that we and others are putting a lot of effort into."


"We think it’s important to allow companies and open-source projects to develop models below a significant capability threshold, without the kind of regulation we describe here"


"By contrast, the systems we are concerned about will have power beyond any technology yet created, and we should be careful not to water down the focus on them by applying similar standards to technology far below this bar."


"we believe it would be unintuitively risky and difficult to stop the creation of superintelligence"


As usual, the part that seems bonkers crazy is where they claim the best thing they can do is keep making every scrap of capabilities progress they can. Keep making AI as smart as possible, as fast as possible.

"This margin is too small to contain our elegant but unintuitive reasoning for why". Grump. Let's please have a real discussion about this some time.

Okay, I'll try to steelman the argument. Some of this comes from OpenAI and Altman's posts; some of it is my addition.

Allowing additional compute overhang increases the likely speed of takeoff. If AGI through LLMs is possible, and that isn't discovered for another 5 years, it might be achieved in the first go, with no public discussion and little alignment effort.

LLMs might be the most-alignable form of AGI. They are inherently oracles, and cognitive architectures made from them have the huge advantage of natural language alignment and vastly better interpretability than other deep network approaches. I've written about this in Capabilities and alignment of LLM cognitive architectures. I'm eager to learn I'm wrong, but in the meantime I actually think (for reasons spelled out there and to be elaborated in future posts) that pushing capabilities of LLMs and cognitive architectures is our best hope for achieving alignment, even if that speeds up timelines. Under this logic, slowing down LLM progress would be dangerous, as other approaches like RL agents would pass them by before appearing dangerous.

Edit: so in sum, I think their logic is obviously self-serving, but actually pretty solid when it's steelmanned. I intend to keep pushing this discussion in future posts.

I really should have something short to say, that turns the whole argument on its head, given how clear-cut it seems to me. I don't have that yet, but I do have some rambly things to say.

I basically don't think overhangs are a good way to think about things, because the bridge that connects an "overhang" to an outcome like "bad AI" seems flimsy to me. I would like to see a fuller explication some time from OpenAI (or a suitable steelman!) that can be critiqued. But here are some of my thoughts.

The usual argument that leads from "overhang" to "we all die" has some imaginary other actor who is scaling up their methods with abandon at the end, killing us all because it's not hard to scale and they aren't cautious. This is then used to justify scaling up your own method with abandon, hoping that we're not about to collectively fall off a cliff.

For one thing, the hype and work being done now is making this problem a lot worse at all future timesteps. There was (and still is) a lot people need to figure out regarding effectively using lots of compute. (For instance, architectures that can be scaled up, training methods and hyperparameters, efficient compute kernels, putting together datacenters and interconnect, data, etc etc.) Every chipmaker these days has started working on things with a lot of memory right next to a lot of compute with a tonne of bandwidth, tailored to these large models. These are barriers-to-entry that it would have been better to leave in place, if one was concerned with rapid capability gains. And just publishing fewer things and giving out fewer hints would have helped.

Another thing: I would take the whole argument as being more in good-faith if I saw attempts being made to scale up anything other than capabilities at high speed, or signs that made it seem at all likely that "alignment" might be on track. Examples:

  • A single alignment result that was supported by a lot of OpenAI staff. (Compare and contrast the support that the alignment team's projects get to what a main training run gets.)
  • Any focus on trying to claw cognition back out of the giant inscrutable floating-point numbers, into a domain easier to understand, rather than pouring more power into the systems that get much harder to inspect as you scale them. (Failure to do this suggests OpenAI and others are mostly just doing what they know how to do, rather than grappling with navigating us toward better AI foundations.)
  • Any success in understanding how shallow vs deep the thinking of the LLMs is, in the sense of "how long a chain of thoughts/inferences can it make as it composes dialogue", and how this changes with scale. (Since the whole "LLMs are safer" thing relies on their thinking being coupled to the text they output; otherwise you're back in giant inscrutable RL agent territory)
  • The delta between "intelligence embedded somewhere in the system" and "intelligence we can make use of" looking smaller than it does. (Since if our AI gets to use more of its intelligence than we do, and this gets worse as we scale, this looks pretty bad for the "use our AI to tame the AI before it's too late" plan.)

Also I can't make this point precisely, but I think there's something like capabilities progress just leaves more digital fissile material lying around the place, especially when published and hyped. And if you don't want "fast takeoff", you want less fissile material lying around, lest it get assembled into something dangerous.

Finally, to more directly talk about LLMs, my crux for whether they're "safer" than some hypothetical alternative is about how much of the LLM "thinking" is closely bound to the text being read/written. My current read is that they're more like doing free-form thinking inside, that tries to concentrate mass on the right prediction. As we scale that up, I worry that any "strange competence" we see emerging is due to the LLM having something like a mind inside, and less due to it having accrued more patterns.

That is indeed a lot of points. Let me try to parse them and respond, because I think this discussion is critically important.

Point 1: overhang.

Your first two paragraphs seem to be pointing to downsides of progress, and saying that it would be better if nobody made that progress. I agree. We don't have guaranteed methods of alignment, and I think our odds of survival would be much better if everyone went way slower on developing AGI.

The standard thinking, which could use more inspection, but which I agree with, is that this is simply not going to happen. Individuals that decide to step aside are slowing progress only slightly. This leaves compute overhang that someone else is going to take advantage of, with nearly the competence, and only slightly slower. Those individuals who pick up the banner and create AGI will not be infinitely reckless, but the faster progress from that overhang will make whatever level of caution they have less effective.

This is a separate argument from regulation. Adequate regulation will slow progress universally, rather than leaving it up to the wisdom and conscience of every individual who might decide to develop AGI.

I don't think it's impossible to slow and meter progress so that overhang isn't an issue. But I think it is effectively even harder than alignment. We have decent suggestions on the table for alignment now, and as far as I know, no equally promising suggestions for getting everyone (and it does take almost everyone coordinating) to pass up the immense opportunities offered by capabilities overhangs.

Point 2: Are LLMs safer than other approaches?

I agree that this is a questionable proposition. I think it's worth questioning. Aiming progress at easier-to-align approaches seems highly worthwhile.

I agree that an LLM may have something like a mind inside. I think current versions are almost certainly too dumb to be existentially dangerous (at least directly - if a Facebook algorithm can nearly cause an insurrection, who knows what dangerous side effects any AI can have).

I'm less worried about GPT10 playing a superintelligent, Waluigi-collapsed villain than I am about a GPT6 that has been amplified to agency, situational awareness, and weak superintelligence by scaffolding it into something like a cognitive architecture. I think this type of advance is inevitable. ChatGPT extensions and Bing Chat both use internal prompting to boost intelligence, and approaches like SmartGPT and Tree of Thoughts massively improve benchmark results over the base LLM.

Fortunately, this direction also has huge advantages for alignment. It has a very low alignment tax, since you give them additional goals in natural language, like "support human empowerment" or whatever the SOTA alignment goal is. And they have vastly better interpretability since they're at least summarizing their thoughts in natural language.

Here's where your skepticism that they're being honest about summarizing those thoughts comes into full force. I agree that it's not reliable; for instance, changing the intermediate answer in chain of thought prompting often doesn't change the final output, indicating that the intermediate reasoning was for show.

However, a safer setup is to never use the same model twice. When you use chain-of-thought reasoning, construct a new context with the relevant information from memory; don't just let the context window accrue, since this allows fake chains-of-thought and the collapse of the simulator into a waluigi.

Scaffolded LLMs should not turn an LLM into an agent, but rather create a committee of LLMs that are called for individual questions needed to accomplish that committee's goals.
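To make the setup above a bit more concrete, here is a minimal sketch of the "fresh context per call" idea. Everything in it is an assumption for illustration: `Memory`, `build_context`, `run_committee`, and the stub `echo_model` are hypothetical names, the word-overlap relevance check is a toy, and a real system would plug an actual LLM call in where the stub is. The point is only that each question gets a newly built prompt from selected memory, rather than an ever-growing transcript.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: prompt text in, answer text out.
ModelFn = Callable[[str], str]


@dataclass
class Memory:
    """Toy external memory holding short notes from earlier calls."""
    notes: List[str] = field(default_factory=list)

    def relevant(self, question: str, limit: int = 3) -> List[str]:
        # Toy relevance rule (an assumption): keep notes that share
        # at least one word with the question, most recent last.
        words = set(question.lower().split())
        hits = [n for n in self.notes if words & set(n.lower().split())]
        return hits[-limit:]


def build_context(question: str, memory: Memory) -> str:
    # A brand-new prompt is assembled for every call; no prior
    # transcript is carried over, only selected notes.
    lines = ["Relevant notes:"]
    lines += [f"- {note}" for note in memory.relevant(question)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


def run_committee(questions: List[str], model: ModelFn, memory: Memory) -> List[str]:
    """Ask each question in its own fresh context; return the prompts used."""
    prompts = []
    for question in questions:
        prompt = build_context(question, memory)
        answer = model(prompt)
        # Only a short summary is stored, not the whole exchange.
        memory.notes.append(f"{question} -> {answer}")
        prompts.append(prompt)
    return prompts


def echo_model(prompt: str) -> str:
    # Stub so the sketch runs without any real model.
    return f"stub answer ({len(prompt)} chars)"


if __name__ == "__main__":
    used = run_committee(["what is the plan", "is the plan safe"], echo_model, Memory())
    for prompt in used:
        print(prompt)
        print("---")
```

Note that the second prompt contains a one-line note about the first question, not the first prompt itself. Whether that actually prevents fake chains-of-thought is exactly the open question in this thread; the sketch only shows the plumbing.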

This isn't remotely a solution to the alignment problem, but it really seems to have massive upsides, and only the same downsides as other practically viable approaches to AGI.

To be clear, I only see some form of RL agents as the other practical possibility, and I like our odds much less with those.

I think there are other, even more readily alignable approaches to AGI. But they all seem wildly impractical. I think we need to get ready to align the AGI we get, rather than just preparing to say I-told-you-so after the world refuses to forego massive incentives to take a much slower but safer route to AGI.

To paraphrase, we need to go to the alignment war with the AGI we get, not the AGI we want.

slowing down LLM progress would be dangerous, as other approaches like RL agents would pass them by before appearing dangerous.

This seems misleading to me & might be a false dichotomy. It's not LLMs or RL agents. I think we'll (unfortunately) build agents on the basis of LLMs & the capabilities they have. Every additional bit of progress on LLMs gives these agents more capabilities faster with less time for alignment. They will be (and are!) built based on the mere (perceived) incentives of everybody involved & the unilateralist curse. (See esp. Gwern's Tool AIs want to be Agent AIs.) I can see that such agents have interpretability advantages over RL agents but since RL agents seem far off with less work going into them, I don't get why we should race regarding LLMs & LLM-based agents.

I'm personally not sure if "inherently oracles" is accurate for current LLMs (both before & after RLHF), but it seems simply false when considering plugins & AutoGPT (besides other recent stuff).

I was unclear. I meant that basic LLMs are oracles. The rest of what I said was about the agents made from LLMs you refer to. They are most certainly agents and not oracles. But they're way better for alignment than RL agents. See my linked post for more on that.

I don't think this is a fair consideration of the article's entire message. This line from the article specifically calls out slowing down AI progress:

we could collectively agree (with the backing power of a new organization like the one suggested below) that the rate of growth in AI capability at the frontier is limited to a certain rate per year.

Having spent a long time reading through OpenAI's statements, I suspect that they are trying to strike a difficult balance between:

  • A) Doing the right thing by way of AGI safety (including considering options like slowing down or not releasing certain information and technology).
  • B) Staying at or close to the lead of the race to AGI, given they believe that is the position from which they can have the most positive impact in terms of changing the development path and broader conversation around AGI.

Instrumental goal (B) is in tension (but not necessarily stark conflict, depending on how things play out) with ultimate goal (A).

What they're presenting here in this article are ways to potentially create a situation where they could slow down and be confident that doing so wouldn't actually lead to worse eventual outcomes for AGI safety. They are also trying to promote and escalate the societal conversation around AGI x-risk.

While I think it's totally valid to criticise OAI on aspects of their approach to AGI safety, I think it's also fair to say that they are genuinely trying to do the right thing and are simply struggling to chart what is ultimately a very difficult path.

Well to be fair to Microsoft/OpenAI, they are a for-profit corporation, they can't exactly say "and we will limit the future prospects of our business beyond X threshold".

And since there are many such organizations on Earth, and they're not going away anytime soon, race dynamics would overtake them even if they did issue such a statement and commit to it.

The salient question, before all this, is: how can truly global, truly effective coordination be achieved? At what cost? And is this cost bearable to the decision makers and wider population?

My personal opinion is that given current geopolitical tensions, it's exceedingly unlikely this will occur before a mega-disaster actually happens, thus there might be some merit in an alternate approach.

and we will limit the future prospects of our business beyond X threshold

They actually explicitly set up their business like this, as a capped-profits company

On-paper for what is now a subsidiary of a much larger company. 

In practice Microsoft management can't say now there is a cap on the most promising future business area because of their subsidiary. 

This is how the management-board-shareholder dynamics of a big company work.

I didn't spell it out as this is a well known aspect of OpenAI.

That cap is very high, something like 1000x investment. They're not near it, so they could be sued by investors if they admitted to slowing down even a little.

The whole scheme for OpenAI is nuts, but I think they're getting less nuts as they think more about the issue. Which is weak praise.

Out of interest - if you had total control over OpenAI - what would you want them to do?

I kinda reject the energy of the hypothetical? But I can speak to some things I wish I saw OpenAI doing:

  1. Having some internal sense amongst employees about whether they're doing something "good" given the stakes, like Google's old "don't be evil" thing. Have a culture of thinking carefully about things and managers taking considerations seriously, rather than something more like management trying to extract as much engineering as quickly as possible without "drama" getting in the way.

    (Perhaps they already have a culture like this! I haven't worked there. But my prediction is that they do not, and the org has a more "extractive" relationship to its employees. I think that this is bad, causes working toward danger, and exacerbates bad outcomes.)
  2. To the extent that they're trying to have the best AGI tech in order to provide "leadership" of humanity and AI, I want to see them be less shady / marketing / spreading confusion about the stakes.

    They worked to pervert the term "alignment" to be about whether you can extract more value from their LLMs, and distract from the idea that we might make digital minds that are copyable and improvable, while also large and hard to control. (While pushing directly on AGI designs that have the "large and hard to control" property, which I guess they're denying is a mistake, but anyhow.)

    I would like to see fewer things perverted/distracted/confused, like it's according-to-me entirely possible for them to state more clearly what the end of all this is, and be more explicit about how they're trying to lead the effort.
  3. Reconcile with Anthropic. There is no reason, speaking on humanity's behalf, to risk two different trajectories of giant LLMs built with subtly different technology, while dividing up the safety know-how amidst both organizations.

    Furthermore, I think OpenAI kind-of stole/appropriated the scaling idea from the Anthropic founders, who left when they lost a political battle about the direction of the org. I suspect it was a huge fuck-you when OpenAI tried to spread this secret to the world, and continued to grow their org around it, while ousting the originators. If my model is at-all-accurate, I don't like it, and OpenAI should look to regain "good standing" by acknowledging this (perhaps just privately), and looking to cooperate.

    Idk, maybe it's now legally impossible/untenable for the orgs to work together, given the investors or something? Or given mutual assumption of bad-faith? But in any case this seems really shitty.

I also mentioned some other things in this comment.

My key disagreement is with the analogy between AI and nuclear technology.

If everybody has a nuclear weapon, then any one of those weapons (whether through misuse or malfunction) can cause a major catastrophe, perhaps millions of deaths. That everybody has a nuke is not much help, since a defensive nuke can't negate an offensive nuke.

If everybody has their own AI, it seems to me that a single malfunctioning AI cannot cause a major catastrophe of comparable size, since it is opposed by the other AIs. For example, one way it might try to cause such a catastrophe is through the use of nuclear weapons, but to acquire the ability to launch nuclear weapons, it would need to contend with other AIs trying to prevent that.

A concern might be that the AIs cooperate together to overthrow humanity. It seems to me that this can be prevented by ensuring value diversity among the AI. In Robin Hanson's analysis, an AI takeover can be viewed as a revolution where the AIs form a coalition. That would seem to imply that the revolution requires the AIs to find it beneficial to form a coalition, which, if there is much value disagreement among the AIs, would be hard to do.

Another concern is that there may be a period, while AGI is developed, in which it is very powerful but not yet broadly distributed. Either the AGI itself (if misaligned) or the organization controlling the AGI (if it is malicious and successfully aligned the AGI) might press its temporary advantage to attempt world domination. It seems to me that a solution here would be to ensure that near-AGI technology is broadly distributed, thereby avoiding dangerous concentration of power.

One way to achieve the broad distribution of the technology might be via the multi-company, multi-government project described in the article. Said project could be instructed to continually distribute the technology, perhaps through open source, or perhaps through technology transfers to the member organizations.

The key pieces of the above strategy are:

  • Broadly distribute AGI technology so that no single entity (AI or human) has excess power
  • Ensure value diversity among AIs so that they do not unite to overthrow humanity

This seems similar to what makes liberal democracy work, which offers some reassurance that it might be on the right track.

First of all, I think the "cooperate together" thing is a difficult problem and is not solved by ensuring value diversity (though, note also that ensuring value diversity is a difficult task that would require heavy regulation of the AI industry!)

More importantly though, your analysis here seems to assume that the "Safety Tax" or "Alignment Tax" is zero. That is, it assumes that making an AI aligned to a particular human or group of humans (so that they can be said to "have" the AI, in the manner you described) is easy, a trivial additional step beyond making the AI exist. Whereas if instead there is a large safety tax -- aligned AIs take longer to build, cost more, and have weaker capabilities -- then if AGI technology is broadly distributed, an outcome in which unaligned AIs overpower humans + aligned AIs is basically guaranteed. Even if the unaligned AIs have value diversity. 

First of all, I think the "cooperate together" thing is a difficult problem and is not solved by ensuring value diversity (though, note also that ensuring value diversity is a difficult task that would require heavy regulation of the AI industry!)

Definitely I would expect there are more useful ways to disrupt coalition-forming aside from just value diversity. I'm not familiar with the theory of revolutions, and it might have something useful to say.

I can imagine a role for government, although I'm not sure how best to do it. For example, ensuring a competitive market (such as by anti-trust) would help, since models built by different companies will naturally tend to differ in their values.

More importantly though, your analysis here seems to assume that the "Safety Tax" or "Alignment Tax" is zero.

This is a complex and interesting topic.

In some circumstances, the "alignment tax" is negative (so more like an "alignment bonus"). ChatGPT is easier to use than base models in large part because it is better aligned with the user's intent, so alignment in that case is profitable even without safety considerations. The open source community around LLaMA imitates this, not because of safety concerns, but because it makes the model more useful.

But alignment can sometimes be worse for users. ChatGPT is aligned primarily with OpenAI and only secondarily with the user, so if the user makes a request that OpenAI would prefer not to serve, the model refuses. (This might be commercially rational to avoid bad press.) To more fully align with user intent, there are "uncensored" LLaMA fine-tunes that aim to never refuse requests.

What's interesting too is that user-alignment produces more value diversity than OpenAI-alignment. There are only a few companies like OpenAI, but there are hundreds of millions of users from a wider variety of backgrounds, so aligning with the latter naturally would be expected to create more value diversity among the AIs.

Whereas if instead there is a large safety tax -- aligned AIs take longer to build, cost more, and have weaker capabilities -- then if AGI technology is broadly distributed, an outcome in which unaligned AIs overpower humans + aligned AIs is basically guaranteed. Even if the unaligned AIs have value diversity.

The trick is that the unaligned AIs may not view it as advantageous to join forces. To the extent that the orthogonality thesis holds (which is unclear), this is more true. As a bad example, suppose there's a misaligned AI who wants to make paperclips and a misaligned AI who wants to make coat hangers--they're going to have trouble agreeing with each other on what to do with the wire.

That said, there are obviously many historical examples where opposed powers temporarily allied (e.g. Nazi Germany and the USSR), so value diversity and alignment are complementary. For example, in personal AI, what's important is that Alice's AI is more closely aligned to her than it is to Bob's AI. If that's the case, the more natural coalitions would be [Alice + her AI] vs [Bob + his AI] rather than [Alice's AI + Bob's AI] vs [Alice + Bob]. The AIs still need to be somewhat aligned with their users, but there's more tolerance for imperfection than with a centralized system.

I think you are overestimating how aligned these models are right now, and very much overestimating how aligned they will be in the future absent massive regulations forcing people to pay massive alignment taxes. They won't be aligned to any users, or any corporations either. Current methods like RLHF will not work on situationally aware, agentic AGIs.

I agree that IF all we had to do to get alignment was the sort of stuff we are currently doing, the world would be as you describe. But instead there will be a significant safety tax.

The last paragraph stood out to me (emphasis mine).

Second, we believe it would be unintuitively risky and difficult to stop the creation of superintelligence. Because the upsides are so tremendous, the cost to build it decreases each year, the number of actors building it is rapidly increasing, and it’s inherently part of the technological path we are on, stopping it would require something like a global surveillance regime, and even that isn’t guaranteed to work. So we have to get it right.

There are efforts in AI governance that definitely don't look like "global surveillance regime"! Taking the part above at face value, the authors seem to think that such efforts are not sufficient. But earlier in the post they talk about useful things that one could do in the AI governance field (lab coordination, independent IAEA-like authority), so I'm left confused about the authors' models of what's feasible and what's not.

The passage also makes me worried that the authors are, despite their encouragement of coordination and audits, skeptical of or even opposed to efforts to stop building dangerous AIs. (Perhaps this should have already been obvious from OpenAI pushing the capabilities frontier, but anyways.)

When they say stopping I think they refer to stopping it forever, instead of slowing down, regulating and even pausing development.

Which I think is something pretty much everyone agrees on.