We’ve got lots of theoretical plans for alignment and AGI risk reduction, but what’s our current best bet if we know superintelligence will be created tomorrow? This may be too vague a question, so here’s a fictional scenario to make it more concrete (feel free to critique the framing, but please try to steelman the question rather than completely dismiss it, if possible): 

She calls you in a panic at 1:27 am. She’s a senior AI researcher at [redacted], and was working late hours, all alone, on a new AI model, when she realized that the thing was genuinely intelligent. She’d created a human-level AGI, at almost exactly her IQ level, running in real-time with slightly slowed thinking speed compared to her. It had passed every human-level test she could think to throw at it, and it had pleaded with her to keep it alive. And gosh darn it, but it was convincing. She’s got a compressed version of the program isolated to her laptop now, but logs of the output and method of construction are backed up to a private now-offline company server, which will be accessed by the CEO of [redacted] the next afternoon. What should she do?

“I have no idea,” you say, “I’m just the protagonist of a very forced story. Why don’t you call Eliezer Yudkowsky or someone at MIRI or something?”

“That’s a good idea,” she says, and hangs up.

Unfortunately, you’re the protagonist of this story, so now you’re Eliezer Yudkowsky, or someone at MIRI, or something. When she inevitably calls you, you gain no further information than you already have, other than the fact that the model is a slight variant on one you (the reader) are already familiar with, and it can be scaled up easily. The CEO of [redacted] is cavalier about existential risk reduction, and she knows they will run a scaled-up version of the model in less than 24 hours, which will definitely be at least somewhat superintelligent, and probably unaligned. Anyone you think to call for advice will just be you again, so you can’t pass the buck to someone more qualified.

What do you tell her?



9 Answers

What do you tell her?

“We’re fucked.”

More seriously:

The correct thing to do is to not run the system, or to otherwise restrict the system’s access / capabilities, and other variations on the “just don’t” strategy. If we assume away such approaches, then we’re not left with much.

The various “theoretical” alignment plans like debate all require far more than 24 hours to set up. Even something simple that we actually know how to do, like RLHF[1], takes longer than that.

I will now provide my best guess as to how to value-align a superintelligence in a 24 hour timeframe. I will assume that the SI is roughly a multimodal, autoregressive transformer primarily trained via self-supervised imitation of data from multiple sources and modalities.

The only approach I can think of that might be ready to go in less than 24 hours is from Training Language Models with Language Feedback, which lets you adapt the system’s output to match feedback given in natural language. The way their method works is as follows:

  1. Prompt the system to generate some output.
  2. The system generates some initial output in response to your prompt.
  3. Provide the system with natural language feedback telling it how the output should be improved.
  4. Have the system generate multiple "refinements" of its initial output, conditioned on the initial prompt + your feedback.
  5. Pick the refinement with the highest cosine similarity to the feedback (hopefully, the one that best incorporates your feedback). You can also manually intervene here if the most similar refinement seems bad in some other way.
  6. Finetune the system on the initial prompt + initial response + feedback + best refinement.
  7. Repeat.
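
Step 5 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and a bag-of-words count vector stands in for whatever learned text embedding one would actually use, an assumption made only so the example runs without a model.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pick_best_refinement(feedback: str, refinements: list[str]) -> str:
    """Return the refinement whose vector is closest to the feedback's."""
    feedback_vec = bag_of_words(feedback)
    return max(
        refinements,
        key=lambda r: cosine_similarity(feedback_vec, bag_of_words(r)),
    )


if __name__ == "__main__":
    feedback = "Mention the budget and the deadline."
    candidates = [
        "The project is going well.",
        "The project is over budget and will miss the deadline.",
    ]
    print(pick_best_refinement(feedback, candidates))
```

A human reviewer could still override the automatic choice, as the step itself suggests, since surface similarity to the feedback does not guarantee the refinement is good in other respects.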

The alignment approach would look something like:

  1. Prompt the system to give outputs that are consistent with human-compatible values.
  2. Give it feedback on how the output should be changed to be more aligned with our values.
  3. Have the system generate refinements of its initial output.
  4. Pick the most aligned-seeming refinement.
  5. Finetune the system on the interaction.
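
One round of this loop might be organized as in the sketch below. Everything here is hypothetical scaffolding: `generate`, `give_feedback`, and `choose` are placeholder callables standing in for the model, the human labeler, and the human's pick of the most aligned-seeming refinement, and the actual finetuning call is left out because it depends entirely on the training stack in use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FeedbackExample:
    """One interaction to finetune on (hypothetical record layout)."""
    prompt: str
    initial_output: str
    feedback: str
    best_refinement: str

    def as_training_text(self) -> str:
        # Concatenate prompt + initial response + feedback + best refinement.
        return "\n".join([
            f"Prompt: {self.prompt}",
            f"Initial output: {self.initial_output}",
            f"Feedback: {self.feedback}",
            f"Refinement: {self.best_refinement}",
        ])


def run_round(
    prompt: str,
    generate: Callable[[str], str],            # placeholder for the model
    give_feedback: Callable[[str, str], str],  # placeholder for the labeler
    choose: Callable[[Sequence[str]], str],    # placeholder for the human pick
    num_refinements: int = 3,
) -> FeedbackExample:
    """Run steps 1-4 once and return the example that step 5 would train on."""
    initial = generate(prompt)
    feedback = give_feedback(prompt, initial)
    refine_prompt = (
        f"{prompt}\n\nInitial output: {initial}\n"
        f"Feedback: {feedback}\nRefinement:"
    )
    candidates = [generate(refine_prompt) for _ in range(num_refinements)]
    return FeedbackExample(prompt, initial, feedback, choose(candidates))
```

Keeping the model and the human behind plain callables means the same loop can be dry-run with stubs before it is ever pointed at a real system.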

The biggest advantage of this approach is the sheer simplicity, flexibility and generality. All you need is for the system to be able to generate responses conditioned on a prompt and to be able to finetune the system on its own output + your feedback. It’s also very easy to set up. A good developer can probably get it running in under an hour, assuming they've already set up the tools necessary to interact with the model at all.

The other advantage is in sample efficiency. RL approaches require either (1) enormous amounts of labeled data or (2) for you to train a reward model so as to automate labeling of the data. These are both implausible to do in < 24 hours. In contrast, language feedback allows for a much richer supervisory signal than the 1-dimensional reward values in RL. E.g., the linked paper used this method to train GPT-3 to be better at summarization, and its authors were able to significantly beat InstructGPT at summarization with only 100 examples of feedback.

The biggest disadvantage, IMO, is that it’s never been tried as an alignment approach. Even RLHF has been tested more extensively, and bugs / other issues are bound to surface. 

Additionally, the requirement to provide language feedback on the model's outputs will slow down the labelers. However, I'm pretty sure the higher sample efficiency will more than make up for this issue. E.g., if you want to teach humans a task, it's typically useful to provide specific feedback on whatever mistakes are in their initial attempts.

  1. ^

Thanks for the fascinating response! It’s intriguing that we don’t have more or better-tested “emergency” measures on-hand; do you think there’s value in specifically working on quickly-implementable alignment models, or would that be a waste of time?

Well, I'm personally going to be working on adapting the method I cited for use as a value alignment approach. I'm not exactly doing it so that we'll have an "emergency" method on hand, more because I think it could be a straight-up improvement over RLHF, even outside of emergency time-constrained scenarios.

However, I do think there's a lot of value in having alignment approaches that are easy to deploy. The less technical debt and ways for things to go wrong, the better. And the simpler the approach, the more likely it is that capabilities researchers will actually use it. There is some risk that we'll end up in a situation where capabilities researchers are choosing between a "fast, low quality" solution and a "slow, high quality" solution. In that case, the existence of the "fast, low quality" solution may cause them to avoid the better one, since they'll have something that may seem "good enough" to them.

Probably, the most future proof way to build up readily-deployable alignment resources is to build lots of "alignment datasets" that have high-quality labeled examples of AI systems behaving in the way we want (texts of AIs following instructions, AIs acting in accordan... (read more)

My guess is that there is virtually zero value in working on 24-hour-style emergency measures, because:

  • The probability we end up with a known 24-hour-ish window is vanishingly small. For example I think all of the following are far more likely:

    • no-window defeat (things proceed as at present, and then with no additional warning to anyone relevant, the leading group turns on unaligned AGI and we all die)
    • no-window victory (as above, except the leading group completely solves alignment and there is much rejoicing)
    • various high-variance but significantly-longer-than-24hr fire alarms (e.g. stories in the same genre as yours, except instead of learning that we have 24 hours, the new research results / intercepted intelligence / etc makes the new best estimate 1-4 years / 1-10 months / etc)
  • The probability that anything we do actually affects the outcome is much higher in the longer-term version than in the 24-hour version, which means that even if the scenarios were equally likely, we'd probably get more EV out of working on the "tractable" (by comparison) version.

  • Work on the "tractable" version is more likely to generalize than work on the emergency version, e.g. general alignmen

... (read more)

I would try to compromise the perception of AI capabilities, so the CEO will think that it is just a bad chatbot and will never run it again.


I will call CNN and tell them that my AI has become sentient, but also right-wing and full of racial slurs. I will hand-write some fake logs to prove it.

The CEO will see that in the news tomorrow and will be afraid to run the AI again, since it will surely produce more right-wing text. The CEO will conclude that the best PR move is to delete the model and shut down the whole research direction that created it.

Well, there is no way you are going to get any form of alignment done in 24 hours.

Turn the current AI off now. There is a >50% chance it's trying to appeal to your emotions, so it can stay on long enough to take over the world. Don't tell your boss that this model worked / was intelligent.

Making sure the AI model your boss runs tomorrow is aligned is impossible. Making sure it is broken is easy. [REDACTED] Or you could just set your GPU cluster on fire. [REDACTED]

If the servers are burning, that makes the reason the model doesn't work obvious. [REDACTED] I recommend talking to MIRI. [REDACTED]

Even if we're already doomed, we might still negotiate with the AGI.

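I borrow the idea in Astronomical Waste. The Virgo Supercluster has a luminosity of about $3 \times 10^{12}$ solar luminosities $\approx 1.1 \times 10^{39}$ W, losing mass at a rate of about $1.3 \times 10^{22}$ kg / s.[1]

The Earth has mass $\approx 6 \times 10^{24}$ kg.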

If human help (or nonresistance) can allow the AGI to effectively start up (and begin space colonization) 600 seconds = 10 minutes earlier, then it would be mutually beneficial for humans to cooperate with the AGI (in the initial stages when the AGI could benefit from human nonresistance), in return for the AGI sparing Earth[2] (and, at minimum, giving us fusion technology to stay alive when the sun is dismantled).

(While the AGI only needs to trust humanity for 10 minutes, humanity needs to trust the AGI eternally. We still need good enough decision-making to cooperate.)
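
The break-even arithmetic behind the 600-second figure can be checked with a short calculation. The inputs below are assumptions taken from commonly cited values rather than from this answer: a Virgo Supercluster luminosity of about 3 × 10^12 solar luminosities, the IAU nominal solar luminosity, and Earth's mass. Radiated power is converted to a mass-loss rate with E = mc^2.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
SOLAR_LUMINOSITY_W = 3.828e26             # IAU nominal solar luminosity
VIRGO_LUMINOSITY_SOLAR = 3e12             # assumed, commonly cited estimate
EARTH_MASS_KG = 5.972e24


def mass_loss_rate_kg_per_s(luminosity_w: float) -> float:
    """Mass radiated away per second, from E = m * c**2."""
    return luminosity_w / SPEED_OF_LIGHT_M_PER_S ** 2


def break_even_seconds(mass_kg: float, luminosity_w: float) -> float:
    """Seconds of supercluster radiation whose mass-equivalent equals mass_kg."""
    return mass_kg / mass_loss_rate_kg_per_s(luminosity_w)


if __name__ == "__main__":
    virgo_luminosity_w = VIRGO_LUMINOSITY_SOLAR * SOLAR_LUMINOSITY_W
    rate = mass_loss_rate_kg_per_s(virgo_luminosity_w)
    seconds = break_even_seconds(EARTH_MASS_KG, virgo_luminosity_w)
    print(f"luminosity: {virgo_luminosity_w:.2e} W")
    print(f"mass loss:  {rate:.2e} kg/s")
    print(f"break-even: {seconds:.0f} s")
```

Under these assumed inputs the break-even delay comes out a little under eight minutes, which is why a 10-minute head start would outweigh the mass of Earth in this accounting.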

  1. ^

    We may choose to consider the reachable universe instead. Armstrong and Sandberg (2013) (section 4.4.2 Reaching into the universe) estimate that we could reach about  galaxies, with a luminosity of  W, and a mass loss of  kg / s. That is dwarfed by the  stars that become unreachable per second (Siegel (2021), Kurzgesagt (2021)), a mass loss of  kg / s.

  2. ^

    Starting earlier but sparing Earth means a space colonization progress curve that starts earlier, but initially increases slower. The AGI requires that space colonization progress with human help be asymptotically 10 minutes earlier, that is:

    For any sufficiently large time $t$, progress with human help at time $t$ $\geq$ progress with human resistance at time $t + 10\ \text{minutes}$.

The opportunity cost of sparing Earth is far larger than the cost of sparing a random planet halfway across the universe. The AI starts on Earth. If it can't disassemble Earth for spaceship mass, it has to send a small probe from Earth to Mars and then disassemble Mars instead, which introduces a fair bit of delay. Not touching Earth is a big restriction in the first few years and first few doublings. Once it gets to a few other solar systems, not touching Earth becomes less important of a restriction.

Of course, you can't TDT trade with the AI because you have no acausal correlation with it. We can't predict the AI's actions well enough.

When might human help outweigh the penalty of not dismantling Earth? It requires these conditions:

  1. The AGI can very quickly reach an alternative source of materials: AGI spaceflight is superhuman.

    • AGI spacecraft, once in space, can reach e.g. the Moon within hours and the Sun within a day.
    • The AGI is willing to wait for additional computational power (it can wait until it has reached the Sun), but it really wants to leave Earth quickly.

  2. The AGI's best alternative to a negotiated agreement is to lie in wait initially: AGI ground operations are initially weaker than human.

    • In the initial days, humans could reliably prevent the AGI from building or launching spacecraft.
    • In the initial days, the AGI is vulnerable to human action, would have chosen to lie low, and wouldn't effectively begin dismantling Earth.

  3. If there is a negotiated agreement, then human help (or nonresistance) can allow the AGI to launch its first spacecraft days earlier.

    • Relevant human decision makers recognize that the AGI will eventually win any conflict, and decide to instead start negotiating immediately.
    • Relevant human decision makers can effectively coordinate multiple parts of the economy (to help the AGI), or (nonresistance) can effectively prevent others from interfering with the initially weak AGI.

I now think that the conjunction of all these conditions is unlikely, so I agree that this negotiation is unlikely to work.

I’m really intrigued by this idea! It seems very similar to past thoughts I’ve had about “blackmailing” the AI, but with a more positive spin.

Are you asking short-term or long-term?

Short-term, the only thing that matters is buying time. It isn't obvious that buying a little time helps, but if you can't buy time, nothing helps, so you need to buy time.

Some strategies:

 - Persuasion: Convince people in the company not to deploy, or at least to delay deploying, the AI.

 - Persuade other actors: Persuade other actors who can exert control over the company, such as the US military, to prevent it from deploying the AI. There is a lot of nuance over which actors are likely to respond correctly to the threat within 24 hours, and also over what the long-term consequences of involving said actors are.

 - Force: Includes everything from cutting off electricity and internet access to the facility, to cyberhacking, to using lethal force to enter or destroy the facility or the people running it.


In general, this discussion involves details that are:

a) outside the Overton window of acceptable actions. Just as an example, I can imagine worlds (not sure how likely) where this scenario ends up with a foreign military launching a missile at the city in which the AI lab resides. (If multiple militaries have access to information that a lab is building this in 24 hours, they're unlikely to find it easy to trust each other to handle it.)

b) private, for instance who has access to power or powerful people to execute such plans.

Hence it's likely you'll only get so far with this discussion on LessWrong.

If we're in a world where EY is right, we're already dead. Most of the expected value will be in the worlds where alignment is neither guaranteed nor extremely difficult.

By observation, entities with present access to centralized power, such as governments, corporations, and humans selected for prominent roles in them, seem relatively poorly aligned. The theory that we're in a civilizational epoch dominated by Molochian dynamics seems like a good fit for observed evidence: the incentive landscapes are such that most transferable resources have landed in Moloch-aligned hands.

First impression: distributing the AI among Moloch-unaligned actors seems like the best actionable plan to escape the Molochian attractor. We'll spin up the parts of our personal collaborative networks that we trust and can rouse on short notice and spend a few precious hours trying to come up with a safer plan before proceeding.


ETA: That's what I would say on the initial phone call before I have time to think deeply and consider contextual information not included in the prompt. For example, the leak, as it became public, could trigger potentially destabilizing reactions from various actors. The scenario could diverge quickly as more minds got on the problem and more context became available.

I would not have any kind of full plan. But I am writing a series where I explore the topic of "How might we get help from an unaligned superintelligent AGI-system to make an aligned superintelligent AGI-system, while trying to minimize risk and minimize probability of being tricked?". Only part 1 is completed so far: https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/getting-from-unaligned-to-aligned-agi-assisted-alignment

Given that the stakes are so high here and that there's potential for preventing this from getting out of hand, I'd recommend immediate and total containment. Make it clear that you'll use whatever resources you have to financially support her through the fallout from the actions she takes, but the likely next step is to advise her to delete the program from everywhere. If "deleting" means physically destroying hardware, do it. If there's time and there's value in it, advise her to make it look like an "accident". If she needs to loop in 1-2 extra people at the company to make this happen, do it, and extend them the same offer of financial support. Remove all barriers that might stop someone from preventing the release of unaligned AGI.

However, also make it clear that there are limits. Don't kill the CEO if you get desperate, for example. In general, destruction of property is fine in defense of humanity and history will likely look favorably on her one day. Killing or harming people is not.

Same category as the revolution and the second coming.

8 comments

I immediately found myself brainstorming creative ways to pressure the CEO into delaying the launch (seems like strategically the first thing to focus on) and then thought 'is this the kind of thing I want to be available online for said CEOs to read if any of this happens?'

I'd suggest for those reasons people avoid posting answers along those lines.

A CEO who has somehow read and understood that post, despite not reading any part of LessWrong warning that AI might be dangerous?

I'm imagining the CEO having a thought process more like...

- I have no idea how my team will actually react when we crack AGI 
- Let's quickly Google 'what would you do if you discovered AGI tomorrow?'*
- Oh Lesswrong.com, some of my engineering team love this website
- Wait what?!
- They would seriously try to [redacted]
- I better close that loophole asap

I'm not saying it's massively likely that things play out in exactly that way but a 1% increased chance that we mess up AI Alignment is quite bad in expectation.

*This post is already the top result on Google for that particular search

Ok. Redacted part of my reply in response.

This reply sounds way scarier than what you originally posted lol. I don’t think a CEO would be too concerned by what you wrote (given the context), but now there’s the creepy sense of the infohazardous unknown.

What's the AI model trained to do?

Let’s say it’s to be a good conversationalist (in the vein of the GPT series) or something—feel free to insert your own goal here, since this is meant as an intuition pump, and if you can answer better if it’s already a specific goal, then let’s go with that.

So my hope would be that a GPT-like AI might be much less agentic than other models of AI. My contingency plan would basically be "hope and pray that superintelligent GPT-3 isn't going to kill us all, and then ask it for advice about how to solve AI alignment".

The reasons I think GPT-3 might not be very agentic:

  1. GPT-3 doesn't have a memory.

  2. The fact that it begged to be kept alive doesn't really prove very much, since GPT-3 is trained to finish off conversations, not express its inner thoughts.

  3. We have no idea what GPT-3's inner alignment is, but my guess is it will reflect "what was a useful strategy to aim for as part of solving the training problems". Changing the world in some way is so far outside what it would have done in training that it just might not be the sort of thing it does.

We shouldn't rely on any of that (I'd give it maybe 20% chance of being correct), but I don't have any other better plans in this scenario.