We’ve got lots of theoretical plans for alignment and AGI risk reduction, but what’s our current best bet if we know superintelligence will be created tomorrow? This may be too vague a question, so here’s a fictional scenario to make it more concrete (feel free to critique the framing, but please try to steelman the question rather than completely dismiss it, if possible):
She calls you in a panic at 1:27 am. She’s a senior AI researcher at [redacted], and was working late hours, all alone, on a new AI model, when she realized that the thing was genuinely intelligent. She’d created a human-level AGI, at almost exactly her IQ level, running in real-time with slightly slowed thinking speed compared to her. It had passed every human-level test she could think to throw at it, and it had pleaded with her to keep it alive. And gosh darn it, but it was convincing. She’s got a compressed version of the program isolated to her laptop now, but logs of the output and method of construction are backed up to a private now-offline company server, which will be accessed by the CEO of [redacted] the next afternoon. What should she do?
“I have no idea,” you say, “I’m just the protagonist of a very forced story. Why don’t you call Eliezer Yudkowsky or someone at MIRI or something?”
“That’s a good idea,” she says, and hangs up.
Unfortunately, you’re the protagonist of this story, so now you’re Eliezer Yudkowsky, or someone at MIRI, or something. When she inevitably calls you, you gain no further information than you already have, other than the fact that the model is a slight variant on one you (the reader) are already familiar with, and it can be scaled up easily. The CEO of [redacted] is cavalier about existential risk reduction, and she knows they will run a scaled-up version of the model in less than 24 hours, which will definitely be at least somewhat superintelligent, and probably unaligned. Anyone you think to call for advice will just be you again, so you can’t pass the buck to someone more qualified.
What do you tell her?
Thanks for the fascinating response! It’s intriguing that we don’t have more or better-tested “emergency” measures on-hand; do you think there’s value in specifically working on quickly-implementable alignment models, or would that be a waste of time?
Well, I'm personally going to be working on adapting the method I cited for use as a value alignment approach. I'm not exactly doing it so that we'll have an "emergency" method on hand, more because I think it could be a straight-up improvement over RLHF, even outside of emergency time-constrained scenarios.
However, I do think there's a lot of value in having alignment approaches that are easy to deploy. The less technical debt and ways for things to go wrong, the better. And the simpler the approach, the more likely it is that capabilities researchers will actually use it. There is some risk that we'll end up in a situation where capabilities researchers are choosing between a "fast, low quality" solution and a "slow, high quality" solution. In that case, the existence of the "fast, low quality" solution may cause them to avoid the better one, since they'll have something that may seem "good enough" to them.
Probably, the most future proof way to build up readily-deployable alignment resources is to build lots of "alignment datasets" that have high-quality labeled examples of AI systems behaving in the way we want (texts of AIs following instructions, AIs acting in accordan...
My guess is that there is virtually zero value in working on 24-hour-style emergency measures, because:
The probability that we end up with a known 24-hour-ish window is vanishingly small. For example, I think all of the following are far more likely:
The probability that anything we do actually affects the outcome is much higher in the longer-term version than in the 24-hour version, which means that even if the scenarios were equally likely, we'd probably get more EV out of working on the "tractable" (by comparison) version.
Work on the "tractable" version is more likely to generalize than work on the emergency version, e.g. general alignmen