Clutching a bottle of whiskey in one hand and a shotgun in the other, John scoured the research literature for ideas... He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks-and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs.
- James Mickens, The Slow Winter
There are a lot of AI alignment strategies that can reasonably be described as “ask Godzilla to prevent Mega-Godzilla from terrorizing Japan”. Use one AI to oversee another AI. Have two AIs debate each other. Use one maybe-somewhat-aligned AI to help design another. Etc.
Alignment researchers discuss various failure modes of asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. Maybe one of the two ends up much more powerful than the other. Maybe the two make an acausal agreement. Maybe the Nash Equilibrium between Godzilla and Mega-Godzilla just isn’t very good for humans in the first place. Etc. These failure modes are useful for guiding technical research.
… but I worry that talking about the known failure modes misleads people about the strategic viability of Godzilla strategies. It makes people think (whether consciously/intentionally or not) “well, if we could handle these particular failure modes, maybe asking Godzilla to prevent Mega-Godzilla from terrorizing Japan would work”.
What I like about the Godzilla analogy is that it gives a strategic intuition which much better matches the real world. When someone claims that their elaborate clever scheme will allow us to safely summon Godzilla in order to fight Mega-Godzilla, the intuitively-obviously-correct response is “THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO”.
“But look!” says the clever researcher, “My clever scheme handles problems X, Y and Z!”
Response:
[image: Godzilla rampaging through Tokyo]
“Ok, but what if we had a really good implementation?” asks the clever researcher.
Response:
[image: Godzilla rampaging through Tokyo]
“Oh come on!” says the clever researcher, “You’re not even taking this seriously! At least say something about how it would fail.”
Don’t worry, we’re going to get to that. But before we do: let’s imagine you’re the Mayor of Tokyo evaluating a proposal to ask Godzilla to fight Mega-Godzilla. Your clever researchers have given you a whole lengthy explanation about how their elaborate and clever safeguards will ensure that this plan does not destroy Tokyo. You are unable to think of any potential problems which they did not address. Should you conclude that asking Godzilla to fight Mega-Godzilla will not result in Tokyo’s destruction?
No. Obviously not. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. You may not be able to articulate why the answer is obviously “no”, but asking Godzilla to fight Mega-Godzilla will still obviously destroy Tokyo, and your intuitions are right about that even if you are unable to articulate clever arguments.
With that said, let’s talk about why those intuitions are right and why the Godzilla analogy works well.
Brittle Plans and Unknown Unknowns
The basic problem with Godzilla plans is that they’re brittle. The moment anything goes wrong, the plan shatters, and then you’ve got somewhere between one and two giant monsters rampaging around downtown.
And of course, it is a fundamental Law of the universe that nothing ever goes exactly according to plan. Especially when trying to pit two giant monsters against each other. This is the sort of situation where there will definitely be unknown unknowns.
Unknown unknowns + brittle plan = definitely not rising property values in Tokyo.
Do we know what specifically will go wrong? No. Will something go wrong? Very confident yes. And brittleness means that whatever goes wrong goes very wrong. Errors are not recoverable when asking Godzilla to fight Mega-Godzilla.
If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it. If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error; it’s the AIs themselves which we depend on to notice problems. If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error; if we had better ways to detect misalignment in the child we’d already have used them on the parent.
The real world will always throw some unexpected problems at our plans. When asking Godzilla to fight Mega-Godzilla, those problems are not recoverable. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.
Meta note: I expect this post to have a lively comment section! Before you leave the twentieth comment saying that maybe Godzilla fighting Mega-Godzilla is better than Mega-Godzilla rampaging unchallenged, maybe check whether somebody else has already written that one, so I don't need to write the same response twenty times. (But definitely do leave that comment if you're the first one, I intentionally kept this essay short on the assumption that lots of discussion would be in the comments.)
I happen to work for a company whose software uses checksums at many layers, and RAID encoding and low-density parity codes at the lowest layers, to detect and recover from hardware failures. It works pretty well, and the company has sold billions of dollars of products of which that is a key component. Also, many (most?) enterprise servers use RAM with error-correcting codes; I think the common configuration allows it to correct single-bit errors and detect double-bit errors, and my company's machines will reset themselves when they detect double-bit errors and other problems that impugn the integrity of their runtime state.
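For concreteness, here's a minimal sketch of the detect-and-recover pattern those layers implement, using Python's zlib.crc32 as a stand-in checksum (real storage stacks use stronger codes and apply this at several layers):

```python
import zlib

def write_block(data: bytes) -> tuple[bytes, int]:
    """Store the data together with a checksum of its contents."""
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_crc: int) -> bytes:
    """Verify the checksum on read; a mismatch means some hardware layer
    corrupted the block, so rebuild it from redundancy (RAID/parity)
    rather than returning bad data."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("checksum mismatch: rebuild block from parity")
    return data

payload, crc = write_block(b"hello, disk")
assert read_block(payload, crc) == b"hello, disk"
```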
One could quibble about whether "retrieving and querying the data that was written" counts as a "computation", and the extent to which the recovery is achieved through software as opposed to hardware[1], but the source material is a James Mickens comedic rant in any case.
I'd say the important point here is: There is a science to error correction, to building a (more) perfect machine out of imperfect parts, where the solution to unreliable hardware is more of the unreliable hardware, linked up in a clever scheme. They're good enough at it that each successive generation of data storage technology uses hardware with higher error rates. You can make statements like "If failures are uncorrelated, and failures happen every X time units on average per component, and it takes Y time units to detect and recover a failure, and we can recover from up to M failures out of every group of N components, then on average we will have an unrecoverable failure every Z time units"; then you can (a) think about how to arrange it so that Z >> X and (b) think about the dangers of correlated failures.
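To make that concrete, here's a back-of-the-envelope Python sketch of that statement. It assumes uncorrelated failures and Y << X (so at any instant a component is down with probability roughly Y/X); the function name and example numbers are illustrative, not taken from any real product:

```python
from math import comb

def mean_time_to_unrecoverable(X: float, Y: float, N: int, M: int) -> float:
    """Estimate Z: the mean time between unrecoverable failures for a group
    of N components, where each component fails every X time units on
    average, a failure takes Y time units to detect and recover, and up to
    M simultaneous failures per group are recoverable.

    Assumes failures are uncorrelated and Y << X, so at any instant a
    given component is down with probability roughly Y / X."""
    p_down = Y / X
    # A fresh failure is unrecoverable if at least M of the other N-1
    # components are already down (binomial tail).
    p_overwhelmed = sum(
        comb(N - 1, k) * p_down**k * (1 - p_down) ** (N - 1 - k)
        for k in range(M, N)
    )
    failure_arrival_rate = N / X  # failures per time unit, group-wide
    return 1 / (failure_arrival_rate * p_overwhelmed)

# Example: components failing every ~10,000 hours, 1-hour recoveries,
# groups of 10 tolerating 2 concurrent failures.
Z = mean_time_to_unrecoverable(X=10_000, Y=1, N=10, M=2)
print(f"Z ~ {Z:.2e} hours")
```

With those toy numbers Z comes out around 3e9 hours, vastly larger than X = 10,000 hours, which is goal (a); correlated failures (danger (b)) would shrink Z dramatically, since many components go down at once.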
(The valid complaint that Mickens's character makes is that it would suck if every application needed to weave error correction into every codepath, implement its own RAID, etc. It works much better if the error correction is done by some underlying layer that the application treats as an abstraction—using the abstraction tends to be more complex than pretending errors don't exist (and for noncritical applications the latter is a valid strategy), but not terrible.[2])
With regard to AI: it seems likely that we'll end up making use of potentially-dangerous AIs to do things. If we do, then we'd want powerful safeguards. It seems unlikely that we'd have 100% confidence in any particular safeguard (i.e. it's unlikely we'd have formal proofs of the safety properties we'd want). Then we'd want to keep adding more safeguards as long as their failure modes weren't totally covered by those of preexisting safeguards (and as long as they're affordable); ideally we'd try to estimate (or better, measure) the chance of failure of each safeguard and how independent the safeguards are from one another.
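As a toy illustration of why independence matters (the numbers are made up): if safeguard failures really were independent, the chance of all of them failing at once would just be the product of the individual failure probabilities, and each added safeguard with uncovered failure modes would shrink it further.

```python
from math import prod

def joint_failure_probability(failure_probs: list[float]) -> float:
    """Chance that every safeguard fails simultaneously, assuming full
    independence. Correlated failure modes (e.g. two monitors trained on
    the same data) push the true number higher, which is why a new
    safeguard is worth the most when its failure modes aren't already
    covered by existing ones."""
    return prod(failure_probs)

# Three imperfect but independent safeguards:
print(joint_failure_probability([0.10, 0.05, 0.20]))  # 0.001
```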
Now, would some of these safeguards be built with the aid of earlier versions of AI? I'd guess yes. This could take a few forms, ranging from "formally-verified direct programs that the earlier AI generated along with their proofs of safety and effectiveness" to "smaller neural nets (cut-down versions of earlier AIs?) that monitor the training of the new AI's net" to "weaker full-blown AIs watching the decisions output by the new full-blown AI". I suspect any of these might be useful.
The Godzilla complaint seems to be about that last thing, having weaker AIs monitor the new AI. Well, um, it seems to me that, if the weaker AIs' output is restricted to the one-bit signal "do we hit the killswitch on the new AI" (and that signal is executed by a machine), then there's no danger of Godzilla destroying Tokyo after it has taken down Mega-Godzilla. (It's like cutting off Godzilla's limbs and tying him to Mega-Godzilla with his jaws open around Mega-Godzilla's neck, so his only available action is biting down.) There is a danger, if this happens a lot, of the subsequent training of the new AI merely teaching it to evade detection by the weaker predecessors; this danger should be taken seriously; but it is also less bad than having Mega-Godzilla kill us the first time it started going rogue; the safeguard is net positive.
Addressing specific claims:
If, among our many precautions, we have AI 1 oversee AI 2 (in this scheme I describe where AI 1's only available action is to output a bit that shuts off AI 2)[3], and AI 2 starts to go rogue, and AI 1 fails in its task, then we're no worse off than if we lacked that particular precaution—either the other precautions save us or they don't. If AI 1 succeeds, then we have at least some chance of being able to debug what went wrong with AI 2, and save ourselves in the future. Using AI 1 as overseer is strictly beneficial, minus implementation costs. (There's a second-order risk of someone overestimating how much it protects us and then choosing to be more reckless, but that's a different kind of thing.)
If "goes wrong" means "it kills us all", then yeah. If "goes wrong" means "it starts acting dangerous but we manage to shut it off", that is potentially recoverable.
As for "if we had better ways to detect misalignment in the child we'd already have used them on the parent": the parent could be aligned but incompetent (heh, I think many people know parents like that), while the child is unaligned and more competent. (Why would the programmers create the child if it weren't expected to be more competent than the parent?) It seems easy to imagine alignment checks that the parent would pass while the child would fail.
[1] The checksums are computed by software running on the CPU, albeit with hardware-accelerated instructions; the RAID logic is clearly software; the LDPC decoding is done by more specialized hardware; the ECC RAM is hardware, but the choice of how to respond to double-bit errors is software.
[2] ECC RAM is one example: the application can just assume that memory accesses are correct, and that double-bit errors will cause the operating system to kill the application—so the application itself needs zero lines of code about RAM errors, though now there does need to be monitoring to restart the process and/or fail over to another machine.
Another example is using a durable database, and going from "just increment and decrement the variables in memory" to "create a database transaction, to be attempted / rolled back as a unit". That does require code changes in the application itself.
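A minimal sketch of that change, using Python's built-in sqlite3 as a stand-in for a durable database (the accounts table and transfer function are hypothetical):

```python
import sqlite3

# Stand-in durable store; a real deployment would use an on-disk database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.commit()

def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Apply both halves of the transfer as one unit: they commit together,
    or, if anything fails partway, neither change applies."""
    with conn:  # scopes a transaction: commit on success, rollback on exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
```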
[3] In practice we'd probably want it to output something more like a number between 0 and 1 representing its worry level, and the killswitch likely goes off if either a single overseer exceeds some danger threshold or multiple overseers exceed some lower threshold.
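A sketch of that trigger rule, with made-up thresholds (in practice they'd be tuned against false-positive and false-negative rates):

```python
def should_trip_killswitch(worry_levels: list[float],
                           single_threshold: float = 0.9,
                           group_threshold: float = 0.5,
                           group_size: int = 2) -> bool:
    """Trip if any single overseer is very worried, or if several
    overseers are moderately worried at once."""
    if any(w >= single_threshold for w in worry_levels):
        return True
    return sum(w >= group_threshold for w in worry_levels) >= group_size

print(should_trip_killswitch([0.2, 0.95, 0.1]))  # True: one past the high bar
print(should_trip_killswitch([0.6, 0.55, 0.1]))  # True: two past the lower bar
print(should_trip_killswitch([0.6, 0.30, 0.1]))  # False
```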
I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment - for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways), but with interpretability tools that good we could probably just make a plan without Godzilla have a decent chance of working (or at least failing in detectable-in-advance ways) by doing bas...