Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Clutching a bottle of whiskey in one hand and a shotgun in the other, John scoured the research literature for ideas... He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks-and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs.

- James Mickens, The Slow Winter

There’s a lot of AI alignment strategies which can reasonably be described as “ask Godzilla to prevent Mega-Godzilla from terrorizing Japan”. Use one AI to oversee another AI. Have two AIs debate each other. Use one maybe-somewhat-aligned AI to help design another. Etc.

Alignment researchers discuss various failure modes of asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. Maybe one of the two ends up much more powerful than the other. Maybe the two make an acausal agreement. Maybe the Nash Equilibrium between Godzilla and Mega-Godzilla just isn’t very good for humans in the first place. Etc. These failure modes are useful for guiding technical research.

… but I worry that talking about the known failure modes misleads people about the strategic viability of Godzilla strategies. It makes people think (whether consciously/intentionally or not) “well, if we could handle these particular failure modes, maybe asking Godzilla to prevent Mega-Godzilla from terrorizing Japan would work”.

What I like about the Godzilla analogy is that it gives a strategic intuition which much better matches the real world. When someone claims that their elaborate clever scheme will allow us to safely summon Godzilla in order to fight Mega-Godzilla, the intuitively-obviously-correct response is “THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO”.

“But look!” says the clever researcher, “My clever scheme handles problems X, Y and Z!”



“Ok, but what if we had a really good implementation?” asks the clever researcher.



“Oh come on!” says the clever researcher, “You’re not even taking this seriously! At least say something about how it would fail.”

Don’t worry, we’re going to get to that. But before we do: let’s imagine you’re the Mayor of Tokyo evaluating a proposal to ask Godzilla to fight Mega-Godzilla. Your clever researchers have given you a whole lengthy explanation about how their elaborate and clever safeguards will ensure that this plan does not destroy Tokyo. You are unable to think of any potential problems which they did not address. Should you conclude that asking Godzilla to fight Mega-Godzilla will not result in Tokyo’s destruction?

No. Obviously not. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. You may not be able to articulate why the answer is obviously “no”, but asking Godzilla to fight Mega-Godzilla will still obviously destroy Tokyo, and your intuitions are right about that even if you are unable to articulate clever arguments.

With that said, let’s talk about why those intuitions are right and why the Godzilla analogy works well.

Brittle Plans and Unknown Unknowns

The basic problem with Godzilla plans is that they’re brittle. The moment anything goes wrong, the plan shatters, and then you’ve got somewhere between one and two giant monsters rampaging around downtown.

And of course, it is a fundamental Law of the universe that nothing ever goes exactly according to plan. Especially when trying to pit two giant monsters against each other. This is the sort of situation where there will definitely be unknown unknowns.

Unknown unknowns + brittle plan = definitely not rising property values in Tokyo.

Do we know what specifically will go wrong? No. Will something go wrong? Very confident yes. And brittleness means that whatever goes wrong, goes very wrong. Errors are not recoverable, when asking Godzilla to fight Mega-Godzilla.

If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it. If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error; it’s the AIs themselves which we depend on to notice problems. If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error; if we had better ways to detect misalignment in the child we’d already have used them on the parent.

The real world will always throw some unexpected problems at our plans. When asking Godzilla to fight Mega-Godzilla, those problems are not recoverable. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

Meta note: I expect this post to have a lively comment section! Before you leave the twentieth comment saying that maybe Godzilla fighting Mega-Godzilla is better than Mega-Godzilla rampaging unchallenged, maybe check whether somebody else has already written that one, so I don't need to write the same response twenty times. (But definitely do leave that comment if you're the first one, I intentionally kept this essay short on the assumption that lots of discussion would be in the comments.)


He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

I happen to work for a company whose software uses checksums at many layers, and RAID encoding and low-density parity-check codes at the lowest layers, to detect and recover from hardware failures. It works pretty well, and the company has sold billions of dollars of products of which that is a key component. Also, many (most?) enterprise servers use RAM with error-correcting codes; I think the common configuration allows it to correct single-bit errors and detect double-bit errors, and my company's machines will reset themselves when they detect double-bit errors and other problems that impugn the integrity of their runt... (read more)

One important difference between data storage vs computation or AI: courtesy of Shannon and Hamming, we have a really good understanding of information transmission (which includes information storage). All those nice error-correction codes are downstream of very well-understood theory.

If we had theory as solid as information theory for AI and alignment, then yeah, I'd be a hell of a lot more optimistic about using one AI to oversee another somewhere in the process. Like, imagine we had the alignment analogue of an error-detecting code which provably detects two-bit errors and corrects one-bit errors with only a logarithmic amount of overhead. With theory that strong (and battle-tested in reality) it becomes plausible that unknown unknowns won't inevitably ruin all our plans.
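To make the "corrects one-bit errors, detects two-bit errors" guarantee concrete, here is a minimal sketch of an extended Hamming (8,4) code. The function names `encode` and `decode` are made up for illustration, and the tiny block size is chosen only for readability; real ECC memory uses much wider words. In the general Hamming construction, r parity bits cover 2^r - r - 1 data bits, which is where the logarithmic overhead comes from.

```python
def encode(data4):
    """Encode 4 data bits as an 8-bit extended Hamming codeword."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4          # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4          # parity over codeword positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]   # positions 1..7
    overall = 0
    for bit in word:
        overall ^= bit          # extra parity bit over the whole word
    return word + [overall]


def decode(word8):
    """Return (status, data4); status is 'ok', 'corrected' or 'double_error'."""
    w = list(word8[:7])
    # The syndrome is the XOR of the (1-based) positions holding a 1.
    # It is 0 for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos
    overall = 0
    for bit in word8:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat it as a single flip and repair it.
        if syndrome != 0:
            w[syndrome - 1] ^= 1
        # If syndrome == 0, only the overall parity bit flipped; data is intact.
        status = "corrected"
    else:
        # Parity looks even but the syndrome is nonzero: two bits flipped.
        return "double_error", None
    return status, [w[2], w[4], w[5], w[6]]
```

The guarantee is exactly as wide as the theory says and no wider: three simultaneous flips can be silently "corrected" into the wrong data. That is fine for memory because we have good evidence about how often bits flip together, which is the kind of evidence we lack for AI failures.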

Well, the basic idea "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is a simple mathematical fact.  "What is the probability of this safeguard failing to detect a rogue AI?" is hard to answer, but "What might this new safeguard do that the other safeguards don't do?" is easier.
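As a rough illustration of that arithmetic, here is a minimal sketch. The function name `p_all_fail` and the "common cause" parameter are assumptions introduced for this example; the common-cause term is a deliberately crude stand-in for correlated failure modes, not a model anyone in this thread proposed.

```python
import math


def p_all_fail(miss_probs, p_common_cause=0.0):
    """Probability that every safeguard misses the problem.

    miss_probs: per-safeguard miss probabilities, treated as independent
        of each other unless the common cause fires.
    p_common_cause: hypothetical probability of a shared failure mode
        (for example, one trick that fools every check the same way),
        in which case all safeguards miss together.
    """
    independent_part = math.prod(miss_probs)
    return p_common_cause + (1.0 - p_common_cause) * independent_part


# One hundred weak checks, each of which misses 95% of the time.
weak_checks = [0.95] * 100
print(p_all_fail(weak_checks))                       # independent misses
print(p_all_fail(weak_checks, p_common_cause=0.10))  # with a shared failure mode
```

With fully independent misses, a hundred checks that each miss 95% of the time jointly miss well under 1% of the time. Once there is a 10% chance of a shared failure mode, the joint miss probability cannot drop below 10% no matter how many checks are stacked, so the answer is dominated by the correlation term rather than by the number of safeguards.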

For example.  If interpretability work gets anywhere, then one might imagine a suite of safeguards that check for parts of the developing neural net that compute things like "how to detect security holes in C or machine code" or "how quickly humans die to certain poisons" (when that's not supposed to be the goal); safeguards that check for parts of the net that have many nodes and are not understandable by the other safeguards; safeguards that inspect the usage of CPU or other resources and have some idea of what's usual; safeguards that try to look for the net thinking strategically about what resource usage looks natural; and so on.  These safeguards might all suck / only work in a small fraction of cases, but if you have hundreds or thousands of them, then your odds might get decent.

Or, at least... (read more)

I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment - for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways), but with interpretability tools that good we could probably just make a plan without Godzilla that has a decent chance of working (or at least failing in detectable-in-advance ways) by doing basically the same things minus Godzilla. It's the interpretability tools which take that plan from "close to zero chance of working" to "close to 100% chance of working"; the interpretability is where all the robustness comes from. The Godzilla part adds relatively little and is plausibly net negative (due to making the ML components more complex and brittle).

(Another minor point: "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is only true when the "safeguards" are guaranteed to not increase the chance of failure.)

And—as stated, each of

... (read more)
My reply to the top-level post here is also relevant as a reply to this specific comment.

Individual humans do make out much better when they get to select between products from competing companies rather than monopolies, benefitting from companies going out of their way to demonstrate when their products are verifiably better than rivals'. Humans get treated better by sociopathic powerful politicians and parties when those politicians face the threat of election rivals (e.g. no famines). Small states get treated better when multiple superpowers compete for their allegiance. Competitive science with occasional refutations of false claims produces much more truth for science consumers than intellectual monopolies. Multiple sources with secret information are more reliable than one.

It's just routine for weaker less sophisticated parties to do better in both assessment of choices and realized outcomes when multiple better informed or powerful parties compete for their approval vs just one monopoly/cartel.

Also, a flaw in your analogy is that schemes that use AIs as checks and balances on each other don't mean more AIs. The choice is not between monster A and monsters A plus B, but between two copies of monster A (or a double-size monster A), and a split of one A and one B, where we hold something of value that we can use to help throw the contest to either A or B (or successors further evolved to win such contests). In the latter case there's no more total monster capacity, but there's greater hope of our influence being worthwhile and selecting the more helpful winner (which we can iterate some number of times).

So, the analogy here is that there's hundreds (or more) of Godzillas all running around, doing whatever it is Godzillas want to do. Humanity helps out whatever Godzillas humanity likes best, which in turn creates an incentive for the Godzillas to make humanity like them.


Still within the analogy: part of the literary point of Godzilla is that humanity's efforts to fight it are mostly pretty ineffective. In inter-Godzilla fights, humanity is like an annoying fly buzzing around. The humans just aren't all that strategically relevant. Sure, humanity's assistance might add some tiny marginal advantage, but from a Godzilla's standpoint that advantage is unlikely to be enough to balance the tactical/strategic disadvantages of trying not to step on people.

... and that all seems like it should carry over directly to AI, once AI gets to-or-somewhat-past human level, and definitely by the time we get to strongly superhuman intelligence. Even with just human level, the scaling/coordination/learning advantages of being able to cheaply copy a mind are probably enough for the AIs to reasonably-quickly achieve strategic dominance by enough mar... (read more)

Wei Dai

I was going to make a comment to the effect that humans are already a species of Godzilla (humans aren't safe, human morality is scary, yada yada), only to find you making the same analogy, but with an optimistic slant. :)

Competition between the powerful can lead to the ability of the less powerful to extract value. It can also lead to the less powerful being more ruthlessly exploited by the powerful as a result of their competition. It depends on the ability of the less powerful to choose between the more powerful. I am not confident humanity or parts of it will have the ability to choose between competing AGIs.

This happens during fine-tuning training already, selecting for weights that give the higher human-rated response of two (or more) options. It's a starting point that can be lost later on, but we do have it now with respect to configurations of weights giving different observed behaviors.

James Mickens is writing comedy. He worked in distributed systems. A "distributed system" is another way to say "a scenario in which you absolutely will have to use software to deal with your broken hardware". I can 100% guarantee that this was written with his tongue in his cheek.

The modern world is built on software that works around HW failures. 

  • You likely have ECC RAM in your computer.
  • There are checksums along every type of data transfer (Ethernet frame check sequences, IP header checksums, UDP datagram checksums, ICMP checksums, eMMC checksums, cryptographic auth for tokens or certificates, etc).
  • An individual SSD or HDD has algorithms for detecting and working around failed blocks / sectors in HW.
  • There are fully redundant processors in safety-critical applications using techniques like active-standby, active-active, or some manner of voting for fault tolerance. 
  • In anything that involves HW sensors, there's algorithms like an extended Kalman filter for combining the sensor readings to a single consistent view of reality, and stapled to that are algorithms for determining when sensors are invalid because they've railed high, railed low, or otherwise failed in a manner
... (read more)

I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).

Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that enough failures happening together by coincidence to break the guarantees of the system is rare. That is:

  • RAID is based on the assumption that hard drive failures happen mostly independently, so that the probability of too many drives failing at once is sufficiently low. Even in practice this assumption becomes a problem because a) drives purchased in the same batch will have correlated failures and b) rebuilding an array puts strain on the remaining drives, and people have to plan around this by adding more margin of error.
  • Checksums and ECC are robust against the occasional bitflip. This is because occasional bitflips are mostly random, and getting bitflips that just happen to set the checksum correctly is very rare. Checksums are not robust against so
... (read more)
I alluded to this above in many examples, but let's just do a theoretical calculation as well so it's not just anecdotes. Suppose we have some AI "Foo" that has some probability of failure, or P(Failure of Foo). Then the overall probability of the system containing Foo failing is P(Failure of System) == P(Failure of Foo).

On the other hand, suppose we have some AI "Bar" whose goal is to detect when AI "Foo" has failed, e.g. when Foo erroneously creates a plan that would harm humans or attempt to deceive them. We can now calculate the new likelihood of P(Failure of System) == P(Failure of Foo) * P(Failure of Bar), where P(Failure of Bar) is the likelihood that either Bar failed to detect the issue with Foo, or that Bar successfully detected the issue with Foo, but failed to notify us. These probabilities can be related in some way, but they don't have to be.

It is possible to drastically reduce the probability of a system failing by adding components within that system, even if those new components have chances of failure themselves. In particular, so long as the requirement allocated to Bar is narrow enough, we can make Bar more reliable than Foo, and then lower the overall chance of the system failing. One way this works is by limiting Bar's functionalities so that if Bar failed, in isolation of Foo failing, the system is unaffected. In the context of fault tolerance, we'd refer to that as a one-fault tolerant system. We can tolerate Foo failing -- Bar will catch it. And we can tolerate Bar failing -- it doesn't impact the system's performance. We only have an issue if Foo failed and then Bar subsequently also failed.
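Spelled out, the calculation above looks like the following. The numbers are invented purely for illustration, and the derivation assumes (as the comment does) that Bar failing on its own does not harm the system.

```latex
\begin{align*}
P(\text{system fails})
  &= P(\text{Foo fails} \wedge \text{Bar misses}) \\
  &= P(\text{Foo fails}) \cdot P(\text{Bar misses} \mid \text{Foo fails}).
\end{align*}

% Independent case: P(Bar misses | Foo fails) = P(Bar misses).
% Illustrative numbers: P(Foo fails) = 0.1, P(Bar misses) = 0.01.
\begin{align*}
P(\text{system fails}) &= 0.1 \times 0.01 = 0.001.
\end{align*}

% Correlated case: the failures of Foo that matter are also the ones Bar
% tends to miss, e.g. P(Bar misses | Foo fails) = 0.5.
\begin{align*}
P(\text{system fails}) &= 0.1 \times 0.5 = 0.05.
\end{align*}
```

The simple product P(Failure of Foo) * P(Failure of Bar) is the special case where the two failures are independent. The quantity that actually matters is how often Bar misses given that Foo has failed, and that conditional probability is the one that is hard to estimate when Foo's failures may be adversarial rather than random.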

Ok, but why isn't it better to have Godzilla fighting Mega-Godzilla instead of leaving Mega-Godzilla unchallenged?

Because Tokyo still gets destroyed. Important thing to bear in mind here: the relevant point for comparison is not the fantasy-world where the Godzilla-vs-Mega-Godzilla fight happens exactly the way the clever elaborate scheme imagined. The relevant point for comparison is the realistic-world where something went wrong, and the elaborate clever scheme fell apart, and now there's monsters rampaging around anyway.
I would file this under
Thomas Larsen
Epistemic Status: Low. Very likely wrong but would like to understand why.

It seems easier to intent align a human level or slightly above human level AI (HLAI) than a massively smarter than human AI.

Some new research options become available to us once we have aligned HLAI, including:

  • The HLAI might be able to directly help us do alignment research and solve the general alignment problem.
  • We could run experiments on the HLAI and get experimental evidence much closer to the domain we are actually trying to solve.
  • We could use the HLAI to start a training procedure, a la IDA.

These schemes seem fragile, because 1) if any HLAIs are not aligned, we lose, and 2) if the training up to superintelligence process fails, due to some unknown unknown or the HLAI being misaligned or through any of the known failure modes, we lose.

However, 1) seems like a much easier problem than aligning an arbitrary intelligence AI. Even though something could likely go wrong aligning a HLAI, it also seems likely that something goes wrong if we try to align an arbitrary intelligence AI. (This seems related to security mindset... in the best case world we do just solve the general case of alignment, but that seems hard.) For 2), the process of training up to superintelligence seems like a HLAI would help more than it hurts. If the HLAI is actually intent aligned, this seems like having a fully uploaded alignment researcher, which seems less like getting Godzilla to fight and more like getting a Jaeger to protect Tokyo.
The relevant point isn't "the realistic world, where the clever scheme fell apart", the relevant point is "the realistic world, where there is some probability of the clever scheme falling apart, and you need to calculate the expectation of that probability, and that expectation could conceivably go down when you add Godzilla". Or to put it another way, even if the worst case is as bad, the average case could still be better. Analyzing the situation in terms of "what if the clever plan fails" is looking only at the worst case.
Søren Elverlin
In-universe, Mecha-Godzilla had to be built with a Godzilla-skeleton, which caused both to turn against Humanity. It feels probable that there will be substantial technical similarities between Production Superintelligences and Alignment Superintelligences, which could cause both of them to turn against us. (Epistemic state: Low confidence)

This post is one more addition to the worrying trend in LW that asks for black and white solutions as if there were no middle ground. Would you say that having no army is better than having an army at all? I would feel more comfortable knowing that we have Godzilla on our side than having nothing.

This. A lot of the blame goes to MIRI viewing AI Alignment discretely, rather than continuously, as well as a view that only heroic or pivotal acts save the world. This tends to be all or nothing, and generates all-or-nothing views.
Lone Pine
I really wish David Chapman and his ideas were a more active part of this discussion.
Can you give some context?
Lone Pine
David Chapman talks about ways of thinking and is influenced by Buddhism and LW-style rationality. I've read his website-book "Meaningness" and I'm starting to read his new website-book "In the Cells of the Eggplant". His twitter has a link to this page which seems like the right place to start reading his work. He would describe EY's way of thinking as "rationalist eternalism" and "fixated". (He should not be confused with the guy who shot John Lennon.)
Lone Pine
The perfect has become the enemy of the good.

One thing I can end up worrying about is that useful tricks get ignored due to a dynamic of:

  1. A person tries to overextend the useful trick beyond its range of applicability such that it turns into a godzilla strategy
  2. Everyone starts associating the trick with the godzilla strategy
  3. People don't consider using the trick within the range where it is actually applicable

For instance, consider debate. Debate is not magic and there's lots of things it can't do. But (constructively understood) logical operators such as "for all" and "exists" can be given meaning using a technique called "game semantics", and "debate" seems like a potential way to implement this in AI.
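As a minimal sketch of what game semantics means here: a quantified statement is read as a game between a Verifier, who picks witnesses for each "exists", and a Falsifier, who picks challenges for each "for all". The statement counts as true exactly when the Verifier has a winning strategy. The function name `verifier_wins` and the brute-force search over a small finite domain are assumptions made for illustration; this is not how any actual debate proposal is implemented.

```python
def verifier_wins(quantifiers, domain, predicate, chosen=()):
    """Decide whether the Verifier has a winning strategy.

    quantifiers: sequence of "exists" / "forall", outermost first.
    domain: finite collection of values each player may pick from.
    predicate: function taking one argument per quantifier, in order.
    chosen: values picked so far (used by the recursion).
    """
    if not quantifiers:
        # All moves have been made; the Verifier wins iff the claim holds.
        return bool(predicate(*chosen))
    head, rest = quantifiers[0], quantifiers[1:]
    outcomes = (
        verifier_wins(rest, domain, predicate, chosen + (value,))
        for value in domain
    )
    if head == "exists":
        return any(outcomes)   # Verifier moves: one good pick is enough
    if head == "forall":
        return all(outcomes)   # Falsifier moves: every pick must be survivable
    raise ValueError(f"unknown quantifier: {head!r}")


def sums_to_multiple_of_five(a, b):
    return (a + b) % 5 == 0


# "For every x there is a y with x + y divisible by 5": the Verifier answers each x.
print(verifier_wins(["forall", "exists"], range(5), sums_to_multiple_of_five))
# "There is one y that works for every x": the Verifier must commit first.
print(verifier_wins(["exists", "forall"], range(5), sums_to_multiple_of_five))
```

The two calls differ only in who moves first, and they give different answers: the first statement is true and the second is false. That order of moves is the extra expressive power the game reading provides.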

Can this do even a fraction of the things that people want debate to do? No. Can I think of anything that needs these game semantics? Not right now, no. But is it a tool that seems potentially powerful for the future? Yeah, I'd say so; it expands the range of things we can express, should we ever find a case where we want to express it, and so it is a good idea to be ready to deploy it.

There's an easy solution to this: just say that some class of tricks seems potentially useful, and explore what it can be used for, without proposing solutions. There's no need to immediately jump to proposing solutions all the time.

I am not saying that alignment is easy to solve, or that failing it would not result in catastrophe. But all these arguments seem like universal arguments against any kind of solution at all, just because it will eventually involve some sort of Godzilla. It is like somebody tries to make a plane that can fly safely and not fall from the Sky, and somebody keeps repeating "well, if anything goes wrong in your safety scheme, then the plane will fall from the Sky" or "I notice that your plane is going to fly in the Sky, which means it can potentially fall from it".

I am not saying that I have better ideas about checking whether any plan will work or not. They all inevitably involve Godzilla or Sky. And the slightest mistake might cost us our lives. But I don't think that pointing repeatedly at the same scary thing, which will be one way or the other in every single plan, will get us anywhere.

I expect there are ways of dealing with Godzilla which are a lot less brittle.

If we had excellent detailed knowledge of Godzilla's internals and psychology, if we knew what sort of things will drive Godzilla into a frenzy or slow him down or put him to sleep, if we knew how to get Godzilla to go in one direction rather than another, if we knew when and how tests on small lizards would generalize to Godzilla... those would all be robustly useful things. If we had all those pieces plus more like them, then it starts to look like a scenario where dealing with Godzilla is basically viable. There's lots of fallback options, and many opportunities to recover from errors. It's not a brittle situation which falls apart as soon as something goes wrong.

This seems to contradict what I interpreted as the message of your post; that message being, if someone gives you a "clever" strategy for dealing with Godzilla, the correct response is to just troll them because Godzilla is inherently bad for property values. But what you're doing now is admitting that if the scheme to control Godzilla is clever in such and such ways, which you specifically warned against, then actually it might not be so brittle.
The key distinction is between clever methods for controlling something one does not understand, vs clever methods for controlling something one does understand. (The post didn't go into that because it was short rather than thorough, but it did come up elsewhere in the comments.)
Jeff Rose
It suggests putting more weight on a plan to get AI Research globally banned. I am skeptical that this will work (though if burning all GPUs would be a pivotal act the chances of success are significantly higher), but it seems very unlikely that there is a technical solution either. In addition, at least some purported technical solutions to AI risk seem to meaningfully increase the risk to humanity. If you have someone creating an AGI to exercise sufficient control over the world to execute a pivotal act, that raises the stakes of being first enormously which incentivizes cutting corners. And, it also makes it more likely that the AGI will destroy humanity and be quicker to do so.

But of course you can use software to mitigate hardware failures, this is how Hadoop works! You store 3 copies of every data, and if one copy gets corrupted, you can recover the true value. Error-correcting codes is another example in that vein. I had this intuition, too, that aligning AIs using more AIs will obviously fail, now you made me question it.
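A minimal sketch of the "three copies, recover the true value" idea, using a plain majority vote. This is an illustration of the general technique only, not a description of how Hadoop implements recovery internally; the function name `recover` is made up for the example.

```python
from collections import Counter


def recover(replicas):
    """Return the value held by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, because in
    that case there is no principled way to tell which copy is correct.
    """
    value, votes = Counter(replicas).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority among replicas; cannot recover")
    return value


# One of three copies has been corrupted; the other two outvote it.
print(recover([b"payload", b"payload", b"pay\x00oad"]))
```

The vote only works under the assumption that corruption hits copies independently. If two copies are corrupted in the same way, the vote confidently returns the wrong value, which is the storage version of the correlated-failure worry raised elsewhere in this thread.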

That is also progress.

The non-straw versions of Godzilla Strategies do not start from the Godzilla fighting Mega-Godzilla. Starting from this side is doomed.

It starts with, let's say, a Tokyo policeman. Notably, a Tokyo policeman isn't a scary monster, but roughly a normal human, with whom you can get some sort of mutual understanding. The next step is to create a policeman[1], who also isn't a scary monster, but is just a slightly more powerful, trained policeman (maybe using a bunch of policeman[0]). Then, if the relation "gen[n+1] is doing what gen[n] wants" holds at every step, the idea is that you get to super-Tokyo-police, who is still doing what you want. Or you get somewhere midway, where the still-aligned policeman[p] tells you "sorry, the next gen would really be a Godzilla, and I don't know how to avoid it".

(This isn't to express opinions on the viability of the first step, or the amplification procedure.)

Alright, so, let's imagine a chain of 100... creatures... on a smooth spectrum from policeman to Godzilla, and each is trying to keep the next creature up the chain in check. And then the mayor attempts to direct Godzilla via the policeman at one end of this chain.


It's like someone took the Godzilla vs Mega-Godzilla plan, and said "this Godzilla-fights-Mega-Godzilla plan is WAY too simple and robust, what we need is a hundred levels of recursion to make ABSOLUTELY SURE that something goes wrong!".

Imagine more chains, often interlinked.

Some chain links will break.  Which is the point - single link failures are survivable. Also for sure there are some corrupt police officers in Tokyo, but they aren't such a big deal.

Thank you for this analogy. Your comment is apparently disagreed with but I find it perfectly encapsulates the silliness of the proposal by default.

I initially liked this post a lot, then saw a lot of pushback in the comments, mostly of the (very valid!) form of "we actually build reliable things out of unreliable things, particularly with computers, all the time". I think this is a fair criticism of the post (and choice of examples/metaphors therein), but I think it may be missing (one of) the core message(s) trying to be delivered. 

I wanna give an interpretation/steelman of what I think John is trying to convey here (which I don't know whether he would endorse or not): 

"There are important assumptions that need to be made for the usual kind of systems security design to work (e.g. uncorrelated failures). Some of these assumptions will (likely) not apply with AGI. Therefore, extrapolating this kind of thinking to this domain is Bad™️." ("Epistemological vigilance is critical")

So maybe rather than saying "trying to build robust things out of brittle things is a bad idea", it's more like "we can build robust things out of certain brittle things, e.g. computers, but Godzilla is not a computer, and so you should only extrapolate from computers to Godzilla if you're really, really sure you know what you're doing."

Fixing hardware failures in software is literally how quantum computing is supposed to work, and it's clearly not a silly idea.

Generally speaking, there's a lot of appeal to intuition here, but I don't find it convincing. This isn't good for Tokyo property prices? Well maybe, but how good of a heuristic is that when Mechagodzilla is on its way regardless?

Quantum computing is not a silly idea in principle (it's not that it couldn't be done); it is just much harder to get right on our first, critical try.

I'm surprised that this failure mode is so common. Like... obviously if you unleash one powerful but not well understood force to counteract another powerful but not well understood force, you will likely end up dealing with two powerful but not well understood forces. A magnified cane toad effect of sorts. 

Downvoted, this is very far from a well-structured argument, and doesn't give me intuitions I can trust either

I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

So either we:

  1. Create a kaiju we can trust (alignment)
  2. Prevent the creation of any kaiju (moratorium on some types of AI research)

But when option one is proposed, people say that it has proved to be probably infeasible, and when option two is proposed, people say that the political and economic systems at present cannot be shifted to make such a moratorium happen effectively. If you really believed that alignment was likely impossible, you would advocate for #2 even if you didn't think it was likely to happen due to politics. The pessimism here just doesn't make any sense to me.

Jeff Rose
I think people here are uncomfortable advocating for political solutions either because of their views of politics or their comfort level with it. You don't have to believe that alignment is impossible to conclude that you should advocate for a political/governmental solution. All you have to believe is that the probability of x-risk from AGI is reasonably high and the probability of alignment working to prevent it is not reasonably high. That seems to describe the belief of most of those on LessWrong.
I personally do not consider (1) to have been "proved to be probably infeasible". MIRI had, like, a dozen people working on it for a decade, which just isn't that much in the scheme of things. And even then, most of those people were not working directly on the core problems for most of that time. The evidence-of-hardness-from-people-trying-and-failing for alignment is not even remotely in the league of, say, P vs NP. (The evidence-of-hardness-from-people-trying-and-failing is enough that the first clever idea any given person has won't work, though. Or the fifth idea. Also, just counting MIRI's research understates the difficulty somewhat, since lots of people worked on various aspects of agent foundations over the past century.) Certainly I expect that (1) is orders of magnitude easier than (2).
This seems like a misunderstanding of "overseer"-type proposals. ~Nobody thinks alignment is impossible; what is being rejected is the idea of using unaligned AGIs (or aligned-because-they're-insufficiently-powerful AGIs) to reliably "contain" another unaligned AGI.

What if one of the Godzillas is a 1,000x sped-up brain emulation of Eliezer Yudkowsky? (Possibly self-modifying, possibly not)

[This comment is no longer endorsed by its author]
You would need a Godzilla to set that up before Mega-Godzilla shows up.
"Brain emulation" implies high resolution. A large transformer trained on predicting the activation rates of, say, 100k cortical electrodes situated over the left temporal and frontal lobes might get you most of the way there.
Right now, a brain model AGI seems much harder than a language model AGI (which might turn out to be good enough via a miracle of being in the same goal attainment attractor as humans), and by definition an AGI of unspecified nature is at most as difficult as that. It might be possible to ask a strawberry-aligned AGI to set up a brain model AGI, perhaps even for specific humans, and that seems like a more plausible plan to get there in time than developing it with human effort. (That's a more abstract wish than disabling specific computing devices, and likely harder to align.)
Lone Pine
I see no reason to believe that would be safe at all. It would be just as alien as any other superintelligence. (This has nothing to do with EY, you could put anyone else in that brain-scanner and I still wouldn't trust it.)

Not only is this post great, but it led me to read more James Mickens. Thank you for that! (His writings can be found here).

Thank you for writing this. I needed a conceptual handle like this to give shape to an intuition that's been hanging around for a while.

It seems to me that our current civilizational arrangement is itself poorly aligned, or at least prone to generating unaligned subentities. In other words, we have a generalized agent-alignment problem. Asking unaligned non-AI agents to align an AI is a Godzilla strategy, and as such, work on aligning already-existing entities is instrumental for AI alignment.

(On a side note, I suspect that there's a lot of overlap between AI alignment and generalized alignment but that's another argument entirely.)

My mentality as well. We can't even get corporations to stop polluting. Probably whatever solves AI alignment will also help align our egregores and vice versa.

The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

Having worked at Google for several years, they are legendary masters of "allo... (read more)

Point of clarification: Is the supervisor the same as the potentially faulting hardware, or are we talking about a different, non-suspect node checking the work, and/or e.g. a more reliable model of chip supervising a faster but less reliable one?
Generally each node involved is something like a rack-mounted server, or a virtual machine running on one, all of roughly comparable reliability (often only around the commodity level of reliability). The nodes running the checks may often themselves be redundant and cross-checked, or the whole system may consist of nodes that both do the work and cross-check each other. There are well-known algorithms for a group of nodes cross-checking each other that will provably always give the right answer as long as some suitably sized majority of them haven't all failed at once in a weirdly coordinated way, and knowing the reliability of your nodes (from long experience) you can choose the size of your group to achieve any desired level of overall reliability. Then you need to achieve the same things for network, storage, job scheduling, data-paths, updates and so forth: everything involved in the process. This stuff is hard in practice, but the theory is well-understood and taught in CS classes. With enough work on redundancy, cross-checks, and retries you can build arbitrarily large, arbitrarily reliable systems out of somewhat unreliable components. Godzilla can be trained to reliably defeat Mega-Godzilla (please note that I'm not claiming you can make this happen reliably the first time: initially there are invariably failure modes you hadn't thought of, causing you to need to do more work). The more unreliable your basic components, the harder this gets, and there's almost certainly a required minimum reliability threshold for them: if they usually die before they can even do a cross-check on each other, you're stuck. If you read the technical report for Gemini, in the section on training they explicitly mention doing engineering to detect and correct cases where a server has temporarily had a limited point-failure during a calculation due to a cosmic ray hit. They're building systems so large that they need to cope with failure modes that rare. They also maintain multiple
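To make the "choose the size of your group" point concrete, here is a minimal Python sketch of the arithmetic behind simple majority voting. It is an illustration only, not any particular production system's algorithm: the function names `majority_failure_probability` and `smallest_group_size` are hypothetical, and it assumes each node fails independently with the same probability `p`, which is exactly the "no weirdly coordinated failures" caveat above.

```python
from math import comb


def majority_failure_probability(n: int, p: float) -> float:
    """Probability that a strict majority of n nodes is NOT correct.

    Assumes (for illustration) that each node fails independently
    with probability p.
    """
    needed_correct = n // 2 + 1             # strict majority
    min_failures = n - needed_correct + 1   # failures that break the majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(min_failures, n + 1)
    )


def smallest_group_size(p: float, target: float, max_n: int = 999) -> int:
    """Smallest odd group size whose majority vote fails with probability <= target."""
    if not 0 <= p < 0.5:
        # With nodes wrong at least half the time, adding more nodes does not help.
        raise ValueError("per-node failure probability must be below 0.5")
    for n in range(1, max_n + 1, 2):
        if majority_failure_probability(n, p) <= target:
            return n
    raise ValueError("no group size up to max_n reaches the target")


if __name__ == "__main__":
    for target in (1e-3, 1e-6, 1e-9):
        print(target, smallest_group_size(0.01, target))
```

Under these assumptions, nodes that each fail 1% of the time need a group of 3 to get the majority-failure rate below one in a thousand, and a group of 7 to get it below one in a million. The `p < 0.5` check mirrors the minimum-reliability threshold mentioned above: if individual nodes are wrong more often than right, voting cannot rescue them. Correlated failures, which this sketch ignores, are where the real engineering difficulty lies.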

It's kind of an aside, but I think this about safety systems in general. Don't give me a backup system to shut down the nuclear reactor if the water stops pumping; design it so the reaction depends on the water. Don't give me great ways to dispose of a chemical that destroys your flesh if it touches you; don't make the chemical to begin with. Don't give me a super-strong set of policies to keep the function-gained virus in the lab; don't make function-gained viruses. Wish they'd listened to that last one 3 years ago.

Admittedly it may be too late in a lot of w... (read more)

I don't have a very insightful comment, but I strongly downvoted this post and I kinda feel the need to justify myself when I do that. 

Summary of post: John Wentworth argues that AI Safety plans which involve using powerful AIs to oversee other powerful AIs are brittle by default. In order to get such situations to work, we need to have already solved the hard parts of alignment, including having a really good understanding of our systems. Some people respond to these situations by thinking of specific failure modes we must avoid, but that approach of,... (read more)

Most AI safety criticisms carry a multitude of implicit assumptions. This argument grants the assumption and attacks the wrong strategy.
We are better off improving a single high-level AI than making a second one. There is no battle between multiple high-level AIs if there is only one.

It seems to me that it is quite possible that language models develop into really good world modelers before they become consequentialist agents or contain consequentialist subagents. While I would be very concerned with using an agentic AI to control another agentic AI for the reasons you listed and so am pessimistic about eg debate, AI still seems like it could be very useful for solving alignment.

Language models develop really good world models… primarily of humans writing text on the internet. Who are consequentialist agents, and are not fully aligned (in the absence of effective law enforcement) to other humans.

You seem to believe that any plan involving what you call "Godzilla strategies" is brittle. This is something I am not confident in. Someone may find some strategy that can be shown to not be brittle.

What I would actually claim is roughly:

  * Godzilla plans are brittle by default
  * In order for the plan to become not-brittle, some part of it other than the use-Godzilla-to-fight-Mega-Godzilla part has to "do the hard part" of alignment

You could probably bolt a Godzilla-vs-Mega-Godzilla mechanism onto a plan which already solved the hard parts of alignment via some other strategy, and end up with a viable plan.

Referring to all forms of debate, overseeing, etc. as "Godzilla strategies" is loaded language. Should we refrain from summoning Batman because we may end up summoning Godzilla by mistake? Ideally, we want to solve alignment without summoning anything. However, applying some humility, we should consider that the problem may be too difficult for human intelligence to solve.

I read your critique as roughly "Our prior on systems more powerful than us should be that they are not controllable or foreseeable. So when trying to use one system as a tool for another system's safety, we cannot even know all failure modes."

I think this is true if the systems are general enough that we cannot predict their behavior. However, my impression of, e.g., debate or AI helpers for alignment research is that those would be narrow, e.g., only next-token prediction. The Godzilla analogy implies something where we have no say in its design and cannot reason about its decisions, both of which seem off looking at what current language models can do.

What if we

resurrected literal Godzilla to the future to fight AI

like in ?
