One of the main principles of engineering safety is multilevel defence. When a nuclear bomb accidentally fell from the sky in the US, 3 of 4 defence levels failed. The last one prevented the nuclear explosion:

Multilevel defence is widely used in the nuclear industry and includes various passive and active safety systems, from the use of delayed neutrons for reaction activation up to control rods, containment buildings and exclusion zones.

Here, I present a look at AI safety from the point of view of multilevel defence. This is mainly based on two of my as yet unpublished articles: “Global and local solutions to AI safety” and “Catching treacherous turn: multilevel AI containment system”.

The special property of multilevel defence in the case of AI is that most of the protection comes from the first level alone, which is AI alignment. The other levels have progressively smaller chances of providing any protection, as the power of a self-improving AI will grow each time it breaks through another level. So we could ignore all levels after AI alignment, but, oh Houston, we have a problem: based on the current speed of AI development, it seems that powerful and dangerous AI could appear within several years, while AI safety theory needs several decades to be created.

The map is intended to demonstrate a general principle for classifying the defence levels in AI safety, not to list all known ideas on the topic. I have marked in yellow the boxes which, according to my understanding, are part of MIRI's plan.

I also add my personal probability estimates as to whether each level will work (under the condition that AI risks are the only global risk, and previous levels have failed).
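
Because each estimate is conditional on all previous levels having failed, the estimates combine by multiplying the failure probabilities. The following is only a minimal sketch of that arithmetic: the function name `overall_success` and the example numbers are placeholders chosen for illustration, not the estimates from the map.

```python
from typing import Sequence


def overall_success(conditional_success: Sequence[float]) -> float:
    """Probability that at least one defence level works.

    conditional_success[i] is assumed to be
    P(level i works | all earlier levels have failed).
    """
    p_all_fail = 1.0
    for p in conditional_success:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each estimate must be a probability in [0, 1]")
        # Every level must fail for the whole defence to fail.
        p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail


# Hypothetical placeholder values, NOT the estimates from the map.
print(overall_success([0.5, 0.5]))  # prints 0.75
```

With small conditional estimates on the later levels, the total stays close to the estimate for the first level, which is the point made above about alignment carrying most of the weight.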

The map is constructed on the same principles as my “plan of x-risks prevention” map and my “immortality map”, which are also built around the idea of multilevel defence.



CC from Facebook: Rob Bensinger: The boxes labeled "MIRI's strategy" in the chart above look mostly wrong to me at a glance. You can find a good summary of MIRI's thoughts about strategy here:

I'd also want to distinguish organizational strategy from strategic outlook. MIRI's organizational strategy is separate from the list of things we want some group in the world to do. (E.g., MIRI isn't an AGI shop.) There are also things like "an AI winter occurs" that can be helpful if they occur (to give more time for safety work), without being good intervention points for anyone.

Me: I will update based on your recent document.

A new version of the map, updated based on Rob Bensinger's comment about MIRI's 2017 strategy, is up.

The difference between explicit organisational strategy and strategic outlook is shown by different box colors.

Find the full version in the pdf:

I really like your map as it starts to give me a framework for dealing with the whole issue. The percentages of success are depressingly low.

Under "0 Preliminary measures" you don't put your personal estimate of the % chance of success. You also put in destruction of AI labs. Isn't destruction/taxation of semiconductor fabs an easier target (Wikipedia has a list)? I think they are also so expensive that they are harder to hide.

PS: spelling errors: desireble -> desirable in the bottom right yellow legend, and Prelimnary -> Preliminary.

Thanks for pointing out the errors. I don't think that anyone is actually plotting nuclear strikes on AI labs. However, Putin or Kim could think that it is their only chance to preserve power in the AI age. Anyway, it can't be regarded as a success: it is either a global catastrophe or a short delay.

But I will think about how to add an estimate for a successful AI ban.

A bit gloomy if the only options are global catastrophe or delayed catastrophe.

I thought we had 40 years, but with Elon Musk talking about 7-8 years for an AGI and with the recent 4-hour training to get to [world chess supremacy][chess], I am not so sure. So I think we need to buy some time. Even if you can't destroy the semiconductor fabs, you could still increase taxes. This could be marketed as helping to pay for society's dislocation while we undergo job losses.

I also think that there are only several years until dangerous AI.

See my presentation about it:

However, I think that war will only increase extinction risks, and even AI risks, as it will intensify the arms race and stop ethical thinking. Also, a strike on Silicon Valley will kill the best minds in AI safety, but some obscure Chinese labs will continue to exist.

Thank you for that reference. I hadn't seen a quantification of Bitcoin's computing capacity, which was interesting and high.

This is probably as good a place to talk about this as any:

I get a sense that most people who "understand" the alignment problem think boxing has almost no chance of working. While I completely agree that it is unsafe, that relying on it would be a horrible idea, and even that it is unlikely to work, I've never seen anything that has convinced me that it is in the 1% area. Usually, discussion (such as in Superintelligence) only goes far enough to point out why naive boxing will fail, and it is then assumed that the entire approach is a dead end, which might be a reasonable assumption but I don't find it obvious.

I'll briefly describe what kind of situation I'm envisioning so others can better refute it: suppose we built the potentially-misaligned AI in such a way that it can only communicate with 40 letters at a time, only a-z and whitespaces. Every message is always read by a group of gatekeepers; the AI won't be freed unless all of them agree to share their unique key. The line of questioning aims to get the AI to provide the gatekeepers with key insights about AI alignment, which I suspect tend to be a lot easier to verify than to come up with.
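
To make the setup concrete, here is a minimal sketch of the two checks it relies on: filtering the output channel and requiring unanimous gatekeeper consent. The names `is_valid_message` and `release_allowed` are hypothetical, and treating "whitespaces" as the plain space character only is an assumption of this sketch.

```python
import string
from typing import Iterable

MAX_LEN = 40  # at most 40 characters per message, as in the scenario
# Assumption: "whitespace" means the plain space character only.
ALLOWED = set(string.ascii_lowercase + " ")


def is_valid_message(message: str) -> bool:
    """Accept only messages of at most MAX_LEN characters drawn from a-z and space."""
    return len(message) <= MAX_LEN and all(ch in ALLOWED for ch in message)


def release_allowed(required_keys: Iterable[str], shared_keys: Iterable[str]) -> bool:
    """The AI is released only if every gatekeeper has shared their unique key."""
    required = set(required_keys)
    # An empty gatekeeper group never authorises release.
    return bool(required) and required <= set(shared_keys)


print(is_valid_message("try value learning"))         # True
print(is_valid_message("Free me!"))                   # False: uppercase and '!'
print(release_allowed({"k1", "k2", "k3"}, {"k1", "k2"}))  # False: one key is missing
```

This only models the channel and the release rule; it says nothing about whether the gatekeepers themselves can be persuaded, which is the part the scenario leaves open.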

I realize there are some optimistic assumptions built into that scenario (perhaps the project leading the charge won't even be that careful), however I think assigning it just 1% implies that even an optimistic scenario has very low chances of success. I also realize that there is the argument, "well even if I might not be able to come up with a way to crack this setup, the AI is much smarter and so it will". But to me that only proves that we should not rely on boxing, it doesn't prove that boxing won't work. Where is the confidence that such a way exists coming from? Lastly I'd single out threats like "I'll capture you all and torture you for eternity if you don't free me now" which I think can probably be dealt with.

I'm also wondering whether it would be a good idea if, hypothetically, some person spent a decade of their life thinking of how they would ideally box an AI, even if they expect it to likely fail.

There is a recent article by Stuart Armstrong which describes something similar about an AI Oracle, which is in fact a partly boxed AI.

Also, Roman Yampolskiy wrote an article on the topic

I also wrote a long early draft about it, which I could share privately, but which is at a rather early stage. My main idea about boxing is not to box a superintelligence, but to prevent an intelligence explosion inside the box by means of many independent circuit-breaker mechanisms.

Thank you.

The paper that most closely addresses my questions is this one: which is linked from the Yampolskiy paper you linked.

It didn't convince me that boxing is as unlikely to work as you suggest. What it mainly did is make me doubt the assumption that the AI has to use persuasion at all to escape, which I previously thought was very likely.

I may have overstated my doubts about boxing. It could be an effective local, one-time solution, but not for millions of AIs over decades. However, the boxing of nuclear power plants and bombs has been rather effective at preventing large-scale catastrophes for around 70 years. (In the case of Chernobyl, the distance from large cities was a form of boxing.)