The map of "Levels of defence" in AI safety

[-]avturchin8y60

CC from Facebook: Rob Bensinger: The boxes labeled "MIRI's strategy" in the chart above look mostly wrong to me at a glance. You can find a good summary of MIRI's thoughts about strategy here: https://intelligence.org/2017/12/01/miris-2017-fundraiser/#3.

I'd also want to distinguish organizational strategy from strategic outlook. MIRI's organizational strategy is separate from the list of things we want some group in the world to do. (E.g., MIRI isn't an AGI shop.) There are also things like "an AI winter occurs" that can be helpful if they occur (to give more time for safety work), without being good intervention points for anyone.

Me: I will update based on your recent document.

[-]avturchin8y10

A new version of the map, updated based on Rob Bensinger's comment about MIRI's 2017 strategy, is up.

The difference between explicit organisational strategy and strategic outlook is shown by different box colors.

Find the full version in the pdf: https://goo.gl/XH3WgK

[-]hum38y20

I really like your map, as it starts to give me a framework for dealing with the whole issue. The percentages of success are depressingly low.

Under "0 Preliminary measures" you don't give your personal estimate of the % chance of success. You also include destruction of AI labs. Wouldn't destruction or taxation of semiconductor fabs be an easier target (Wikipedia has a list)? They are also so expensive that I think they are harder to hide.

PS: spelling errors: "desireble" -> "desirable" in the bottom-right yellow legend, and "Prelimnary" -> "Preliminary".

[-]avturchin8y10

Thanks for pointing out the errors. I don't think that anyone is actually plotting nuclear strikes on AI labs. However, Putin or Kim could think that it is their only chance to preserve power in the AI age. In any case, it can't be regarded as success: it is either a global catastrophe or a short delay.

But I will think about how to add an estimate for a successful AI ban.

[-]hum38y20

A bit gloomy, as the only outcomes are global catastrophe or delayed catastrophe.

I thought we had 40 years, but with Elon Musk talking about 7-8 years for an AGI, and with the recent 4-hour training run that reached world chess supremacy, I am not so sure. So I think we need to buy some time. Even if you can't destroy the semiconductor fabs, you could still increase taxes on them. This could be marketed as helping to pay for society's dislocation while we undergo job losses.

[-]avturchin8y20

I also think that there are only several years until dangerous AI.

See my presentation about it:

https://www.academia.edu/34863051/Near-term

However, I think that war will only increase extinction risks, and even AI risks, as it will intensify the arms race and stop ethical thinking. Also, a strike on Silicon Valley would kill the best minds in AI safety, while some obscure Chinese labs would continue to exist.

[-]hum38y20

Thank you for that reference. I hadn't seen a quantification of Bitcoin's computing capacity before; it was interesting, and high.

[-]Rafael Harth8y20

This is probably as good a place to talk about this as any:

I get a sense that most people who "understand" the alignment problem think boxing has almost no chance of working. While I completely agree that it is unsafe, that relying on it would be a horrible idea, and even that it is unlikely to work, I've never seen anything that has convinced me that it is in the 1% area. Usually, discussion (such as in Superintelligence) only goes far enough to point out why naive boxing will fail, and it is then assumed that the entire approach is a dead end, which might be a reasonable assumption but I don't find it obvious.

I'll briefly describe what kind of situation I'm envisioning so others can better refute it: suppose we built the potentially misaligned AI in such a way that it can only communicate 40 characters at a time, using only a-z and whitespace. Every message is always read by a group of gatekeepers; the AI won't be freed unless all of them agree to share their unique key. The line of questioning aims to get the AI to provide the gatekeepers with key insights about AI alignment, which I suspect tend to be a lot easier to verify than to come up with.
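To make the setup concrete, here is a minimal sketch of just the two mechanical rules in that scenario: the restricted channel and the unanimous key release. The names `filter_message` and `release_allowed` are hypothetical, and the sketch says nothing about the hard part, namely whether gatekeepers who read the messages can be manipulated.

```python
import re

MAX_LEN = 40  # from the scenario: at most 40 characters per message
ALLOWED = re.compile(r"[a-z\s]*")  # only lowercase a-z and whitespace


def filter_message(msg: str) -> str:
    """Return msg unchanged if it fits the restricted channel, else raise."""
    if len(msg) > MAX_LEN:
        raise ValueError("message longer than 40 characters")
    if not ALLOWED.fullmatch(msg):
        raise ValueError("message contains characters outside a-z and whitespace")
    return msg


def release_allowed(gatekeepers: set[str], shared_by: set[str]) -> bool:
    """Release requires that every gatekeeper has shared their key."""
    # An empty gatekeeper set never authorizes release (an assumption of this sketch).
    return bool(gatekeepers) and gatekeepers <= shared_by
```

Anything that fails the filter is dropped before a human sees it, and a single holdout among the gatekeepers is enough to keep the box closed.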

I realize there are some optimistic assumptions built into that scenario (perhaps the project leading the charge won't even be that careful); however, I think assigning it just 1% implies that even an optimistic scenario has very low chances of success. I also realize that there is the argument, "well, even if I might not be able to come up with a way to crack this setup, the AI is much smarter and so it will". But to me that only proves that we should not rely on boxing; it doesn't prove that boxing won't work. Where is the confidence that such a way exists coming from? Lastly, I'd single out threats like "I'll capture you all and torture you for eternity if you don't free me now", which I think can probably be dealt with.

I'm also wondering whether it would be a good idea if, hypothetically, some person spent a decade of their life thinking of how they would ideally box an AI, even if they expect it to likely fail.

[-]avturchin8y10

There is a recent article by Stuart Armstrong that describes something similar for an Oracle AI, which is in fact a partly boxed AI: https://arxiv.org/abs/1711.05541

Also, Roman Yampolskiy wrote an article on the topic: https://arxiv.org/abs/1604.00545

I also wrote a long draft about it, which I could share privately, but it is still at a rather early stage. My main idea about boxing is not to box a superintelligence, but to prevent an intelligence explosion inside the box through many independent circuit-breaker mechanisms.

[-]Rafael Harth8y20

Thank you.

The paper that most closely addresses my questions is this one: http://cecs.louisville.edu/ry/LeakproofingtheSingularity.pdf which is referenced in the Yampolskiy paper you linked.

It didn't convince me that boxing is as unlikely to work as you suggest. What it mainly did is make me doubt the assumption that the AI has to use persuasion at all to escape, which I previously thought was very likely.

[-]avturchin8y10

I may have overstated my doubts about boxing. It could be an effective local, one-time solution, but not for millions of AIs over decades. However, the boxing of nuclear power plants and bombs has been rather effective at preventing large-scale catastrophes for around 70 years. (In the case of Chernobyl, the distance from large cities was a form of boxing.)
