It seems to me that no matter how many problems from different research agendas we solve, there is always the possibility that some 'unknown unknown' misalignment scenario can occur. I can imagine an approach of building model-agnostic, environment-agnostic minimal assumption alignment guarantees (which seems to be super hard), but I feel like things can go wrong in myriad other ways, even then.

Has there been any discussion about how we might go about this?

Donald Hobson gives a comment below explaining some reasoning around dealing with unknown unknowns, but it's not a direct answer to the question, so I'll offer one.

The short answer is "yes".

The longer answer is that this is one of the fundamental considerations in approaching AI alignment, and it is why some organizations, like MIRI, have taken an approach that doesn't drive straight at the object-level problem but instead tackles issues likely to be foundational to any approach to alignment that could work. In fact, you might say the big schism between MIRI and, say, OpenAI is that MIRI places greater emphasis on addressing the unknown, whereas OpenAI expects alignment to look more like an engineering problem with relatively small and not especially dangerous unknown unknowns.

(Note: I am not affiliated with either organization, so this is an informed opinion on their general approaches. Note also that neither organization is monolithic, and individual researchers vary greatly in their assessment of these risks.)

My own efforts at addressing AI alignment are largely about addressing these sorts of questions, because I think we still have a poor understanding of what alignment even really means. In this sense I know that there is a lot we don't know, but I don't know all of what we don't know that we'll need to (so: known unknown unknowns).

2 comments

If you add ad hoc patches until you can't imagine any way for it to go wrong, you get a system that is too complex to imagine. This is the "I can't figure out how this fails" scenario. It is going to fail for reasons that you didn't imagine.

If you understand why it can't fail, for deep fundamental reasons, then it's likely to work.

This is the difference between the security mindset and ordinary paranoia: the difference between adding complications until you can't figure out how to break the code, and proving that breaking the code is impossible (assuming the adversary can't get your one-time pad, it's only used once, your randomness is really random, your adversary doesn't have anthropic superpowers, etc.).
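To make the one-time pad comparison concrete, here is a minimal sketch in Python. The function names are my own for illustration; the point is that the security argument rests entirely on the stated assumptions (truly random key, same length as the message, used once), not on the complexity of the scheme:

```python
import secrets

def otp_xor(message: bytes, key: bytes) -> bytes:
    """XOR a message with a one-time pad key.

    Security depends on the assumptions, not the code: the key must be
    truly random, at least as long as the message, and never reused.
    """
    if len(key) < len(message):
        raise ValueError("key must be at least as long as the message")
    return bytes(m ^ k for m, k in zip(message, key))

msg = b"attack at dawn"
key = secrets.token_bytes(len(msg))  # fresh random key for this one message
ciphertext = otp_xor(msg, key)

# XOR is its own inverse, so applying the same key decrypts.
assert otp_xor(ciphertext, key) == msg
```

If any assumption fails (a reused key, predictable randomness), the proof no longer applies, which mirrors the comment's point: the "<1%" confidence is conditional on having good reason to believe the assumptions.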

I would think that the chance of serious failure in the first scenario is >99%, and in the second (assuming you're doing it well and the assumptions you rely on are things you have good reason to believe), <1%.

It looks like the MIRI team has its hands full with known unknowns. The main one (of those released publicly) seems to be what they describe as “embedded agency” or “Agent Foundations” work:

our goal in working on embedded agency is to build up relevant insights and background understanding, until we collect enough that the alignment problem is more manageable
