Are there any impossibility theorems for the existence of AI that is both strong and safe? I think such theorems would be interesting because they could help to evaluate proposals for safe AI: we could ask "which assumption does this proposal break?"

I have a vague sense that a theorem of this sort could be developed along the following lines:
 1. The kind of strong AI that we want is a technological tool that is easy to instruct and that can successfully do a wide variety of complex things when told
 2. Simple instructions + complex results -> AI has a lot of flexibility in its actions
 3. There are only a few ways to reliably achieve goals requiring complex behaviour, e.g. something approximating expected utility maximisation
 4. 2+3 + instrumental convergence -> flexibility is likely to be exploited in dangerous ways

Do fleshed out versions of this argument exist? Do you have any other ideas about impossibility theorems?

2 Answers

(Disclaimer: I've only skimmed this paper)

You might be interested in Impossibility results in AI: A Survey (Brcic & Yampolskiy 2022).

I think this sort of counts?

https://www.lesswrong.com/posts/WCX3EwnWAx7eyucqH/corrigibility-can-be-vnm-incoherent

I've also sort of derived some informal arguments myself in the same vein, though I haven't published them anywhere.

Basically, almost all of the focus is on creating and aligning a consequentialist utility maximizer, but such maximizers resist being corrected, tend to want to change your preferences, and so on, all of which seems bad for alignment.
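
To make the "doesn't like being corrected" point a bit more concrete, here is a toy sketch in Python. It is not taken from the linked post: the payoffs, the `resist_cost` parameter, and the function names are all made up for illustration. It only shows the basic shape of the incentive: an agent that maximizes expected utility over its own goal prefers to block a shutdown whenever blocking is cheap relative to the utility it would lose by being shut down.

```python
# Toy model with hypothetical numbers: the agent picks whichever action has
# the higher expected utility under its own utility function.

def expected_utility(p_shutdown: float, u_goal: float, u_shutdown: float) -> float:
    """Expected utility if the overseer shuts the agent down with probability p_shutdown."""
    return (1 - p_shutdown) * u_goal + p_shutdown * u_shutdown


def best_action(
    p_correction: float,
    u_goal: float = 1.0,       # utility of finishing the task (assumed)
    u_shutdown: float = 0.0,   # utility of being shut down (assumed)
    resist_cost: float = 0.01, # small cost of disabling the off-switch (assumed)
) -> str:
    """Return "resist" if blocking correction has strictly higher expected utility."""
    allow = expected_utility(p_correction, u_goal, u_shutdown)
    # Resisting drives the shutdown probability to zero but costs resist_cost.
    resist = expected_utility(0.0, u_goal, u_shutdown) - resist_cost
    return "resist" if resist > allow else "allow"


if __name__ == "__main__":
    for p in (0.0, 0.005, 0.5):
        print(p, best_action(p))
    # An agent that values shutdown as much as its goal has no reason to resist.
    print("indifferent:", best_action(0.5, u_shutdown=1.0))
```

In this toy setup the agent only accepts correction when correction is very unlikely or when it happens to value being shut down as much as achieving its goal, which is roughly the tension the informal arguments above point at.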

1 comment

For something to be a theorem, it has to be based on a sound set of axioms. So, I guess, the first question to ask is "What set of axioms can be useful for constructing a model of an emergent strong AI?" I suspect that some of the deep minds think about issues like that for a living, though I don't personally know if there is anything formal like that. It would be a very interesting development, though. Maybe it would let us formalize some of the ideas you are listing, as well as many others. There is certainly some work being published by MIRI and others, but at this point I don't think it is at the level required for anything like a no-go theorem, which is what you are asking for.