Suppose there is a useful formulation of the alignment problem that is mathematically unsolvable. Suppose that, as a corollary, modifying your own mind while guaranteeing any non-trivial property of the resulting mind is also impossible.
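
For intuition, a theorem of roughly this flavor does exist for ordinary programs: Rice's theorem says that no non-trivial semantic property of programs is algorithmically decidable. Below is a minimal sketch of that statement in LaTeX, offered only as an analogy; identifying "minds" with programs is an assumption of the sketch, not something the question asserts.

```latex
% Rice's theorem, stated as an illustration of the kind of impossibility
% result hypothesized above. Identifying "minds" with programs is an
% assumption made only for this sketch.
\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}
\newtheorem{theorem}{Theorem}
\begin{document}
\begin{theorem}[Rice]
Let $P$ be a non-trivial property of partial computable functions:
membership of $\varphi_e$ in $P$ depends only on the function computed,
and there exist indices $i$ and $j$ with $\varphi_i \in P$ and
$\varphi_j \notin P$. Then the index set
$\{\, e \in \mathbb{N} \mid \varphi_e \in P \,\}$ is undecidable.
\end{theorem}
\end{document}
```

Whether an analogous result could be proved about a mind trying to guarantee properties of its own successor is exactly what this question is asking about.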

Would that prevent a new AI from trying to modify itself?

Has this direction been explored before?

JBlack, May 08, 2023

It has been explored (multiple times even on this site), and doesn't avoid doom. It does close off some specific paths that might otherwise lead to doom, but not all or even most of them.

Some remaining problems:

  • An AI may be perfectly capable of killing everyone without self-improvement;
  • An AI may be capable of some large self-improvement step, but not aware of this theorem;
  • Self-improving AIs might not care whether the result is aligned with their former selves, and indeed may not even have any goals at all before self-improvement;
  • AIs may create smarter AIs without improving their own capabilities, knowing that the result won't be fully aligned but expecting that they can nevertheless keep the result under control (and be wrong about that);
  • In a population with many AIs, those that don't self-improve may be out-competed by those that do - leading to selection for AIs that self-improve regardless of consequences;
  • It is extremely unlikely that a mere change of computing substrate would meet the conditions of such a theorem, so an AI can almost certainly upgrade its hardware (possibly by many orders of magnitude) to run faster without modifying its mind in any fundamental way.

At this point my 5-minute timer on "think up ways things can still go wrong" ran out, and I just threw out the dumbest ideas and listed the rest. I'm sure with more thought other objections could be found.

Thanks!

You mention it has been explored multiple times even on this site. Do you have any specific posts in mind?

To be clear, I'm not suggesting that because of this possibility we can just hope that this is how it plays out and we will get lucky.

However, if we could find a hard limit like this, it seems like it would make the problem more tractable. It doesn't have to exist simply because we want it to exist. Searching for it still seems worthwhile.

Kinrany
The problem of creating a strong AI and surviving, that is. We'd still get Hanson's billions of self-directed ems (whole-brain emulations).