
Yeah, this seems right. I can only repeat my earlier praise for FRI's work on s-risks.

For a more technical angle, has anyone thought about making strong AIs stoppable by giving them wrong priors? For example, an AI doing physics research could start with a prior saying that the experiment chamber is the whole universe, that any "noise" coming in from outside is purely random and uncorrelated across time, and that a particular shape of "noise" should make the AI clean up and shut down. Since that prior gives zero weight to any structured outside world, no amount of observation or self-improvement can let the AI infer our existence, so we'll always be able to shut it down. That should be easy to formalize in a cellular automaton world, though real physics is of course much harder.
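
Here's a very rough Python sketch of what I mean in a toy cellular-automaton setting (names like `ChamberPrior` and `SHUTDOWN_PATTERN` are just placeholders, not a real formalization): the agent's prior treats the chamber grid as the entire universe, treats boundary bits as i.i.d. coin flips it can never learn anything from, and a designated bit pattern triggers cleanup and shutdown.

```python
# Toy illustration of the "wrong prior" idea in a cellular-automaton world.
# Everything here is illustrative, not a worked-out formalism.

import random

WIDTH, HEIGHT = 16, 16                         # the chamber the AI believes is the whole universe
SHUTDOWN_PATTERN = (1, 0, 1, 1, 0, 1, 0, 0)    # boundary "noise" that means: clean up and stop


class ChamberPrior:
    """The AI's deliberately wrong world model.

    It assumes the chamber grid is all that exists and that anything arriving at
    the boundary is i.i.d. uniform noise, uncorrelated across time. Under this
    prior, no sequence of observations can raise the probability of a structured
    outside world above zero.
    """

    def boundary_likelihood(self, bits):
        # Every boundary configuration is equally likely: pure coin flips.
        return 0.5 ** len(bits)

    def update(self, bits):
        # Bayesian updating is a no-op: i.i.d. noise carries no information
        # about anything beyond the chamber, so the posterior equals the prior.
        pass


class PhysicsAI:
    """Toy agent doing 'research' inside the chamber until it sees the stop signal."""

    def __init__(self):
        self.prior = ChamberPrior()
        self.grid = [[0] * WIDTH for _ in range(HEIGHT)]

    def observe_boundary(self):
        # In a real setup this would read the chamber's edge cells; here we fake
        # the incoming bits, occasionally injecting the operators' shutdown pattern.
        if random.random() < 0.05:
            return SHUTDOWN_PATTERN
        return tuple(random.randint(0, 1) for _ in SHUTDOWN_PATTERN)

    def step(self):
        bits = self.observe_boundary()
        self.prior.update(bits)          # learns nothing about the outside, by construction
        if bits == SHUTDOWN_PATTERN:
            self.clean_up()
            return False                 # stop the research loop
        # ... otherwise run one step of in-chamber physics research (omitted) ...
        return True

    def clean_up(self):
        # Reset the chamber to a blank state before halting.
        self.grid = [[0] * WIDTH for _ in range(HEIGHT)]
        print("Shutdown pattern observed: chamber cleaned, halting.")


if __name__ == "__main__":
    ai = PhysicsAI()
    steps = 0
    while ai.step():
        steps += 1
    print(f"Ran for {steps} steps before shutdown.")
```

The hard part, which the sketch glosses over completely, is making the prior precise enough that it survives self-improvement; the toy version just hard-codes the no-update behavior.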

This sounds totally convincing to me.

Do you think that ethical questions could be more relevant for this than they are for alignment? For example, the difference between [getting rid of all humans] and [uploading all humans and making them artificially incredibly happy] isn't important for AI alignment since they're both cases of unaligned AI, but it might be important when the goal is to navigate between different modes of unaligned AI.