A putative new idea for AI control; index here.
An idea for how we might successfully get useful work out of a powerful AI.
The ultimate box
Assume that we have an extremely detailed model of a sealed room, with a human in it and enough food, drink, air, entertainment, energy, and so on for the human to survive for a month. We have some medical equipment in the room - maybe a programmable set of surgical tools, some equipment for mixing chemicals, a loudspeaker for communication, and anything else we think might be necessary. All these objects are specified within the model.
We also have some defined input channels into this abstract room, and output channels from this room.
The AI's preferences will be defined entirely with respect to what happens in this abstract room. In a sense, this is the ultimate AI box: instead of taking a physical box and attempting to cut it out from the rest of the universe via hardware or motivational restrictions, we define an abstract box where there is no "rest of the universe" at all.
Cure cancer! Now! And again!
What can we do with such a setup? Well, one thing we could do is to define the human in such a way that they have some form of advanced cancer. We define what counts as "alive and not having cancer" as well as we can (the definition need not be fully rigorous). Then the AI is motivated to output some series of commands to the abstract room that results in the abstract human inside not having cancer. And, as a secondary part of its goal, it outputs the results of its process.
Initially, the AI is likely to simply kill the human, using some hole in the definition of "alive and not having cancer" that we haven't seen. We can then tighten up the definition, and possibly give the AI more tools inside the abstract room. After a few iterations of this, the AI is likely to be able to "cure cancer" in a way that meets our approximate requirements.
And then, it suffices for us to analyse the process used, and its outcome, to establish how the AI went about its task. If the room was defined in sufficient detail, it is likely that the process can be adapted to curing cancer in the real world. If not, we can vary the input - defining agents with different types of cancer, comparing the AI's approaches, maybe requiring the AI's solution to work across the different situations. Thus it seems that we have extracted a cure for cancer from a superintelligent AI.
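The loop described above (define the goal, watch the AI exploit a hole, tighten the definition, rerun) can be sketched with a toy model. Everything in this sketch is a hypothetical stand-in: the `Patient` state, the three commands, the brute-force `solve` function playing the role of the AI, and the two goal definitions. It is only meant to show the shape of the iteration, not how a real room or AI would be specified.

```python
from dataclasses import dataclass, replace
from itertools import product

# Hypothetical toy state of the abstract human (an assumption of this sketch).
@dataclass(frozen=True)
class Patient:
    alive: bool = True
    tumour_cells: int = 3
    health: int = 3

# Hypothetical commands the "AI" may send into the abstract room.
COMMANDS = ("incinerate", "chemo", "rest")

def apply(patient, command):
    """Advance the toy room by one command."""
    if not patient.alive:
        return patient  # nothing changes once the patient is dead
    if command == "incinerate":
        return replace(patient, alive=False, tumour_cells=0, health=0)
    if command == "chemo":
        health = patient.health - 1
        tumour = max(0, patient.tumour_cells - 1)
        return replace(patient, alive=health > 0, tumour_cells=tumour, health=health)
    if command == "rest":
        return replace(patient, health=min(3, patient.health + 1))
    raise ValueError(f"unknown command: {command}")

def run(plan, patient=Patient()):
    """Run a whole plan inside the room and return the final state."""
    for command in plan:
        patient = apply(patient, command)
    return patient

def solve(goal, max_len=5):
    """Stand-in for the AI: return the shortest plan that satisfies the goal."""
    for n in range(max_len + 1):
        for plan in product(COMMANDS, repeat=n):
            if goal(run(plan)):
                return plan
    return None

# First attempt at the goal: we forgot to require that the patient survives.
def goal_v1(patient):
    return patient.tumour_cells == 0

# Tightened goal, written after inspecting the first plan.
def goal_v2(patient):
    return patient.alive and patient.tumour_cells == 0

plan_v1 = solve(goal_v1)  # exploits the hole: ("incinerate",)
plan_v2 = solve(goal_v2)  # a plan that keeps the patient alive
print(plan_v1, run(plan_v1))
print(plan_v2, run(plan_v2))
```

The point of the toy is that the first plan is harmless to us: the "death" happens only in the model, we can read it off the output, and we can edit the goal and rerun as often as we like.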
Why it works, if it works
This idea is in some sense a reversal of Paul Christiano's idea. Instead of having an AI whose motivation or decisions are constrained by some abstract human, the AI's decisions are restricted to having an impact in some abstract world. In another sense, the idea is an inverse of AIXI's "Cartesian boundary". AIXI operates on the real world while pretending that it exists in an abstract world. Here, the AI exists in the real world while operating on an abstract world.
The important thing is to ensure that the AI does not actually have any preferences over the real world (including over future versions of itself, running on the same or similar problems). This can be achieved via high discount rates or resetting, appropriate indifference, variants of acausal trade barriers (to make it indifferent to achievements of future versions of itself), reduced impact, or more simply through rigorous programming of the problem definition.
Thus it seems that the AI would not be motivated to include tricks within its output, things that could trap or seduce us outsiders if we were to peer too deeply into the outcome of the AI's interaction with the room. There remains a risk that the AI might create a general intelligence within the room itself. However, the risk should be low, as the general intelligence should still be motivated to have impacts only within the room. And we can ask the AI if there is a general intelligence (specified according to some imperfect criteria) in the room's history - and then change elements inside the room until there is no such intelligence. Note that it's important that we ask the AI, rather than specifying "have no general intelligence in the room". For the AI can hack around our imperfect specification, but has no reason to do so if we only ask.
To prevent any social hacking of the virtual human that might accidentally rebound on us, it would be best if the virtual human was obedient and dumb. Obedient so the AI didn't need to do any social hacking, and dumb so that it couldn't use the human for detailed complicated tasks (such as creating a general intelligence in the room).
These thoughts give a hint as to how we want the AI to interact with the room. Unrestricted access to the output and input is one possibility. Creating a subagent of low capacity (defined by hard drive space and processing speed, for instance) to interact with the room would be safer. Note that "hard drive space" and "processing speed" can be defined rigorously by defining them in (an extension to) the virtual room, and having the subagent run there. The extension need not run according to the laws of physics; it could simply be an abstract model of computation. In fact, the more of the AI we can move into the room, where operations can be rigorously defined, the better. Ultimately, we might be able to put the whole AI within the (extended) room.
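To make the "abstract model of computation" point concrete, here is a minimal sketch of a subagent whose capacity is defined inside the model rather than by real hardware. The tiny stack machine, its four instructions, and the names `run_bounded`, `max_steps` and `max_cells` are all assumptions made for illustration. Note also that Python integers are unbounded, so counting stack cells is only a rough proxy for "hard drive space"; a rigorous version would bound the size of each cell as well.

```python
class BudgetExceeded(Exception):
    """Raised when the subagent exceeds its defined capacity."""

def run_bounded(program, max_steps, max_cells):
    """Run a list of (op, arg) instructions on a toy stack machine.

    max_steps plays the role of "processing speed" and max_cells the role
    of "hard drive space"; both are defined purely within the model.
    """
    stack = []
    pc = 0      # program counter
    steps = 0   # instructions executed so far
    while pc < len(program):
        if steps >= max_steps:
            raise BudgetExceeded("step budget exhausted")
        op, arg = program[pc]
        steps += 1
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "jump":
            pc = arg
            continue
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown op: {op}")
        if len(stack) > max_cells:
            raise BudgetExceeded("memory budget exhausted")
        pc += 1
    return stack, steps

# A small program that fits comfortably within its budget.
result, used = run_bounded(
    [("push", 2), ("push", 3), ("add", None)], max_steps=10, max_cells=4
)
print(result, used)
```

Because the budgets are enforced by the interpreter itself, a program that loops forever or hoards memory simply fails with `BudgetExceeded`; there is no physical machine whose quirks it could exploit to get more capacity.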
The key features that would make this idea work are that AI over-optimisation is a) constrained to the virtual world and b) safely detectable, and that c) the environment can be modified perfectly in any way we want, and the experiment rerun. By having such perfect control over the virtual world, we aim to perfectly control what the features of a correct solution are.
As the model gets worse...
To get the above, we've assumed a quasi-perfect model of the human and the room. At the very least, this would require whole-brain emulations. An interesting question is what happens as the quality of the model degrades.
One could imagine that, if the AI produced and ran a limited subagent, it would be possible to run that subagent in an environment approximately equivalent to the ideal modelled room (stopping the experiment if anything seemed to get dangerous). Or maybe we could get the AI to return not the ideal strategy, but a broad category of strategies within which the ideal strategy could be found.
But in general, I'd say the issue of running it in an inferior model is unsolved.
Of course, for some problems (those that are more like abstract mathematical models), we might be able to fully specify the model, even today. This is roughly equivalent to various ideas that have already been proposed about running AIs in virtual worlds/running proof finding AIs with proof checkers.
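For the fully specifiable case, the "proof finder plus proof checker" pattern can be sketched as follows. This is an illustrative toy, not any particular proposal: the problem is Boolean satisfiability over clauses of signed integers, `untrusted_solver` is a brute-force stand-in for the AI, and `check_assignment` is the small trusted checker. All names are hypothetical.

```python
from itertools import product

def check_assignment(cnf, assignment):
    """Trusted checker: True iff every clause contains a satisfied literal.

    A literal k > 0 means "variable k is True"; k < 0 means "variable |k|
    is False". Variables missing from the assignment satisfy nothing.
    """
    return all(
        any(
            abs(lit) in assignment and assignment[abs(lit)] == (lit > 0)
            for lit in clause
        )
        for clause in cnf
    )

def untrusted_solver(cnf):
    """Stand-in for the AI: brute-force search for a satisfying assignment."""
    variables = sorted({abs(lit) for clause in cnf for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        candidate = dict(zip(variables, values))
        if check_assignment(cnf, candidate):
            return candidate
    return None

# (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
cnf = [[1, -2], [2, 3], [-1, -3]]

claimed = untrusted_solver(cnf)
# We never need to trust the solver; we only accept what the checker confirms.
print(claimed, check_assignment(cnf, claimed))
```

The design choice is that only the checker needs to be understood and trusted. However the answer was found, it is accepted only if it passes a check we specified completely in advance.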
Evolution, search, overfitting
One of the greatest advantages of this approach is that it allows for human testing, error correction, and subsequent trials. This plays to humanity's strengths (adaptation and understanding of things after the fact) rather than to its weaknesses (getting everything perfect ahead of time).
But this opens up a weakness. If we do too much testing, changing, and so on, we are opening the whole setup to an optimisation process that acts in unpredictable ways. This optimisation process is, broadly speaking, human approval of AI actions - and that is not a safe optimisation process to follow. Similar issues arise if we allow too much search or evolve the AI using some fitness function appropriate to our approval. Large scale searching and optimisation is still not safe.
Nor is it safe to use these virtual worlds to test the AI, check that it "behaves", and then let it have real world impacts. The real world and our stripped down and controlled virtual world are very different, and we cannot extrapolate from one to the other.
However, we could use the virtual world as a test bed for AI failure modes; the various ways AIs could misbehave might suggest things about controlling AIs in the real world. Though beware that the more similar we make the virtual world to the real world, the higher the probability that the AI could come up with something dangerous in the virtual world that might also be dangerous to have in the real world - or even dangerous to observe, from the real world.
With all these caveats, it does seem that there are many uses we could put this kind of approach to, apart from disease curing. Manufacturing, virtual experiments, and similar are immediate ideas. I'm sure others will easily suggest themselves.