A putative new idea for AI control; index here.
This idea, due to Eric Drexler, is to separate out the different parts of an AI into modules. There would be clearly designated pieces, either physical or algorithmic, each playing a specific role: this module would contain the motivation, this module the probability estimator, this module the models of the outside world, this module the natural language understanding unit, etc.
It's obvious how such a decomposition would be useful for many of the methods I've been detailing here. We could also distil each module - reduce it to a smaller, weaker (?) and more understandable submodule, in order to better understand what is going on. In one scenario, an opaque AI gets to design its successor, in the form of a series of such modules.
This property seems desirable; the question is, how could we get it?
EDIT: part of the idea of "modules" is that AIs often need to do calculations or estimations that would be of great value to us if we could access them in isolation. This idea is developed more in these posts.
Designing in modules
The main threat here is that a given submodule would contain more than just the properties we want. After all, a natural language parser could consist of a general intelligence plus a motivation to understand language. Another possible worry is that the modules are overfitted to the problem or to each other: the language parser works perfectly, but only in this one AI design.
There are several ways we could try and combat this.
- Minimum length to the program (Eric's original main idea). The module must be as short as possible for its purpose.
- If possible, start with a modularised AI, and track how the modules change as the AI's intelligence grows.
- Have the AI extend the module to make it more transparent, without adding any extra functionality.
- Check that the input-output map the AI presents (maybe require it to present it as a simple primitive recursive function with clear and simple "for" loops) is of complexity commensurate with the task at hand.
- Add some noise to the situation: the new AI will not get exactly the same data as the old, the modules may get somewhat altered, and still have to work. Bugs will be introduced (this approach helps to combat the overfitting aspect of minimum length description).
- Require that the module be "distant", according to some metric of modification to the module's program, from having other capabilities. For instance, a general intelligence motivated to understand language is easy to make into a model-making module, simply by changing the motivation; more narrow modules are probably more distant from each other.
- Similarly, require that the performance of the modules degrades if the world (or the world's description) is substantially different, but not if the same information is merely presented in a very different way.
- General anti-overfitting methods, calibrated to not generate a push towards general abilities in the modules. This would require some skill and a lot of thought.
- Focus on requiring statistical accuracy, rather than perfect performance.
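As a toy illustration of how a few of these criteria interact (minimum length, added noise, and statistical rather than perfect accuracy), here is a minimal sketch. Everything in it is an assumption made for illustration: the "task" (deciding whether an integer is even), the three candidate modules, the use of source length as a crude stand-in for program length, and the 0.95 threshold. It is not a proposal for how real modules would be evaluated.

```python
import random

def is_even(x):
    """Hypothetical task the module is supposed to perform."""
    return x % 2 == 0

# Hypothetical candidate modules, given as source text so that their
# length can be measured (a crude proxy for program length).
CANDIDATES = {
    "overfit": "lambda x: x < 1",  # right on the original data only
    "narrow": "lambda x: x % 2 == 0",  # does the task and nothing else
    "padded": "lambda x: x % 2 == 0 and sum(range(50)) >= 0",  # extra baggage
}

def accuracy(module, inputs):
    """Fraction of inputs on which the module agrees with the task."""
    return sum(module(x) == is_even(x) for x in inputs) / len(inputs)

def shortest_passing(candidates, inputs, threshold=0.95):
    """Shortest candidate whose accuracy on `inputs` meets the threshold."""
    passing = [
        (len(source), name)
        for name, source in candidates.items()
        if accuracy(eval(source), inputs) >= threshold  # eval: toy code only
    ]
    return min(passing)[1] if passing else None

# The data the original AI happened to see.
original_inputs = [0, 1]

# "Noise": the successor sees different data drawn from a wider range.
rng = random.Random(0)
fresh_inputs = [rng.randint(0, 1000) for _ in range(500)]

print(shortest_passing(CANDIDATES, original_inputs))
print(shortest_passing(CANDIDATES, fresh_inputs))
```

On the original data alone, minimum length picks the overfitted candidate. Once the data is perturbed, that candidate fails the accuracy threshold and the shortest survivor is the narrow one. The padded candidate is equally accurate but is never selected, because it is longer than the task requires.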
If we allow false counterfactuals, then we can also:
- Imagine that the modules would be connected up to other modules, made by (hypothetical) different AIs, with potentially different motivations.
- Require that the modules not be useful as seeds for an AI of opposite motivation. For instance, an AI with opposite motivation would be allowed to flip a certain number of bits; if there was a general intelligence piece inside the module, then that would be something easier to co-opt, so dangerous to hand to an AI with opposite motivation. This is the "opposed" version of requiring modules to be "distant" from each other's capabilities.
- In general, all the reduced impact/satisficing requirements push towards having modules that cannot become general intelligences.
- The AI will dictate how to program these modules to a smart programmer, who will (counterfactually) neither survive nor get to program the modules in the outside world.
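The "distance" and bit-flipping requirements can be made slightly more concrete with a toy sketch. It assumes, purely for illustration, that a module is a fixed-length byte string, that the modification metric is Hamming distance (the number of flipped bits), and that we somehow already have a list of "dangerous" variants to compare against. That last assumption is the hard part in practice, and nothing here addresses it.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of bit positions at which two equal-length programs differ."""
    if len(a) != len(b):
        raise ValueError("programs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def resists_cooption(module: bytes, dangerous_variants, flip_budget: int) -> bool:
    """True if no dangerous variant is reachable within `flip_budget` bit flips.

    `dangerous_variants` is a hypothetical list of programs that would have
    unwanted capabilities (or serve an opposite motivation).
    """
    return all(hamming(module, d) > flip_budget for d in dangerous_variants)

# Hypothetical one-byte "programs".
module = bytes([0b00001111])
dangerous = [bytes([0b00001100]), bytes([0b11110000])]

print(hamming(module, dangerous[0]))  # 2 bits differ
print(resists_cooption(module, dangerous, flip_budget=1))
print(resists_cooption(module, dangerous, flip_budget=3))
```

With a budget of one flip, the module is out of reach of both dangerous variants. With a budget of three, the first variant (two flips away) becomes reachable, so the module would be rejected.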
Obviously, anti-restriction-hacking would be useful for module separation (and vice versa).
This is only the beginning of the process of defining module separation, but it would be great to have a safe(ish) method of separating modules in this way.