A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.
First, imagine an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.
What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.
Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: AI understanding of what human understanding of Q is, actual answer to Q (given this understanding), human understanding of various AI's answers (using model of human understanding), and minimum divergence between human understanding of answer and actual answer.
All these pieces are there, and if they can be safely extracted, the minimum divergence can be calculated and the actual answer calculated.