A putative new idea for AI control; index here.

EDIT: the mathematics of this approach can be found here.

In a previous post, I described a procedure for querying an AI about whether a certain outcome was desirable for it. When I did so, I didn't realise the power of the idea.

In brief, that idea allows us to query an AI with utility v as to whether an outcome X or an outcome ¬X was the one it most desired in a particular situation. By judicious use of 'false beliefs', the AI would react as if it had miraculous powers to determine X vs ¬X in ways that no-one could ever know about.

Many (including Toby Ord) have suggested variants on the 'honey pot' idea: leave something available to the AI that looks like an oversight, but is actually a trap, to see if the AI goes beyond the parameters of what we'd expect. Maybe a wire to the internet that the AI is not supposed to use, but that seems to be unobserved.

The problem with these ideas is that the honey pots have to be physically constructed and physically accessible to the AI, and the AI has to be wrong about whether it is a trap or not.

But using the method above, we can go much further. We can test the AI in any scenario which we can imagine, as long as we can precisely physically describe it. We can check what would happen if the AI was given the ability to launch the world's nuclear arsenals, or divert an asteroid towards Earth. We can check what it would design if it was given unrestricted access to a DNA synthesiser. We could see what it would do if it could rewrite the text of a speech an important world leader was due to give. We could see whether it would sabotage the creation of other AIs - and if so, which ones. We could see what it did to a probe humanity planned to launch to the stars, what numbers it would call if it had access to a phone, what it would purchase if it had capital.

And, in all cases, the AI would be behaving as if its actions were undetectable and certain to succeed, while we ensured that they were (almost certainly) detectable and certain to fail. There is no "almost" in that "certain to fail" condition, because the AI is acting as if it had miraculous powers that it doesn't have.

Now, we can't use testing as our sole method of AI control (for reasons similar to here), but it does allow us some pretty effective ways of catching bad AIs.

14 comments

Are these posts reproduced somewhere else? Do you have an AI blog where you post them? I think these AI control posts deserve more attention than they gain here (for whatever reason).

Some of them are posted on a private working blog, but they all can be accessed via http://lesswrong.com/lw/lt6/newish_ai_control_ideas/

Are you assuming some way of preventing the AI from acting strategically, with the knowledge that some of its choices are likely to be tests? If the AI is, in fact, much smarter than the gatekeepers, wouldn't you expect that it can determine what's a trap and what's a real decision that will actually lead to the desired (by the AI) outcome?

I think there's a timeframe issue as well. You might be able to simulate an immediate decision, but it's hard to test what an AI will do with some piece of information after it's had 10 years of self-improvement and pondering.

Are you assuming some way of preventing the AI from acting strategically, with the knowledge that some of its choices are likely to be tests?

I've used utility indifference to avoid that. The AI is perfectly aware that these are, in fact, tests. See http://lesswrong.com/r/discussion/lw/mrz/ask_and_ye_shall_be_answered/ and http://lesswrong.com/r/discussion/lw/lyh/utility_vs_probability_idea_synthesis/ and ask me if you have further questions.

As I keep pointing out, "utility indifference" is exactly the same as giving an infinitely strong prior.

So you might as well say "we gave the AI an infinitely strong prior that the event wasn't a test."

What does a reasoner do when it sees evidence that it's being tested, given that it has an infinitely strong prior that it isn't? It's going to behave pretty unpredictably, probably concluding it is fundamentally mistaken about the world around it. Utility indifference results in equally unpredictable behavior. Now it's not because the agent believes it is fundamentally mistaken, but because it only cares about what happens conditioned on being fundamentally mistaken.

You can arrange things to minimise the problems - for instance having the outcome X or ¬X be entirely random, and independent of the AI's impacts.
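One way to see why that independence matters is a toy calculation. The sketch below is only an illustration under stated assumptions, not the construction from the linked posts: the action names, the payoff numbers and the probabilities are all made up, and "caring only about X-worlds" is modelled crudely as a utility that is zero whenever ¬X happens. It compares an agent that scores actions as if X were certain with an agent that only cares about worlds where X occurs.

```python
# Toy comparison (all names and numbers are hypothetical illustrations).
ACTIONS = ["behave", "defect"]

# Assumed payoff of each action in worlds where X happens.
PAYOFF_IF_X = {"behave": 1.0, "defect": 5.0}


def choose_conditioning_on_x():
    """Agent that scores each action by u(a, X), as if X were certain."""
    return max(ACTIONS, key=lambda a: PAYOFF_IF_X[a])


def choose_caring_only_about_x_worlds(p_x_given_action):
    """Agent whose utility is zero outside X-worlds.

    Its expected utility for action a is P(X | a) * u(a, X).
    """
    return max(ACTIONS, key=lambda a: p_x_given_action[a] * PAYOFF_IF_X[a])


# Case 1: X is random and independent of what the agent does.
independent = {"behave": 0.001, "defect": 0.001}

# Case 2: the agent's action changes how likely X is.
influenced = {"behave": 0.5, "defect": 0.01}

print(choose_conditioning_on_x())
print(choose_caring_only_about_x_worlds(independent))
print(choose_caring_only_about_x_worlds(influenced))
```

When P(X) does not depend on the action, it is a common factor and both agents rank the actions identically, however small P(X) is. Once the action can shift P(X), the two rankings can come apart, because the second agent now has a reason to steer towards worlds where X happens.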

I wondered about that too, but this is not like a child who might notice that his parents are not actually looking when they say that they are. The AI is built in such a way, via the techniques mentioned, that it doesn't matter whether the AI understands the reasons, because it treats them as irrelevant. It might make a difference only if we have overlooked some escape route, but an exhaustive search could plug such holes.

We don't need to describe the scenarios precisely physically. All we need to do is describe them in terms of the agent's epistemology, with the same sort of causal surgery as described in Eliezer's TDT. Full epistemological control means you can test your AI's decision system.

This is a more specific form of the simulational AI box. The rejection of simulational boxing I've seen relies on the AI being free to act and sense with no observation possible, treating it like a black box, and somehow gaining knowledge of the parent world through inconsistencies and probabilities and escaping using bugs in its containment program. White-box simulational boxing can completely compromise the AI's apparent reality and actual abilities.

Maybe we could use it in a more recursive form. I could ask AI X what AI X would do if it had an opportunity to do bad thing Y: would you kill me if you had a gun? While it could lie here, we may have other ways to test its general honesty. Maybe by using two different AIs: I ask AI X what AI Z would do if it had access to Y. Or by running many trials.

On 'certain to fail'... what if it would have pursued plan X that requires only abilities it has, but only if it had ability Y that it doesn't have and you made it think it has, that comes up in a contingency that turns out not to arise?

Like for a human, "I'll ask so-and-so out, and if e says no, I'll leave myself a note and use my forgetfulness potion on both of us so things don't get awkward."

Only for a world-spanning AI, the parts of the contingency table that are realizable could involve wiping out humanity.

So we're going to need to test at the intentions level, or sandbox.

This is a good point. The theory of the second best implies that if you take an optimal recommendation and take away any of the conditions, there is no guarantee that the remaining components are optimal without the missing one.

For links, you need to put "[" brackets around the text and "(" around the link ;-)

Thanks, fixed.

A better theory of counterfactuals - that can deal with events of zero probability - could help here.