Humans get different counterfactuals

by Stuart_Armstrong, 23rd Mar 2015

Counterfactuals
Personal Blog

A putative new idea for AI control; index here.

Many of my ideas rely on taking counterfactuals over events that have only a tiny probability ε of ever happening. This introduces some extra risks, mainly centred around human behaviour. For the counterfactual methods to work, we need the alternate world to be sensible, without things going crazy due to human action.

For instance, imagine that X defines an event where an AI would get turned on, with ¬X (of probability ε) corresponding to the AI failing to get turned on. If ¬X actually occurs, there are two risks. The first is that humans react by saying: "wow, an event of probability ε actually happened; all our models must be wrong! Let's go crazy! Yay!". The second is that humans react by saying: "well, that was embarrassing - let's turn the AI on anyway."

To avoid this issue, imagine the following setup:

The "ON" signal first goes through an event Y, which has a 99% chance of letting it through, and then through the event X, which (as before) has a 1-ε chance of letting it through. The setup is designed so that humans cannot distinguish between ¬Y (the signal gets blocked at the first stage) and ¬X (the signal gets blocked at the second stage). This only needs to fool humans, not the AI itself.
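To make the setup concrete, here is a minimal Python sketch of the two-stage gate. It assumes the stages are independent and applied in sequence, so X is only consulted when Y lets the signal through; the names `send_on_signal`, `human_view` and `tally` are hypothetical and chosen for illustration only.

```python
import random


def send_on_signal(eps, rng, p_y_pass=0.99):
    """Push one "ON" signal through stage Y, then stage X.

    Returns "on", "blocked_at_Y" (the ¬Y case) or "blocked_at_X" (the ¬X case).
    Assumes the two stages are independent and sequential.
    """
    if rng.random() >= p_y_pass:
        return "blocked_at_Y"  # ¬Y: happens with probability 1 - p_y_pass
    if rng.random() < eps:
        return "blocked_at_X"  # ¬X: happens with probability eps, given Y passed
    return "on"


def human_view(outcome):
    """What humans can observe: only whether the AI ended up on or off."""
    return "on" if outcome == "on" else "off"


def tally(eps, trials, seed=0):
    """Count how often each outcome occurs over many simulated trials."""
    rng = random.Random(seed)
    counts = {"on": 0, "blocked_at_Y": 0, "blocked_at_X": 0}
    for _ in range(trials):
        counts[send_on_signal(eps, rng)] += 1
    return counts
```

The AI would have access to the full outcome, while humans only ever see the coarser `human_view` result, which is the same for both kinds of blockage.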

The AI defines counterfactuals, as before, by looking at ¬X (possibly conditioning on Y versus ¬Y, if this is needed). Everything proceeds as previously from its perspective.

From the human perspective, however, the ¬X world is not distinguishable from the ¬Y one. Given (¬Y or ¬X), humans would conclude that ¬Y is the much more likely option:

P(¬Y | ¬Y or ¬X) ≈ 1 - 100ε.
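As a rough check of this estimate, the posterior can be computed exactly. The sketch below assumes the sequential reading of the setup (the signal reaches X only if Y lets it through), under which the exact value is 0.01 / (0.01 + 0.99ε), or about 1 - 99ε for small ε. For tiny ε that is practically the same as the figure above. The function name `posterior_not_y` is hypothetical.

```python
from fractions import Fraction


def posterior_not_y(eps, p_y_pass=Fraction(99, 100)):
    """Exact P(¬Y | signal was blocked), under the sequential-stage assumption."""
    p_not_y = 1 - p_y_pass    # blocked at the first stage
    p_not_x = p_y_pass * eps  # passed Y, then blocked at the second stage
    return p_not_y / (p_not_y + p_not_x)


eps = Fraction(1, 10**6)
post = posterior_not_y(eps)
print(post, float(post))
```

Using `Fraction` keeps the arithmetic exact, which matters here because ε is meant to be far smaller than floating-point intuition handles comfortably.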

So the ¬X counterfactual world (for the AI) is one where humans behave as if they were in the ¬Y world.

And ¬Y has one chance in a hundred of happening, which is unlikely, but not so unlikely that humans would assume their whole model of reality is wrong. It is also likely enough that humans would give serious thought to what to do in the ¬Y case, maybe arranging various pre-commitments or insurance options. That makes it possible to arrange (with high probability) that humans don't just ignore the result and try again immediately.

Note that this method can't be used (obviously) if ¬X is something hideously dangerous (like an unleashed UFAI), but in all other cases, it seems implementable.
