Humans get different counterfactuals

Stuart_Armstrong

A putative new idea for AI control; index here.

A lot of my ideas rely upon taking counterfactuals of events that have tiny ε probabilities of ever happening. This introduces some extra risks, mainly centred around human behaviour. For the counterfactual methods to work, we need the alternate world to be sensible, without things going crazy due to human action.

For instance, imagine that X defines an event where an AI would get turned on, with ¬X (of probability ε) corresponding to the AI failing to get turned on. There are two risks here. The first is that humans would react by saying "wow, an event of probability ε actually happened; all our models must be wrong! Let's go crazy! Yay!". The second is that humans react by saying: "well, that was embarrassing - let's turn the AI on anyway."

To avoid both issues, imagine the following setup:

The "ON" signal first goes through an event Y, which has a 99% chance of letting it through, then to the event X, which (as before) has a 1-ε chance of letting it through. The setup is designed so that humans cannot distinguish between ¬Y (the signal gets blocked at the first stage) and ¬X (the signal gets blocked at the second stage). This only needs to fool humans, not the AI itself.

The AI defines counterfactuals, as before, by looking at ¬X (possibly conditioning on Y versus ¬Y, if this is needed). Everything proceeds as previously from its perspective.

From the human perspective, however, the ¬X world is not distinguishable from the ¬Y one. Given (¬Y or ¬X), humans would conclude that ¬Y is the much more likely option:

P(¬Y|¬Y or ¬X)≈1-100ε.

So the ¬X counterfactual world (for the AI) is one where humans behave as if they were in the ¬Y world.
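The posterior above can be checked numerically. The following is a minimal sketch, assuming the signal only reaches X when Y lets it through, so that a block at the second stage has probability 0.99·ε. The function name `posterior_not_y` and the value of ε are illustrative assumptions, not part of the original setup. Under this assumption the exact value is 1/(1 + 99ε), which is 1 - 99ε to first order and which the post rounds to 1 - 100ε.

```python
from fractions import Fraction

def posterior_not_y(p_y_pass, eps):
    """P(not-Y | signal blocked), assuming X is only reached when Y passes."""
    p_not_y = 1 - p_y_pass      # blocked at the first stage
    p_not_x = p_y_pass * eps    # passed Y, then blocked at the second stage
    return p_not_y / (p_not_y + p_not_x)

eps = Fraction(1, 10**6)        # illustrative value only
post = posterior_not_y(Fraction(99, 100), eps)

print(float(post))              # exact posterior, 1 / (1 + 99*eps)
print(float(1 - 99 * eps))      # first-order approximation
```

For any tiny ε the two printed values agree closely, which is the point: a human who sees the signal blocked should attribute it almost entirely to ¬Y.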

And ¬Y has one chance in a hundred of happening, which is unlikely, but not so unlikely that humans would assume their whole model of reality is wrong. It is also likely enough that humans would give serious thought to what to do in the ¬Y case, maybe arranging various pre-commitments or insurance options, making it possible to arrange (with high probability) that humans don't just ignore the result and try again immediately.

Note that this method can't be used (obviously) if ¬X is something hideously dangerous (like an unleashed UFAI), but in all other cases, it seems implementable.

I am not at all sure why the humans wouldn't just turn the AI on again anyway if it were only 99% probable.

Anyway, this reminds me of an oracle system I devised for a fantasy story I never got around to writing: the Oracle doesn't always respond, and if they do respond, they tell you what would have happened if they hadn't responded. One of the rules I quickly had to make for the Oracle was that if they didn't say anything, you didn't get to ask again.

I thought (at the time, some time ago) that the Oracle, seeking to be most helpful, would soon converge on answering only around 2/3 - 4/5 of the time so that people wouldn't go and do stupid things in response to the extreme upset of not getting an answer.

I am not at all sure why the humans wouldn't just turn the AI on again anyway if it were only 99% probable.

That's a human institution problem, which seems more solvable (at least, we shouldn't run the AI if it isn't solved).
