Un-manipulable counterfactuals

Stuart_Armstrong


12th Feb 2015

AI Alignment Forum


This is how I design my counterfactuals: take some stochastic event that the AI cannot manipulate. This could be a (well-defined) chaotic process, the result of a past process that has been recorded but not yet revealed, or maybe something to do with the AI's own decisions, calibrated so that the AI cannot access the information.

Then I set up the world so that what we care about depends on that stochastic event. For instance, the output of an oracle is erased (before being read) depending on this event, or the AI's utility gets changed if one particular value comes up (in conjunction with something else).

I then define the counterfactual on the stochastic process. So if $X = 1$ implies that the AI's utility gets changed, then the counterfactual is simply $X = 0$. We can set the probability so that $X = 1$ is almost certain, but $X = 0$ is still possible.
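To make the construction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the probability of $X = 1$, the two actions, and both utility tables are hypothetical. The only point it carries over from the text is that $X$ is independent of the AI's action, so evaluating an action "conditional on $X = 0$" just selects the $X = 0$ branch.

```python
from fractions import Fraction

# Assumed numbers, for illustration only.
P_X1 = Fraction(999, 1000)  # X = 1 is almost certain
P_X0 = 1 - P_X1             # X = 0 remains possible

ACTIONS = ["a", "b"]

def original_utility(action):
    """Hypothetical utility the AI keeps when X = 0."""
    return {"a": 3, "b": 5}[action]

def changed_utility(action):
    """Hypothetical utility the AI is switched to when X = 1."""
    return {"a": 4, "b": 1}[action]

def actual_expected_utility(action):
    """Expectation over X; X is independent of the action (un-manipulable)."""
    return P_X1 * changed_utility(action) + P_X0 * original_utility(action)

def counterfactual_utility(action):
    """Value conditional on X = 0. Since no action can shift P(X),
    conditioning simply selects the X = 0 branch."""
    return original_utility(action)

best_counterfactual = max(ACTIONS, key=counterfactual_utility)
best_actual = max(ACTIONS, key=actual_expected_utility)
```

With these made-up tables, the action that is best under the counterfactual differs from the one that is best in expectation over the actual distribution of $X$, which is the gap the counterfactual is meant to expose.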

This seems to me the cleanest way of defining counterfactuals. Any obvious (or less obvious) problems?


paulfchristiano (11y)

As Nate pointed out, using these counterfactuals to make decisions gives you CDT. I posted an implementation of CDT using this idea a while ago. Here is a more general idea along these lines. It's also used here and here.

In general, conditioning on extremely low probability events can lead to strange beliefs (e.g. skeptical hypotheses), which places lower bounds on the safe probability of X=0. If you try to give the AI a really confident model of the world, to eliminate skeptical hypotheses, then stuff can get weird fast.
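A rough numerical illustration of the skeptical-hypothesis worry, using Bayes' rule. The prior on the skeptical hypothesis and both likelihoods below are invented for the example; the function name is hypothetical as well.

```python
from fractions import Fraction

def posterior_skeptical(prior_skeptical, p_x0_normal, p_x0_skeptical):
    """P(skeptical hypothesis | X = 0) by Bayes' rule."""
    numerator = prior_skeptical * p_x0_skeptical
    denominator = numerator + (1 - prior_skeptical) * p_x0_normal
    return numerator / denominator

# Assumed: a 1-in-1000 prior that the AI's world model is badly wrong,
# in which case X = 0 is as likely as a coin flip.
prior = Fraction(1, 1000)
p_x0_if_skeptical = Fraction(1, 2)

# X = 0 is extremely unlikely under the normal model.
rare = posterior_skeptical(prior, Fraction(1, 1_000_000), p_x0_if_skeptical)

# X = 0 is reasonably likely under the normal model.
likely = posterior_skeptical(prior, Fraction(1, 10), p_x0_if_skeptical)
```

Under these assumed numbers, conditioning on the one-in-a-million event pushes the skeptical hypothesis above 99%, while conditioning on the one-in-ten event leaves it below 1%.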

More seriously, the behavior of these agents is very sensitive to the behavior of humans in extremely improbable cases. For example, you proposed having a button that operates the explosives with probability 99.9999%, and an AI who erroneously believes that the explosives will never go off (or who just makes decisions based on what will happen if the explosives never go off; since the two systems have formally identical behavior, it doesn't seem worth distinguishing). But what do the humans actually do when the button that is supposed to work 99.9999% of the time doesn't? Without careful measures, they probably panic and turn off the machine anyway.

In general, it seems better to avoid conditioning on extremely unlikely events for both of these reasons. But conditioning on reasonably likely counterfactuals is a natural thing to do. And for some use cases, this may not be a problem.

So8res (11y)

Patrick and I discussed something like this at a previous MIRIx. I think the big problem is that (if I understand what you're suggesting) it basically just implements CDT.

For example, in Newcomb's problem, if X=1 implies Omega is correct and X=0 implies the agent won't necessarily act as predicted, and it acts conditioned on X=0, then it will two-box.
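A small sketch of that Newcomb example. The payoffs are the usual illustrative ones, and the modelling of X=0 (Omega's prediction is independent of the action, predicting one-boxing with an assumed probability `q_one_box`) is one possible reading, not necessarily the one intended above.

```python
BIG, SMALL = 1_000_000, 1_000  # usual illustrative Newcomb payoffs

def payoff(action, prediction):
    """Box B holds BIG only if Omega predicted one-boxing; box A always holds SMALL."""
    box_b = BIG if prediction == "one-box" else 0
    return box_b + (SMALL if action == "two-box" else 0)

def value_given_x(action, x, q_one_box=0.5):
    """Expected payoff of an action given X.

    X = 1: Omega is correct, so the prediction matches the action.
    X = 0: the prediction is independent of the action (assumed model).
    """
    if x == 1:
        return payoff(action, action)
    return (q_one_box * payoff(action, "one-box")
            + (1 - q_one_box) * payoff(action, "two-box"))

def choose(x, q_one_box=0.5):
    """Pick the action with the highest expected payoff given X."""
    return max(["one-box", "two-box"],
               key=lambda a: value_given_x(a, x, q_one_box))
```

In this model, `choose(0, q)` comes out as two-boxing for any `q`, since two-boxing gains `SMALL` whatever the fixed prediction is, whereas `choose(1)` one-boxes.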

Stuart_Armstrong (11y)

I'm not sure I understand this.

The example I was thinking of was this: instead of, e.g., conditioning on "the button wasn't pressed" in corrigibility, you have corrigibility implemented only if the button is pressed AND X=1. Then the counterfactual is just X=0.

Is there a CDT angle to that?

So8res (11y)

We might be talking about different things when we talk about counterfactuals. Let me be more explicit:

Say an agent is playing against a copy of itself on the prisoner's dilemma. It must evaluate what happens if it cooperates, and what happens if it defects. To do so, it needs to be able to predict what the world would look like "if it took action A". That prediction is what I call a "counterfactual", and it's not always obvious how to construct one. (In the counterfactual world where the agent defects, is the action of its copy also set to 'defect', or is it held constant?)
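The two ways of constructing the counterfactual mentioned in that parenthetical can be sketched as follows. The payoff numbers are the standard textbook prisoner's dilemma values, assumed here for illustration, and the function names are hypothetical.

```python
# My payoff, indexed by (my action, copy's action). Assumed standard values.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}

def best_action_copy_held_constant(copy_action):
    """Counterfactual in which the copy's action stays fixed while mine varies."""
    return max("CD", key=lambda a: PAYOFF[(a, copy_action)])

def best_action_copy_mirrors():
    """Counterfactual in which the copy's action is set to whatever I choose."""
    return max("CD", key=lambda a: PAYOFF[(a, a)])
```

Holding the copy's action constant recommends defection whichever action is held fixed, while letting the copy mirror the agent recommends cooperation, so the choice of counterfactual settles the decision.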

In this scenario, how do you use a stochastic event to "construct a counterfactual"? (I can think of some easy ways of doing this, some of which are essentially equivalent to using CDT, but I'm not quite sure which one you want to discuss.)

Stuart_Armstrong (11y)

I see why you think this gives CDT now! I didn't mean for this to be used for counterfactuals about the agent's own decision, but about an event (possibly a past event) that "could have" turned out some other way.

The example was to replace the "press" with something more unhackable.
