Information bottleneck for counterfactual corrigibility

The main possibility I see for solving this problem is to combine the "push the counterfactuals closer to the real world" non-solution with some sort of myopia to prevent it from actually manipulating human behavior through it. But myopia seems in general ill-behaved to me, so I think it may be a bad idea. This implies an intermediate step: try to make myopia better behaved.

In the general case of myopia, we might imagine taking the utility function and splitting it into two parts, $u (x, y)$ , where the first parameter represents the non-myopic influence of $π$ , and the second parameter represents the myopic influence of $π$ . Then the optimization procedure would be something like gradient ascent on $u (π, s t o p g r a d (π))$ , or have an outcome of a Nash equillibrium of $π = arg {max}_{p} u (p, π)$ .

For instance, in the example with the stop button, given a policy $π$ , we can do a rollout to see how people react to it. And then we can generate counterfactuals of people's behavior that increase or decrease their tendency to press its stop button.

As the behavior in general is a function of the policy $π$ , and the counterfactuals in this case are a function of the behavior, we can regard the counterfactuals as a function of the policy to; $s (π)$ represents the behavior that resembles how humans would behave in response to $π$ , but modified so they try to stop the AI, while $f (π)$ represents the same behavior, but modified to never try to stop the AI.

But importantly, we want to be myopic with respect to the above counterfactuals. So we express them using the second, myopic parameter $y$ , and express the behavior that we wish to optimize using the first parameter, $x$ . So we myopically optimize $u (x, y) = B (x)_{s (x)} + V (x)_{f (x)}$ .

The big problem I have with this is that myopia seems really unpredictable. A myopic optimization step might make the policy perform worse on the objective, and myopic optimization might oscillate endlessly. This doesn't seem like something we could usefully prove much about.

But I was made aware of a transformation we can do to the problem to perhaps make it better behaved. If we have some myopically optimal policy $π$ , it would satisfy something like $u (π, π) = {max}_{p} u (p, π)$ . We can then transform this to $u (π, π) - {max}_{p} u (p, π) = 0$ . Now notice that the LHS must necessarily be nonpositive regardless of $π$ , because $u (π, π) \leq {max}_{p} u (p, π)$ . So the equation for myopic optimality holds only if the function $u^{*} (π) = u (π, π) - {max}_{p} u (p, π)$ is maximized.

This seems somewhat less mathematically horrible than the stopgrad expression, or the recursive Nsh equillibrium equation. Though it still involves an inner max expression, so it's not entirely great. But maybe when reasoning about myopic optimization, we should use this transformation instead?

I'm not sure though, it might be the entirely wrong track. It would be nice to have a better understanding of what exactly $u^{*}$ is doing. One downside is that this genuinely changes the problem. We are no longer optimizing with respect to one variable only, as was intended with myopia. Rather, we're measuring the potential utility achieved with the myopic variable, and subtracting it off in the hope that it cancels out the optimization. But this seems highly reliant on the linear structure of utilities/loss in a way I don't like.

I'm currently planning on exploring this direction further, since I don't have any better ideas.

LESSWRONG
LW

LESSWRONG
LW

8

Information bottleneck for counterfactual corrigibility

8

8

Counterfactual stop button

Out of distribution

Meetup planning

What's so bad about OOD, anyway?

Really?

Conclusion