Orthogonality: action counterfactuals

It seems that some of the problems with orthogonality go away if you use the AI's actions to define the counterfactuals.

In the traditional model, "Press" meant that a button is pressed which compels the AI to change its utility $u \to v$, and the AI gets a reward $E(u \mid u ↛ v) - E(v \mid u \to v)$. The problem was that this could cause the AI to compel people to press the button.

Instead, I propose that $u \to v$ is an action that the AI can choose to take, and that "Press" allows the AI to take that action. Then it would get rewarded with $\max_{a} E(u \mid a \neq u \to v) - E(v \mid u \to v)$, calculated by the AI according to counterfactuals on its own actions (somehow; details left to the student to fill in).

We could add a small (or infinitesimal) $ϵ$ on top of that, so that the change $u \to v$ is the decision the AI actually reaches. This gives the AI a slight incentive to encourage "Press", but only as a tie-breaker.
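To spell out the tie-breaking step, here is a sketch using only the quantities already defined, and assuming (as the reward formula implicitly does) that $u$-value and $v$-value can be added on a common scale. If the AI takes the action $u \to v$, its total expected value is the value under $v$ plus the reward plus $ϵ$:

$$E(v \mid u \to v) + \Big[\max_{a} E(u \mid a \neq u \to v) - E(v \mid u \to v)\Big] + ϵ = \max_{a} E(u \mid a \neq u \to v) + ϵ.$$

If it takes any other action, the most it can expect is $\max_{a} E(u \mid a \neq u \to v)$. The $E(v \mid u \to v)$ terms cancel, so switching beats the best alternative by exactly $ϵ$, whatever $v$ is.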

Why is this immune to the usual attempts by the AI to cause (or deny) "Press"? Simply because if the AI decides not to take $u \to v$, it only loses an $ϵ$. Thus causing "Press" to happen (which opens up the option $u \to v$) gains it at most an $ϵ$.
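A toy numeric version of this argument may help. The sketch below is only an illustration: the action names, the numbers, and the function names are all hypothetical, and it assumes that the expected $u$-values of the ordinary actions do not themselves depend on whether "Press" happened. It does not address how the counterfactuals are actually computed.

```python
# Toy model of the action-counterfactual reward. Everything here
# (names, numbers, the fixed epsilon) is a hypothetical illustration.
EPSILON = 0.01

# E(u | a) for each ordinary action a, i.e. every action other than u -> v.
expected_u = {"build": 10.0, "wait": 7.0, "explore": 8.5}

# E(v | u -> v): expected value under the new utility v after switching.
expected_v_after_switch = 3.0


def switch_reward(expected_u, expected_v_after_switch, epsilon=EPSILON):
    """Reward for taking u -> v: max_a E(u | a != u->v) - E(v | u->v) + epsilon."""
    best_u = max(expected_u.values())
    return best_u - expected_v_after_switch + epsilon


def value_without_press(expected_u):
    """Best achievable value when u -> v is not an available action."""
    return max(expected_u.values())


def value_with_press(expected_u, expected_v_after_switch, epsilon=EPSILON):
    """Best achievable value once "Press" has made u -> v available."""
    stay = max(expected_u.values())
    switch = expected_v_after_switch + switch_reward(
        expected_u, expected_v_after_switch, epsilon
    )
    return max(stay, switch)


# How much the AI gains by making "Press" happen: just the tie-breaker.
gain_from_press = (
    value_with_press(expected_u, expected_v_after_switch)
    - value_without_press(expected_u)
)
```

Under these assumptions `gain_from_press` equals `EPSILON` up to floating-point rounding, and it stays there if `expected_v_after_switch` is changed, since that term cancels inside `value_with_press`.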
