Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

It seems that some of the problems with orthogonality go away if you use the AI's actions to define the counterfactuals.

In the traditional model, "Press" meant that a button is pressed which compels the AI to change its utility function, and the AI gets a reward when that happens. The problem was that this could lead the AI to compel people to press the button.

Instead, I propose that the utility change is an action that the AI can choose to take, and that "Press" merely allows the AI to take that action. The AI would then be rewarded with an amount calculated by the AI according to counterfactuals on its own actions (somehow; details left to the student to fill in).

We could add a small (or infinitesimal) ε on top of that reward, so that making the change is the decision the AI reaches. This gives the AI some incentive to encourage "Press", but only as a tie-breaker.

Why is this immune to the usual attempts by the AI to cause (or deny) "Press"? Simply because if the AI decided not to make the change, it would only lose an ε. Thus causing "Press" to happen (which opens up the option of making the change) will gain it at most an ε.
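
To make the tie-breaker argument concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration rather than part of the proposal: the names `eu_keep`, `ev_change` and `manipulation_cost` are hypothetical, and the reward is taken to be the usual indifference-style compensation (expected old utility if the AI keeps it, minus expected new utility if it changes), which the post leaves unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    # All fields are hypothetical quantities for this toy model.
    eu_keep: float    # expected old utility if the AI does not make the change
    ev_change: float  # expected new utility if the AI makes the change
    epsilon: float    # small tie-breaker bonus for making the change


def compensation(s: Scenario) -> float:
    """Assumed reward: makes the AI indifferent between changing and not."""
    return s.eu_keep - s.ev_change


def value_of_change(s: Scenario) -> float:
    """Total value to the AI of choosing the change, bonus included."""
    return s.ev_change + compensation(s) + s.epsilon


def best_value(s: Scenario, press_happened: bool) -> float:
    """Best achievable value; the change is only on the menu after "Press"."""
    options = [s.eu_keep]
    if press_happened:
        options.append(value_of_change(s))
    return max(options)


def gain_from_causing_press(s: Scenario, manipulation_cost: float = 0.0) -> float:
    """Net gain from making "Press" happen versus leaving it unpressed."""
    return best_value(s, True) - manipulation_cost - best_value(s, False)


if __name__ == "__main__":
    s = Scenario(eu_keep=10.0, ev_change=3.0, epsilon=0.01)
    # Free manipulation gains roughly epsilon; any larger cost makes it a loss.
    print(gain_from_causing_press(s))
    print(gain_from_causing_press(s, manipulation_cost=0.5))
```

Under these assumptions the gain from causing "Press" never exceeds ε, whatever the two expected utilities are, so any manipulation that costs the AI more than ε is not worth it.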
