Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Safe probability manipulation, superweapons, and stable self-improvement research

When I first discovered that even UDT would fail on certain "fair" problems (via "Evil" decision problems in provability logic), I was disappointed. It seemed to make things harder, in that there was no ideal decision theory, even on "fair" problems.

But then I realised I'd been handed a superweapon of extreme power. It's a tool that allows us to constrain an AI of arbitrary intelligence to hold the probabilistic beliefs that we desire.

For instance, take the "X" procedure that I used in the rather under-defined "counterfactual" post. We want the real probability of X=0 to be low, but for the AI to be unable to influence it (and, incidentally, unable to calculate it precisely).

Suppose that X could take the values 0, 1, etc... up to n. The AI runs a certain probability algorithm P. The procedure for calculating X is then to run P on each of X=0, X=1, etc... X=n, and output the m for which P(X=m) was lowest. In case of a tie, it outputs the highest such number.

It seems clear that the AI must set P(X=0)=P(X=1)=...=P(X=n)=1/(n+1); yet we know that X=n (since all the values are tied).
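A minimal sketch of this procedure, assuming the AI's probability algorithm P is exposed as a hypothetical function p(m) returning its probability that X=m (the names are mine, not part of the original construction):

```python
def choose_x(p, n):
    """Return the m in 0..n for which p(m) is lowest; ties go to the highest m."""
    probs = [p(m) for m in range(n + 1)]
    lowest = min(probs)
    # Scan from the top so that a tie resolves to the highest number.
    for m in range(n, -1, -1):
        if probs[m] == lowest:
            return m

# If the AI's beliefs are uniform, every value ties, so the procedure outputs n.
n = 10
uniform = lambda m: 1.0 / (n + 1)
print(choose_x(uniform, n))  # 10
```

Any deviation from uniformity is self-defeating for the AI's calibration: whichever value it assigns the lowest probability is exactly the value the procedure outputs.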

Basically, we can use this to make the AI unable to influence or estimate certain events, even if it's much smarter than us.

This might provide a partial answer to Paul Christiano's question about the value of researching stable self-improvement: this kind of trick is not easily stable under self-improvement (e.g. if the AI upgrades P to a P′ that can calculate things about P).

Similarly, my ideas about reduced impact require using the AI's predictive ability against the AI.

These seem to be cases where we can't just build a human-level AI and "let it sort out stable self-improvement". Here we've added constraints that the AI might "want" to get rid of, and potentially could. So it would be useful to know that these constraints are stable.