Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Benja had a post about trying to get predictors not to manipulate you. It involved a predictor that could predict tennis matches, but whose prediction, once read, could also influence the outcome of those matches.

To solve this, Benja imagined the actions of a hypothetical CDT-reasoning agent located on the Moon, unable to affect the outcome.

While his general approach is interesting, it seems that the specific problem has a much simpler solution: before the AI's message is output, it's run through a randomised scrubber that has a tiny chance of erasing it.

Then the predictor would try to maximise the expected correctness of its prediction, given that the scrubber erased its output (utility indifference can have a similar effect). In practice the scrubber would almost never trigger, so we would get accurate predictions, unaffected by our reading of them.
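To make the mechanics concrete, here is a minimal Python sketch of how such a conditioned predictor could be wired up. Everything in it (the `toy_world_model`, the outcome names, the probabilities) is a hypothetical stand-in for illustration, not part of Benja's original setup.

```python
import random

# Minimal sketch of the scrubber setup; all names and numbers here are
# hypothetical illustrations, not part of the original proposal.

ERASE_PROB = 1e-6  # tiny chance that the scrubber erases the output

def toy_world_model(message_shown):
    # P(outcome | what the humans read). If nobody reads a prediction,
    # Smith is the 70% favourite; a published prediction shifts the odds.
    if message_shown is None:
        return {"Smith": 0.7, "Jones": 0.3}
    if message_shown == "Smith":
        return {"Smith": 0.5, "Jones": 0.5}
    return {"Smith": 0.9, "Jones": 0.1}

def best_prediction(world_model, outcomes):
    # Maximise expected correctness *conditional on the scrubber having
    # erased the message*, so the prediction's effect on the match never
    # enters the optimisation.
    probs_if_erased = world_model(message_shown=None)
    return max(outcomes, key=lambda o: probs_if_erased[o])

def scrubber(message):
    # With a tiny probability, erase the message before anyone sees it.
    return None if random.random() < ERASE_PROB else message

prediction = best_prediction(toy_world_model, ["Smith", "Jones"])
print(scrubber(prediction))  # almost always "Smith", unaffected by readers
```

Because the predictor only optimises for the erased-output branch, the feedback loop from readers back to the match never appears in its objective, even though the scrubber almost never actually fires.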

Does it seem like this will work?


1 comment

I discussed this with Benja at a previous MIRIx workshop and I don't remember exactly what we concluded, but I think it mostly works; it just requires that people behave sensibly when they get scrubbed predictions.

Now that I think about it: to handle cases where people don't behave sensibly with scrubbed predictions, maybe we want some kind of sequence of oracles, where oracle 0 outputs nothing, and oracle n+1 outputs what would happen if it were replaced with oracle n. We could take the limit as n approaches infinity, but then we wouldn't know much about which fixed point we would get (it would be controlled by subtle feedback loops), so maybe we want something like n = 3 being most probable (although we would want to make n random between 0 and 3 so that it's meaningful to condition on n = 0, n = 1, and n = 2). A rough sketch of such a tower is below.
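Here is a rough Python sketch of that tower of oracles, using the same kind of hypothetical `toy_world_model` stand-in as the sketch in the post; the weights on n are only an illustration of "n = 3 most probable, with some mass on 0 to 2".

```python
import random

# Rough sketch of the "tower of oracles" idea; everything here is a
# hypothetical stand-in, as in the sketch in the post above.

def toy_world_model(message_shown):
    # P(outcome | what the humans read); None means the slot was empty.
    if message_shown is None:
        return {"Smith": 0.7, "Jones": 0.3}
    if message_shown == "Smith":
        return {"Smith": 0.5, "Jones": 0.5}
    return {"Smith": 0.9, "Jones": 0.1}

def oracle(n, world_model, outcomes):
    # Oracle 0 outputs nothing; oracle n+1 outputs its best prediction of
    # the world in which its output slot is filled by oracle n instead.
    if n == 0:
        return None
    lower_output = oracle(n - 1, world_model, outcomes)
    probs = world_model(message_shown=lower_output)
    return max(outcomes, key=lambda o: probs[o])

# Draw n at random with n = 3 most likely, so that conditioning on the
# shorter towers (n = 0, 1, 2) remains meaningful.
weights = {0: 0.01, 1: 0.01, 2: 0.01, 3: 0.97}
n = random.choices(list(weights), weights=list(weights.values()))[0]
print(f"n = {n}:", oracle(n, toy_world_model, ["Smith", "Jones"]))
```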