Benja had a post about trying to get predictors not to manipulate you. It involved a predictor that could predict tennis matches, but whose predictions could also affect the outcomes of those matches.
To solve this, Benja imagined the actions of a hypothetical CDT reasoning agent located on the Moon, unable to affect the outcome.
While his general approach is interesting, the specific problem seems to have a much simpler solution: before the AI's message is output, it is run through a randomised scrubber that has a tiny chance of erasing it.
Then the predictor would try to maximise the expected correctness of its prediction, conditional on the scrubber erasing its output (utility indifference can have a similar effect). In practice the scrubber would almost never trigger, so we would get accurate predictions, unaffected by our reading of them.
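A toy sketch of the difference this conditioning makes. All the numbers and names here are hypothetical: assume player A wins with probability 0.7 when no prediction is published, but a published prediction of "A wins" demoralises A and lowers that probability. A predictor scoring itself on the real, influenced world has an incentive to manipulate; one scoring itself only in the branch where the scrubber erased its output does not.

```python
# Hypothetical parameters for a match whose outcome a published
# prediction can sway.
BASE_PROB = 0.7    # chance A wins if no prediction reaches the public
INFLUENCE = -0.5   # publishing "A wins" shifts A's chance by this much

def win_prob(prediction_published: bool, predicted_a: bool) -> float:
    """Probability that A wins, given whether a prediction was published."""
    p = BASE_PROB
    if prediction_published and predicted_a:
        p += INFLUENCE  # the published prediction demoralises A
    return p

def counterfactual_score(predicted_a: bool) -> float:
    # Expected correctness conditional on the scrubber erasing the output,
    # i.e. evaluated in the branch where nobody reads the prediction.
    p_a = win_prob(prediction_published=False, predicted_a=predicted_a)
    return p_a if predicted_a else 1 - p_a

def naive_score(predicted_a: bool) -> float:
    # Expected correctness in the real, influenced world: this scoring
    # rule rewards predictions that reshape the match.
    p_a = win_prob(prediction_published=True, predicted_a=predicted_a)
    return p_a if predicted_a else 1 - p_a

best_counterfactual = max([True, False], key=counterfactual_score)
best_naive = max([True, False], key=naive_score)
print(best_counterfactual)  # True: "A wins" is accurate in the erased branch
print(best_naive)           # False: the naive scorer prefers the prediction
                            # that its own publication helps make true
```

The counterfactually-scored predictor reports the honest 70% favourite; the naively-scored one predicts the upset, because publishing "A wins" would make itself false.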
Does it seem like this would work?