[Edit: this isn't actually a spurious counterfactual.] The agent might reason "if I two-box, then either it's because I do something stupid (we can't rule this out for Löbian reasons, but we should be able to assign it arbitrarily low probability), or, much more likely, the predictor's reasoning is inconsistent. An inconsistent predictor would put $1M in box B no matter what my action is, so I can get $1,001,000 by two-boxing in this scenario. I am sufficiently confident in this model that my expected payoff conditional on me two-boxing is greater than $1M
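To make the arithmetic in that reasoning concrete, here is a minimal sketch. It rests on assumptions that are not stated above: that in the "something stupid" case the predictor is consistent, foresees the two-boxing, and leaves box B empty, so the agent gets only the $1,000 from box A; and that the two cases are exhaustive. The names `expected_two_box_payoff` and `p_stupid` are hypothetical, chosen for illustration.

```python
from fractions import Fraction

SMALL_BOX = 1_000      # box A, always contains $1,000
BIG_BOX = 1_000_000    # box B, contains $1M if the predictor fills it


def expected_two_box_payoff(p_stupid):
    """Expected payoff conditional on two-boxing, under the agent's model.

    p_stupid: probability that the two-boxing is due to the agent doing
    something stupid rather than the predictor being inconsistent.
    """
    # Assumption: a consistent predictor foresaw the two-boxing and left B empty.
    payoff_if_stupid = SMALL_BOX
    # From the text: an inconsistent predictor fills box B regardless of the action.
    payoff_if_inconsistent = SMALL_BOX + BIG_BOX
    return p_stupid * payoff_if_stupid + (1 - p_stupid) * payoff_if_inconsistent


# The expectation equals 1,001,000 - 1,000,000 * p_stupid, so it exceeds $1M
# exactly when p_stupid is below SMALL_BOX / BIG_BOX.
threshold = Fraction(SMALL_BOX, BIG_BOX)
```

Under these assumptions, the agent's claim that its expected payoff exceeds $1M holds whenever it assigns the "stupid" case a probability below 1/1000, which fits with that probability being made arbitrarily low.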

An approach to the Agent Simulates Predictor problem

by AlexMennen, 9th Apr 2016


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.