Suppose we amend ASP to require the agent to output a full simulation of the predictor before saying "one box" or "two boxes" (or else the agent gets no payoff at all). Would that defeat UDT variants that depend on stopping the agent before it overthinks the problem?

(Or instead of requiring the the agent to output the simulation, we could use the entire simulation, in some canonical form, as a cryptographic key to unlock an encrypted description of the problem itself. Prior to decrypting the description, the agent doesn't even know what the rules are; the agent is told in advance only that that decryption will reveal the rules.)

An approach to the Agent Simulates Predictor problem

by AlexMennen 1 min read9th Apr 2016No comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.