Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

*(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here.)*

*The next post in this sequence, 'Embedded Agency', will come out on Friday, November 2nd.*

*Tomorrow’s AI Alignment Forum sequences post will be 'What is Ambitious Value Learning?' in the sequence 'Value Learning'.*

Thank you for your explanation! I am still trying to understand it. I understand that there is no point in examining one's algorithm if you already execute it and see what it does.

I don't understand that point. You say "nothing stops you", but that is only possible if you could act contrary to your own algorithm, no? That makes no sense to me, unless the same algorithm gives different outcomes for different inputs, e.g. "if I simply run the algorithm, I take $10, but if I examine the algorithm before running it and then run it, I take $5". But that doesn't seem to be what you mean, so I am confused.

How can it be possible? If your examination of your algorithm is accurate, it gives the same outcome as mindlessly running it, which is taking $10, no?

So your reasoning is inaccurate, in that you arrive at a wrong conclusion about the algorithm's output, right? You just don't know where the error lies, or even that there is an error to begin with. But in that case you would also arrive at a wrong conclusion about the same algorithm run by a different agent, right? So there is nothing special about it being your own algorithm and not someone else's. If so, the issue reduces to finding an accurate analysis tool for an algorithm that demonstrably halts in a very short time, producing one of two possible outcomes. This seems to have little to do with decision theory, so I am lost as to how it is relevant to the situation.
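To make the question concrete, here is a minimal Python sketch of the picture I have in mind. Everything in it is an assumption for illustration: `agent` and `examine` are hypothetical names, the algorithm is assumed to be a plain deterministic "take the larger payoff" rule, and the "accurate analysis tool" is assumed to be nothing more than a simulation of that rule. It is not meant to represent the setup in the post, only my reading of it.

```python
def agent(options):
    """Hypothetical deterministic decision algorithm: take the largest payoff."""
    return max(options)


def examine(algorithm, options):
    """Toy 'accurate analysis tool' (an assumption): it simply simulates the algorithm."""
    return algorithm(options)


options = [5, 10]

# Mindlessly running the algorithm.
ran = agent(options)

# The agent examining its own algorithm before acting.
self_prediction = examine(agent, options)

# A different party examining the very same algorithm.
other_prediction = examine(agent, options)

print(ran, self_prediction, other_prediction)
```

In this sketch all three values agree, and it makes no difference who does the examining. If the real situation differs, I would like to understand which of these assumptions fails.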

I am clearly missing some of your logic here, but I still have no idea what the missing piece is, unless it's the libertarian free will thing, where one can act contrary to one's programming. Any further help would be greatly appreciated.