Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

*(A longer text-based version of this post is also available on MIRI's blog* *here, and the bibliography for the whole sequence can be found* *here.)*

*The next post in this sequence, 'Embedded Agency', will come out on Friday, November 2nd.*

*Tomorrow’s AI Alignment Forum sequences post will be 'What is Ambitious Value Learning?' in the sequence 'Value Learning'.*

I think I don't understand the Löb's theorem example.

If A=5⇒U=5∧A=10⇒U=0 is provable, then A=5, so it is true (because the statement about A=10 is vacuously true). Hence by Löb's theorem, it's provable, so we get A=5.

If A=10⇒U=0∧A=10⇒U=10 is provable, then it's true, for the dual reason. So by Löb, it's provable, so A=10.

The broader point about being unable to reason yourself out of a bad decision if your prior for your own decisions doesn't contain a "grain of truth" makes sense, but it's not clear we can show that the agent in this example will definitely get stuck on the bad decision - if anything, the above argument seems to show that the system has to be inconsistent! If that's true, I would guess that the source of this inconsistency is assuming the agent has sufficient reflective capacity to prove "If I can prove A=5, then A=5. Which would suggest learning the lesson that it's hard for agents to reason about their own behaviour with logical consistency.

(There was a LaTeX error in my comment, which made it totally illegible. But I think you managed to resolve my confusion anyway).

I see! It's not

provablethat Provable(A=10⇒U=10) implies A=10. It seems like it should be provable, but the obvious argument relies on the assumption that, if *A=10⇒U=0 is provable, then it's not also provable that A=10⇒U=10 - in other words, that the proof system is consistent! Which may betrue, but is notprovable.The asymmetry between 5 and 10 is that, to choose 5, we only need a proof that 5 is optimal, but to choose 10, w

... (read more)