I'm not sure if the 5/10 problem and the surrounding Löbian uncertainty is still an issue/area of research, but I've been struggling with the validity of this argument lately - I doubt this is a novel point so if this is addressed somewhere I'd appreciate being corrected. On the off chance that this hasn't been explained elsewhere I'd be really interested to hear peoples' thoughts
The proof as I understand it is roughly as below, with "A" referring to the agent's output, "U" referring to the environment's output, and "□" standing for the provability predicate:
- (A = 5) →(U= 5)
- (A = 5) →((A = 10)→(U=0))
- □(((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))) → (A = 5)
- □(((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))) → ((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))
- ((A = 5) → (U = 5)) ∧ ((A = 10) → (U = 0))
- A = 5
(I have probably butchered the parentheses sorry, I'm on mobile)
The key to this argument seems to be step 3, which is the step that seems flawed to me. Informally this is stating "If I prove that choosing $5 leads to $5 and choosing $10 leads to $0, then I will choose $5". This seems like a reasonable claim and is what allows the vacuously true conditionals that cause the whole Löbian troubles, but I don't see how an agent will be able to prove this. Step 3 seems like it can be interpreted in 2 ways, neither of which appear to work:
- This seems to me to be the more natural interpretation. The agent can prove (3.) because it can prove of itself that, when it finds a proof showing that it gains more utility on one path, it will halt its proof search and take the higher-utility path. So the agent reasons that, if "((A = 5) → (U = 5)) ∧ ((A = 10) → (U = 0))" is provable then it will eventually find this proof, halt, and output $5. But this halting aspect introduces a difficulty in that the agent must prove not only that the above is provable, but that no contradictory proof will be found earlier in the proof search. Even if the above is provable, if there is a proof earlier in the ordering of the statement "((A = 5) → (U = 0)) ∧ ((A = 10) → (U = 10))" (or any other statement of this form where $10 is valued higher than $5) then (3.) is false. The agent needs to be able to prove not only that the proof exists but that it will find this proof before it proves something else that halts it looking at proofs. The obvious way to guarantee this won't happen is for the agent to prove that if it any statement has a proof it doesn't have a shorter disproof, but of course it can't prove this. If the agent can't guarantee this won't happen, then the conditional seems to fail as they only assume the existence of the proof somewhere, and it could be gated behind other proofs which cause different actions.
- The agent can prove (3.) because it models itself as explicitly search for a proof of the specific statement "((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))". Therefore it doesn't matter if there are shorter contradictory proofs, the agent will just ignore them and keep going, only halting and returning an answer if it proves this specific statement. In this case it can indeed prove (3.), but then I'm not sure this is an interesting result anymore. We now have an agent which isn't looking for proofs about how to maximise utility and taking actions based on that, but instead has been hard-coded to look for a proof of a specific arbitrary statement, and only do something if it finds such a proof. In fact I think this agent can never output $10 - it'll either output $5 or run forever - any proof that it outputs $10 won't constitute a proof of "((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))" and therefore won't trigger a halt/decision.
It seems to me like the problem really arises from some ambiguity/sleight of hand around whether we're using "the agent proves P" to mean "P is provable in the theory T which represents the agent's output" or "the agent finds a proof of P in a specific computable search over the proofs of T", which might differ greatly depending on how the agent is set up to choose actions. On the most natural interpretation these two are very different, and the former doesn't imply the latter - the latter being what determines what the agent actually does. On the other (IMO less natural) interpretation, we have a pretty weird agent that's basically been doomed to take at most $5, so its behaviour isn't that strange