by acgt

# 6

I'm not sure if the 5/10 problem and the surrounding Löbian uncertainty is still an issue/area of research, but I've been struggling with the validity of this argument lately - I doubt this is a novel point so if this is addressed somewhere I'd appreciate being corrected. On the off chance that this hasn't been explained elsewhere I'd be really interested to hear peoples' thoughts

The proof as I understand it is roughly as below, with "A" referring to the agent's output, "U" referring to the environment's output, and "□" standing for the provability predicate:

1. (A = 5) →(U= 5)
2. (A = 5) →((A = 10)→(U=0))
3. □(((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))) → (A = 5)
4. □(((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))) → ((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))
5. ((A = 5) → (U = 5)) ∧ ((A = 10) → (U = 0))
6. A = 5

(I have probably butchered the parentheses sorry, I'm on mobile)

The key to this argument seems to be step 3, which is the step that seems flawed to me. Informally this is stating "If I prove that choosing \$5 leads to \$5 and choosing \$10 leads to \$0, then I will choose \$5". This seems like a reasonable claim and is what allows the vacuously true conditionals that cause the whole Löbian troubles, but I don't see how an agent will be able to prove this. Step 3 seems like it can be interpreted in 2 ways, neither of which appear to work:

1. This seems to me to be the more natural interpretation. The agent can prove (3.) because it can prove of itself that, when it finds a proof showing that it gains more utility on one path, it will halt its proof search and take the higher-utility path. So the agent reasons that, if "((A = 5) → (U = 5)) ∧ ((A = 10) → (U = 0))" is provable then it will eventually find this proof, halt, and output \$5. But this halting aspect introduces a difficulty in that the agent must prove not only that the above is provable, but that no contradictory proof will be found earlier in the proof search. Even if the above is provable, if there is a proof earlier in the ordering of the statement "((A = 5) → (U = 0)) ∧ ((A = 10) → (U = 10))" (or any other statement of this form where \$10 is valued higher than \$5) then (3.) is false. The agent needs to be able to prove not only that the proof exists but that it will find this proof before it proves something else that halts it looking at proofs. The obvious way to guarantee this won't happen is for the agent to prove that if it any statement has a proof it doesn't have a shorter disproof, but of course it can't prove this. If the agent can't guarantee this won't happen, then the conditional seems to fail as they only assume the existence of the proof somewhere, and it could be gated behind other proofs which cause different actions.
2. The agent can prove (3.) because it models itself as explicitly search for a proof of the specific statement "((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))". Therefore it doesn't matter if there are shorter contradictory proofs, the agent will just ignore them and keep going, only halting and returning an answer if it proves this specific statement. In this case it can indeed prove (3.), but then I'm not sure this is an interesting result anymore. We now have an agent which isn't looking for proofs about how to maximise utility and taking actions based on that, but instead has been hard-coded to look for a proof of a specific arbitrary statement, and only do something if it finds such a proof. In fact I think this agent can never output \$10 - it'll either output \$5 or run forever - any proof that it outputs \$10 won't constitute a proof of "((A = 5) → (U = 5) ∧ ((A = 10) → (U = 0))" and therefore won't trigger a halt/decision.

It seems to me like the problem really arises from some ambiguity/sleight of hand around whether we're using "the agent proves P" to mean "P is provable in the theory T which represents the agent's output" or "the agent finds a proof of P in a specific computable search over the proofs of T", which might differ greatly depending on how the agent is set up to choose actions. On the most natural interpretation these two are very different, and the former doesn't imply the latter - the latter being what determines what the agent actually does. On the other (IMO less natural) interpretation, we have a pretty weird agent that's basically been doomed to take at most \$5, so its behaviour isn't that strange

New Comment

The core of the 5-and-10 problem is not specific to a particular formalization or agent algorithm. It's fundametally the question of what's going on with agent's reasoning inside the 5 world. In the 10 world, agent's reasoning proceeds in a standard way, perhaps the agent considers both the 5 and 10 worlds, evaluates them, and decides to go with 10. But what might the agent be thinking in the 5 world, so that it ends up making that decision? And if the agent in the 10 world is considering the 5 world, what does the agent in the 10 world think about the thinking of the agent in the 5 world, and about what that implies in general?

How this happens is a test for decision making algorithms, as it might lead to a breakdown along the lines of the 5-and-10 problem, or to a breakdown of an informal model of how a particular algorithm works. The breakdown is not at all inevitable, and usually the test can't even be performed without changing the algorithm to make it possible, in which case we've intentionally broken the algorithm in an interesting way that might tell us something instructive.

In the post, what agent algorithm are you testing? Note that agent's actions are not the same thing as agent's knowledge of them. Proving A = 5 in a possibly inconsistent system is not the same thing as actually doing 5 (perhaps the algorithm explicitly says to do 10 upon proving A = 5, which is the chicken rule; there is no relevant typo in this parenthetical).

Yeah sure, like there's a logical counterfactual strand of the argument but that's not the topic I'm really addressing here - I find those a lot less convincing so my issue here is around the use of Lobian uncertainty specifically. There's an step very specific to this species of argument that proving that □P will make P true when P is about the outcomes of the bets, because you will act based on the proof of P.

This is invoking Lob's Theorem in a manner which is very different from the standard counterpossible principle of explosion stuff. And I'm really wanting to discuss that step specifically because I don't think it's valid, and if the above argument is still representative of at least a strand of relevant argument then I'd be grateful for some clarification on how (3.) is supposed to be provable by the agent, or how my subsequent points are invalid.