Troll Bridge

by abramdemski 1mo23rd Aug 201923 comments


Ω 22

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

All of the results in this post, and most of the informal observations/interpretations, are due to Sam Eisenstat.

Troll Bridge is a decision problem which has been floating around for a while, but which has lacked a good introductory post. The original post gives the essential example, but it lacks the "troll bridge" story, which (1) makes it hard to understand, since it is just stated in mathematical abstraction, and (2) makes it difficult to find if you search for "troll bridge".

The basic idea is that you want to cross a bridge. However, there is a troll who will blow up the bridge with you on it, if (and only if) you cross it "for a dumb reason" — for example, due to unsound logic. You can get to where you want to go by a worse path (through the stream). This path is better than being blown up, though.

We apply a Löbian proof to show not only that you choose not to cross, but furthermore, that your counterfactual reasoning is confident that the bridge would have blown up if you had crossed. This is supposed to be a counterexample to various proposed notions of counterfactual, and for various proposed decision theories.

The pseudocode for the environment (more specifically, the utility gained from the environment) is as follows:

IE, if the agent crosses the bridge and is inconsistent, then U=-10. (□⊥ means "PA proves an inconsistency".) Otherwise, if the agent crosses the bridge, U=+10. If neither of these (IE, the agent does not cross the bridge), U=0.

The pseudocode for the agent could be as follows:

This is a little more complicated, but the idea is supposed to be that you search for every "action implies utility" pair, and take the action for which you can prove the highest utility (with some tie-breaking procedure). Importantly, this is the kind of proof-based decision theory which eliminates spurious counterfactuals in 5-and-10 type problems. It isn't that easy to trip up with Löbian proofs. (Historical/terminological note: This decision theory was initially called MUDT, and is still sometimes referred to in that way. However, I now often call it proof-based decision theory, because it isn't centrally a UDT. "Modal DT" (MDT) would be reasonable, but the modal operator involved is the "provability" operator, so "proof-based DT" seems more direct.)

Now, the proof:

  • Reasoning within PA (ie, the logic of the agent):
    • Suppose the agent crosses.
      • Further suppose that the agent proves that crossing implies U=-10.
        • Examining the source code of the agent, because we're assuming the agent crosses, either PA proved that crossing implies U=+10, or it proved that crossing implies U=0.
        • So, either way, PA is inconsistent -- by way of 0=-10 or +10=-10.
        • So the troll actually blows up the bridge, and really, U=-10.
      • Therefore (popping out of the second assumption), if the agent proves that crossing implies U=-10, then in fact crossing implies U=-10.
      • By Löb's theorem, crossing really implies U=-10.
      • So (since we're still under the assumption that the agent crosses), U=-10.
    • So (popping out of the assumption that the agent crosses), the agent crossing implies U=-10.
  • Since we proved all of this in PA, the agent proves it, and proves no better utility in addition (unless PA is truly inconsistent). On the other hand, it will prove that not crossing gives it a safe U=0. So it will in fact not cross.

The paradoxical aspect of this example is not that the agent doesn't cross -- it makes sense that a proof-based agent can't cross a bridge whose safety is dependent on the agent's own logic being consistent, since proof-based agents can't know whether their logic is consistent. Rather, the point is that the agent's "counterfactual" reasoning looks crazy. Arguably, the agent should be uncertain of what happens if it crosses the bridge, rather than certain that the bridge would blow up. Furthermore, the agent is reasoning as if it can control whether PA is consistent, which is arguably wrong.

Analogy to Smoking Lesion

One interpretation of this thought-experiment is that it shows proof-based decision theory to be essentially a version of EDT, in that it has EDT-like behavior for Smoking Lesion. The analogy to Smoking Lesion is relatively strong:

  • An agent is at risk of having a significant internal issue. (In Smoking Lesion, it’s a medical issue. In Troll Bridge, it is logical inconsistency.)
  • The internal issue would bias the agent toward a particular action. (In Smoking Lesion, the agent smokes. In Troll Bridge, an inconsistent agent crosses the bridge.)
  • The internal issue also causes some imagined practical problem for the agent. (In Smoking Lesion, the lesion makes one more likely to get cancer. In Troll Bridge, the inconsistency would make the troll blow up the bridge.)
  • There is a chain of reasoning which combines these facts to stop the agent from taking the action. (In smoking lesion, EDT refuses to smoke due to the correlation with cancer. In Troll Bridge, the proof-based agent refuses to cross the bridge because of a Löbian proof that crossing the bridge leads to disaster.)
  • We intuitively find the conclusion nonsensical. (It seems the EDT agent should smoke; it seems the proof-based agent should not expect the bridge to explode.)

Indeed, the analogy to smoking lesion seems to strengthen the final point -- that the counterfactual reasoning is wrong.

I plan to revise this post at some point, because I'm leaving some stuff out. Planned addendums:

  • It might seem like the agent doesn't cross the bridge because of the severe risk aversion of proof-based DT. However, there is a version of the argument which works for agents with probabilistic reasoning, illustrating that the problem is not solved by assigning a low probability to the inconsistency of PA. Agents will not cross the bridge even if the cost of the explosion is extremely low, just slightly worse than not crossing. This makes the problem look much more severe, because it intuitively seems as if putting low probability on inconsistency should not only block the "weird" counterfactual, but actually allow an agent to cross the bridge.
  • Furthermore, there's a version which works for logical induction. This shows that the problem is not sensitive to the details of how one handles logical uncertainty, and indeed, is still present for the best theory of logical uncertainty we currently have.


Ω 22