# Ω 7

Personal Blog
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In an earlier discussion about the Troll Bridge problem, Abram mentioned a Lobian proof that a logical induction based agent would converge to not crossing the bridge.

It looked rather sketchy, though, and further discussion by Paul in the comments revealed that Troll Bridge might be passable.

(as a quick refresher, Troll Bridge is the decision problem where action 1, ie, attempting to cross the bridge, will result in crossing the bridge for a utility of 1, unless you tried to cross the bridge because your exploration step was active, in which case, the troll blows up the bridge, and you get a utility of 0. Refraining from attempting to cross the bridge, which is action 2, will give a utility of 0.5, more about it will be typed up soon.)

Results:

The variant of logical induction that Abram mentioned, where the probabilities jump to 100% as soon as something gets proved, probably converges to failure on Troll Bridge.

The vanilla variant of logical induction that was discussed in the Logical Induction paper has an unexpected and novel sort of Lobian loophole that appears to allow converging to failure on Troll Bridge, and also might be usable to introduce the spurious counterfactual problem from proof-based decision theory into the logical inductor decision theory setting. This same loophole also renders the "proof" that different logical inductors may converge to different actions in 5-and-10 (mentioned here) invalid, and I strongly suspect that sufficiently interesting proof orderings may be used to make any given logical inductor converge to taking either the 10 dollars or the 5 dollars, and vice-versa.

There's a new variant of logical induction I came up with (a minor and somewhat hacky variant) that allows proofs about "you could have an enforcing trader that makes large bets on getting certain outcomes" to go through, and in particular, it seems to allow victory on troll bridge if you set up the initial set of traders right. (modulo a few very reasonable assumptions, and one thing that I wasn't quite able to prove but seems extremely likely) It's a custom-designed and horribly hacky solution, and I'd expect the True Solution to involve (somehow) dropping exploration steps entirely, but the process of constructing it did turn up that unexpected loophole mentioned in the above paragraph, and also was good for training an ability to prove bounds about what a logical inductor does, so I might as well present it. Fair warning, the appendix involves a lot of digging into the fine details of the logical inductor algorithm.

Setup: is the 'th trade of the trader equipped with a "budget filter" of , against the market . It is defined as follows:

if: for some , and , then . Otherwise,

Also, is the total number of shares traded on that turn, it acts as an upper and lower bound on the value of the trade, as assessed by any possible world. You'll notice that this definition looks very similar to the one from the logical induction paper, but it's actually doing something rather different.

This particular modification to the budgeting gets around that. The agent is only allowed to move a certain number of shares at a time, corresponding to its score and its allowable budget, and past that, the trade will be scaled down. Now, this limit on the number of shares moved means that the resulting market is actually vulnerable to some poly-time traders that lay very rapidly-increasing bets on settled sentences to exploit the (narrowing, but too slowly) gap between what was proved and what the beliefs are. Well, if this modification to the budgeting leaves the market open to being exploited by a poly-time trader, then it can't be a logical inductor!

However, interestingly enough, literally every single theorem/good property in the logical induction paper can be proved with this modification on the budgeting. Why? Well, we only get inexploitability guarantees against traders that keep the magnitude of their trades smaller than their minimum possible wealth, plus some constant. Pretty much all the traders in the logical induction paper have this property, as an accidental side effect of making sure they didn't go unboundedly into debt. The -ROI lemma has this property, and this propagates forward to net every single theorem that calls upon the -ROI lemma. Theorem 4.5.10 also has this property, and those two are enough to immediately reprove all theorems outside of those in section G. The trading strategy in Lemma G.1.1 preserves this property as long as the magnitude of all the component traders is uniformly bounded (by 1, let's say), and the definition of all the other traders used in section G can be modified to both exploit the market (if it's exploitable), and only move at most 1 share per turn.

So, although this modification to the budgeted traders, and the resulting sequence of market states made by approximate fixed-pointing technically isn't a logical inductor, it has all the nice properties we proved about one.

Proof: The set of traders will be composed of an "enforcer" , and a "supertrader" defined in the usual way, and the market state will be arrived at by the usual fixed-point procedure. We will be interpreting the sequence of rounds as a game between and . The score of a player in this game is the magnitude of the largest trade they are theoretically able to make. The enforcer, equipped with a starting budget of dollars, is defined as follows:

is an indicator function that's 0 if n is before an "awakening day" and 1 afterward.

Let's go over some properties of . Because it's budgeted at , its starting score is . By the way the budgeting operator works, if it makes a trade, the score will drop by the worst-case value of that trade. Looking at it, what it does is start buying any components of that are priced below 51 cents, and it'll escalate to spending its entire starting endowment by the time the price on any of the components drop to 50.5 cents. Thus, if doesn't spend its entire budget or starting endowment, whichever is lower, on a turn, the price of all components will be 50.5 cents and above, so . There is some day by which forevermore, so, assuming that this trader never escalates to spending all its money, and "wakes up" after that day, as implemented by that indicator function , it will be able to ensure that the agent always attempts and succeeds at crossing the bridge (except on exploration steps).

So, now all that's left is showing that this trader will (with high probability), not escalate to spending all its money. Because we are taking an (approximate) fixed-point of and the supertrader , and at the fixed-point, the total number of shares bought and sold is (approximately) zero, the magnitude of the enforcer's trade is (approximately) equal to the magnitude of the supertrader's trade. (it's late at night and I'm not feeling like accounting for this and it almost certainly doesn't make a difference). As this trader will escalate to whatever level of trading is required to keep the expected value of crossing the bridge above .505, the supertrader can only force a loss/budget overdraw once its score/maximum allowable trade rises above that of the enforcers.

Now, the game is as follows. On turn , the enforcer and supertrader have scores of and respectively, the latter is by Lemma 1. (well, technically, this is an upper-bound on the score of the supertrader, but it'll do). Then, they lay their bets. Two things are important to note. One is that we'll be making a simplifying assumption where the value of the utility is completely uncertain until a proof of the utility gained on that day is observed in the theorem prover, this shouldn't change things too much. Thus, if someone spent 5 dollars on shares saying that the value of for varying values of k, the lowest possible value would be -5 and the highest possible value would be the number of purchased shares-5 before an observation of the value of came in, and that would immediately settle all the component bets. The second thing to note is that will be defined as the set of days and before where a trade was made that now has had its value revealed because an observation of the utility gained on that day has made it to the theorem prover on day , and it wasn't present on day n. It is then easy to verify that the change in score from day n to day n+1 is as follows.

Now, we'll impose one further key assumption. The proof-searching process must be either unbiased or biased in favor of crossing, ie, the sequence of proofs of the form "on day n, the bridge was crossed/not crossed" is a psuedorandom sequence with probability or less of the proof being "exploration step into the bridge occurred". It's possible to screw over the enforcer by deferring information of successful bridge-crossings to some really sparse subsequence, but making proofs of bridge-explosions denser, so this basically says that you get information about passing or not-passing the bridge in a way that isn't biased towards particular outcomes. Direct observation of the utility on day n immediately after the n'th trade is made and the n'th action is selected is a very easy way to get this property, but this generalization handles delays in feedback.

So, now that all these preliminaries are out of the way, we can see that the the reveal of theorems from day to day is always negative-expected-value for and positive-expected value for , and the most recent trade only locks up slightly more value from than from . Conditional on a particular trade being in , the expected value of the trade to (remember, we're assuming the pseudorandom sequence of exploration steps is truly random and taking an expectation over that) is bounded below by , and the expected value of the trade to is bounded above by . (we don't account for the trader exploring into remaining on its side of the bridge because the conditional contract that bought and sold has 0 value in that world) Also note that if settles the most recent trade, the same analysis applies to it, and if doesn't settle the most recent trade, then is at minimum and is at maximum .

This may seem quite complicated, but the essence of it is that these two traders are playing a game. The super-trader can sell shares in "the bridge will be succesfully crossed" for at most 51 cents. However, until the share has its value assessed, the super-trader is only allowed to sell a finite amount of them to the enforcer (because score is tied to the maximum number of shares that can be moved). Also, there's a chance that the share will turn out be worth 1 dollar, so this action is negative-expected value for the supertrader.

I'm not sure of the exact result required to prove this, but there are results under the name of "Gambler's Ruin" that any gambler playing a negative-expected-value game will eventually go broke regardless of strategy. I'm extremely confident that there's a similar result that any gambler playing a negative-expected-value game against another player with a sufficiently larger bankroll will (with very high probability) fail to ever attain the same wealth as the other player.

Thus, we have showed that, assuming unbiased finding of theorems, and assuming the gambler's ruin conjecture above goes through, there's a value of that can be assigned to the enforcer to ensure with arbitrarily high "probability" that the bridge will be crossed forevermore after the enforcer wakes up, except on exploration steps.

Lemma 1

The maximum magnitude trade that can make is bounded above by .

Alright, time for some definitions. As a quick recap:

if: for some , and , then . Otherwise,

The budgeted version of a trader on day n is , and the resulting trade they make is . This is defined as

is some probability measure over such that is finite. A popular choice is the inverse exponential base 2, as was used in the original logical induction paper. This makes , but it generalizes. Pretty much, the budgeted version of a trader is when you take a trader and weighted-sum together every possible budget level "filter" on them.

The score of a budgeted trader on day n is . The "debt level" of a budgeted trader on day n is the largest budget such that all traders with a smaller budget go bankrupt.

So, to begin with, note that

By the definition of ,

Now, we can split it up into two cases. In case one, for all ,

And thus, by the definition of . In the second case, there is a which minimizes to less than , thus by the definition of , we get

Therefore, , and we get the bound

There's one piece , which looks as if it could be unbounded although it has a dependence on the minimum wealth of the trader, and another piece which is bounded by a constant by the definition of . The first term will now be bounded.

The supertrader works by finding a fixed point over . This particular proof will work in a very similar way as Vadim's earlier proof of a bound relating and for any trader in a logical inductor. The only differences are that we will explicitly account for the imperfect fixed-point finding, and also there's a second component that isn't a supertrader so we need to account for the flow of shares and cash in and out.

The argument for the bound is fairly simple. In the fixed point, the number of shares traded on each side of the trade must cancel out to (approximately) 0, unless the probability/price is 0 or 1. For all the worlds, the winners are balanced out by losses elsewhere. Also, by the way the budgeting works, there's a finite amount of points that can be earned from driving a particular opponent unboundedly into debt. Call the trade made by the entire logical inductor on day n . It is defined by: (note that is a probability distribution over all turing machines)

To begin with, because of the approximate fixed-poitn finding of the logical induction algorithm,

Now, given an arbitrary n and an arbitrary world, we can divide the traders into two groups, which has positive value as assessed by the world, and which has negative value as assessed by the world.

Fix a particular day , and consider the world in which maximizes . In short, we are assuming that the minimum possible amount of value has been given away by the enforcer. By the above inequality,

That third term is negative, so making it as negative as possible (this corresponds to all the losing traders going infinitely into debt), gives an upper bound on the value of all the winning traders. The budgeted traders cannot go infinitely into debt, because they shrink down their trading magnitude each time they lose value. The maximum debt a trader can have is , which we already established was a constant. Thus,

Seperately,

and

By putting all these inequalities together, we get

As a quick reminder,

Now, we will put all this together to bound the number of trades that a logical inductor can make on a given turn.

As we can see, the terms with are constants by definition. The term with on the left was bounded above by 1-+a constant. The term with on the right is negative because of how we defined the set .

By setting to , we get

# Ω 7

Mentioned in
New Comment

Thanks for this post, but I don't feel like I have the background for understanding the point you're making. In the pingback post, Demski describes your point as saying:

an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof.

Could you offer a high level explanation of what the main principle here is and what Demski means by looseness? (If such an explanation exists. )Thanks.

The key piece that makes any Lobian proof tick is the "proof of X implies X" part. For Troll Bridge, X is "crossing implies bridge explodes".

For standard logical inductors, that Lobian implication holds because, if a proof of X showed up, every trader betting in favor of X would get free money, so there could be a trader that just names a really really big bet in favor of the X (it's proved, after all), the agent ends up believing X, and so doesn't cross, and so crossing implies bridge explodes.

For this particular variant of a logical inductor, there's an upper limit on the number of bets a trader is able to make, and this can possibly render the statement "if a proof of X showed up, the agent would believe X" false. And so, the key piece of the Lobian proof fails, and the agent happily crosses the bridge with no issue, because it would disbelieve a proof of bridge explosion if it saw it (and so the proof does not show up in the first place).