Back in my first post about ADT, I conjectured that might achieve sublinear regret relative to argmaxing for the true embedder, if some extra condition was imposed.
In the form of stated in that post, the game of chicken is a counterexample to Conjecture 1. In short, the "delusion" embedder (where the opponent is an identical agent) has argmax converge to going straight (because it will, in expectation, yield utility, while other strategies yield less utility.) Further, if the agent uses the delusion embedder less than of the time, then will start being greater than (where is the embedder that plugs the chosen action into both your action and your opponent's action), so it will start using the delusion embedder more often. Therefore, of the time, the agent rams into itself because it uses the delusion embedder.
Now, the reason why this embedder can't be removed is interesting. We can consider the agent to be reasoning as follows: "I don't know whether I'll pick the delusion embedder or the copy embedder. There's a chance that an agent will pick the copy embedder , in which case will score highly, because will just go straight, and picks swerving with probability. Also, is around , because about of the time, my foe picks the copy embedder and so do I, and about of the time, my foe picks the delusion embedder and so do I, and we ram into each other."
However, note that (approximately ) systematically deviates from (approximately ). This is pretty much Abram's criterion for CDT and EDT diverging. Plugging an action into the embedder tends to give a different result than plugging yourself into the embedder and conditioning on you taking that action. The second quantity is a more accurate estimate of the utility you can expect to get. However, you can't just use it in place of because the conditional may give nonsensical answers if you almost certainly don't take the action, because conditionals on improbable events are less accurate. To rephrase this, using the conditional is vulnerable to the futarchy hack, where a trader with a lot of wealth bets on conditional contracts driving down the value of , so the agent doesn't use embedder , so dollars are lost on the conditional contracts, and the wealthy trader can repeat the same move forevermore and unilaterally prevent ever trying out embedder .
Is there some way to address this? Yes. As stated in the comments of the last post on ADT, it is possible to improve the reality filter by switching from to (using as an abbreviation for , because this means that embedder was chosen)
In short, the conditionals may be inaccurate, but if they are inaccurate, then must be low, so the expected values being inaccurate is compensated for by being multiplied by a number close to zero.
It is possible to use the same sort of trick on the divergence between plugging in an action and plugging in yourself and conditioning on an action, and require
for the embedder to be considered. This is a filter. Note that this condition isn't a hard constraint on all embedders, it just applies to embedders that are chosen with non-negligible probability. If a particular embedder is definitely not taken, the multiplication by a near-zero term will make the embedder pass the filter, and (the utility) will be used as the utility for the embedder, since there's no conditional utility to defer to. However, if an embedder has non-negligible probability of being taken, and passes this filter, then , which is pretty much equivalent to using the utility.
With the improved reality-filter, and the filter, we can arrive at a proof of Conjecture 1, modulo some extra conditions.
First, we must assume that the hypothesis space contains a "true environment" , which is defined as follows. First, it passes the reality-filter (ie, the reality-filter converges to ). Second, it passes the filter (that converges to ). Note that if the series of utilities seen is acquired by feeding with a uniformly random bitstring on each turn, it will pass the reality-filter by an adaptation of Theorem 1 from the last post, by Kelly-betting on the exploitable difference between and , where the timely appearance of and the actual action in the deductive process lets the conditional bets be resolved quickly. The second condition isn't acquired by default, but it appears that in the known toy examples games where , there is another environment that is equally accurate where the choice of action influences the opponent and in that environment.
Second, it's possible that the environment could be a perverse environment with clustering behavior, so doesn't limit to , because feedback on what is may be quite sparse. Therefore, we have to assume that the true environment (whatever it may happen to be) is sufficiently "nice", in the sense that the time-average of the difference between the expectation of and itself limits to zero.
The third condition is very similar, that limits to zero. It's more plausible than the first assumption, because (an interval containing) the true value of appears in the deductive process in a timely manner by assumption, but I haven't found a way to implement Kelly-betting to exploit this, because the traders don't necessarily have access to the deductive process, or know their own worst-case value. It should be noted that the traders not knowing their own worst-case value is the property that necessitates emulating the entire logical inductor to implement budgeting, in the key -ROI lemma of the logical induction paper. Therefore, it makes sense to stipulate that traders are given their own worst-case value as an advice string, to streamline budgeting, and if the logical inductor is augmented in this way, this condition is attained for free via a Kelly-betting argument.
We will get a necessary and simple lemma out of the way first.
Lemma 1: When there is a finite set of sequences of events , which are mutually exclusive, and a sequence of LUV's , then
Proof: By the definition of , this is equivalent to the statement that , where the brackets are the Iverson brackets, which is just the indicator function that is if the statement is true, and if the statement is false.
Now, notice that the bounded combination sequences and have provably equal values for all , so by Theorem 4.5.4 in the logical induction paper, Affine Provability Induction, the prices on these sequences will limit to being equal. By the definition of expected value, the prices on these sequences are the prices of and , respectively.
Now to prove sublinear regret.
Theorem 1: If there is an embedder in the hypothesis class of an agent which is up against an environment such that
1: (utility is learned)
2: (argmax behavior of is learned)
( passes the reality filter)
( passes the filter),
then the agent on the true environment has sublinear regret relative to the argmax agent on , ie, .
Proof: To begin with, by applying assumptions 1 and 2, we only need to prove that . In fact, we will prove a stronger result, that (which immediately implies the desired result).
Assume this is false, so infinitely often for some . Also, let be the set of embedders with greater than probability of being chosen on turn . By Lemma 1,
In case 1, that clearly fails the filters (ie, one of the two filters has the quantity inside of the filter be above or , as the case may be). Also . Then in this case, it is possible for a trader to bet against being selected, because clearly failing a filter can be picked out by a continuous indicator function. won't be selected because its value is , due to failing both filters, and passes both filters by conditions 3 and 4, and claims that it can attain value bounded away from . This bet allows the trader to pick up a free dollars or more (due to the probability of being selected being bounded away from due to being in ). Because this is a logical inductor, this case can only occur finitely many times.
In case 2, that clearly fails a filter, also . In that case the regret is at most . Therefore, all but finitely many of the times where occur in case 3.
Now, let be the set of plausible environments that don't clearly fail a filter. In case 3, these two sets are equal. By the reality-filters not being clearly failed, and the reality-filters enforcing the approximate equality of and , for all and limiting to , we get that
Then, by the filters not being clearly failed, which enforce the approximate equality of and , we get that
Now, by conditions 3 and 4, the filters for must limit to , so on all but finitely many turns, for all , (otherwise it is possible to bet against being chosen and get free money by the argument from case 1). Thus, we get that
Because we are in case 3 where , and the probabilities of the embedders being selected approximately add up to , and the total probability mass outside of is at most by the definition of , we get that
And putting it all together, we get that , which contradicts the assumption that infinitely often.
Therefore, holds, which implies the theorem by assumptions 1 and 2.
Additional Discussion and Generalizing to Expanding Hypothesis Spaces:
There's an interesting feature of this. In the proof, only part of the reality-filter was actually used, and the part that was used can be combined with the filter to yield a single filter that suffices to prove sublinear regret on the true environment, which is:
However, this doesn't enforce that embedders accurately predict what happens when other embedders are selected by , so it glosses over important information which can be used to rule out an embedder.
The obvious next question is whether it's possible to generalize the proof to a space of hypotheses that expands over time. We might hope to get a result that, for all computable environments that are "sufficiently nice", there is an agent with a sufficiently slow-growing hypothesis space that will eventually learn to act in accordance with the true environment. This runs into several obstacles.
Obstacle 1: The proof of Lemma 1 fails because the batch of statements of the form stops being a bounded combination sequence (because, all in all, the combination is composed of shares in total, which isn't bounded when grows over time).
Obstacle 2: The argument for Case 1 occuring finitely often (that failure allows you to pump free money out and if this recurs infinitely often you get infinite money) fails, because the amount of money you can pump out is , but when grows over time, the money you can pump out lessens over time, in a way that may force convergence. I suspect it's possible to address this by shooting for a proof that even when the hypothesis space grows over time, it's possible to exploit the market unless the total probability mass on embedders that clearly fail their filters limits to zero.
Obstacle 3: The asymptotic equality linking and occurs because the difference between those two is bounded by , which limits to zero because and limit to zero. When grows over time, it requires and to shrink faster to get this to work out. So if you're adding one new hypothesis each turn, then and must shrink faster than . And if and shrink too fast, then the true environment can't pass the reality filter either. So this means can't grow too fast.
Obstacle 4: This appears to be the largest obstacle. It's really hard to guarantee that the true environment passes the reality-filter. Even with a finite hypothesis space, the reality-filter "score" does limit to zero (although I haven't checked the proof particularly well), but going for any result along the lines of "this rate of approaching zero/level of strictness for the reality-filter is passed by the true environment" causes the proof to fail. It might even be uncomputably slow for all I know. As far as I can tell, this problem gets even worse when the size of the hypothesis class grows over time, and it might cause a failure of the original proof of the reality-filter "score" limiting to zero. It seems like the hard part is being shoved into the true environment being able to pass the filters in the first place. And limits on how fast can converge to zero translate directly into limits on how fast can grow by the third obstacle.