Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Asymptotic Decision Theory (Improved Writeup)


OK, I checked this more and I'm more suspicious now.

First: in the ADT paper, the asymptotic dominance argument is about the *limit* of the agent's action as epsilon goes to 0. This limit is not necessarily computable, so the embedder can't contain the agent, since it doesn't know epsilon. So the evil problem doesn't work. The optimality proof might be valid. I didn't understand which specific step you thought was wrong.

Second: This also means the chicken problem is ill-typed, since you can't put an ADT in the environment. But there is a well-typed version where you give the same epsilon to both ADT agents, and the embedders have epsilon hard-coded.

Consider the following embedder. According to this embedder, you will play chicken against ADT-epsilon who knows who you are. When ADT-epsilon considers this embedder, it will always pass the reality filter, since in fact ADT-epsilon is playing against ADT-epsilon. Furthermore, this embedder gives NeverSwerveBot a high utility. So ADT-epsilon expects a high utility from this embedder, through NeverSwerveBot, and it never swerves.

in the ADT paper, the asymptotic dominance argument is about the *limit* of the agent's action as epsilon goes to 0. This limit is not necessarily computable, so the embedder can't contain the agent, since it doesn't know epsilon. So the evil problem doesn't work.

Agreed that the evil problem doesn't work for the original ADT paper. In the original ADT paper, the agents are allowed to output distributions over moves. I didn't like this because it implicitly assumes that it's possible for the agent to perfectly randomize, and I think randomization is better modeled by a (deterministic) action that consults an environmental random-number generator, which may be correlated with other things.

What I meant was that, in the version of argmax that I set up, if the set of agents is the two constant policies "take blank box" and "take shiny box", then for the embedder where the opponent runs argmax to select which box to fill, the argmax agent will converge to deterministically randomizing between the two policies, because the logical inductor assigns very similar expected utility to both options, so that the inductor can't predict which action will be chosen. This occurs because whenever the inductor predicts more of "take the blank box", the other option converges to a higher expected value (so argmax will learn to copy that), and whenever the inductor predicts more of "take the shiny box", the roles are reversed (so argmax will learn to copy that instead).

The optimality proof might be valid. I didn't understand which specific step you thought was wrong.

So, the original statement in the paper was

It must then be the case that for every . Let be the first element of in . Since every class will be seperated by at least in the limit, will eventually be a distribution over just . And since for every , , by the definition of it must be the case that .

The issue with this is the last sentence. It's basically saying "since the two actions A and A′ get equal expected utility in the limit, the total variation distance between a distribution over the two actions, and one of the actions, limits to zero", which is false.

And it is specifically disproved by the second counterexample, where there are two actions that both result in 1 utility, so they're both in the same equivalence class, but a probabilistic mixture between them (as converges to playing, for all ) gets less than 1 utility.

Consider the following embedder. According to this embedder, you will play chicken against ADT-epsilon who knows who you are. When ADT-epsilon considers this embedder, it will always pass the reality filter, since in fact ADT-epsilon is playing against ADT-epsilon. Furthermore, this embedder gives NeverSwerveBot a high utility. So ADT-epsilon expects a high utility from this embedder, through NeverSwerveBot, and it never swerves.

You'll have to be more specific about "who knows what you are". If it unpacks as "opponent only uses the embedder where it is up against [whatever policy you plugged in]", then NeverSwerveBot will have a high utility, but it will get knocked down by the reality filter, because if you converge to never swerving, will converge to 0, and the inductor will learn that so it will converge to assigning equal expected value to and, and converges to 1.

If it unpacks as "opponent is ADT-epsilon", and you converge to never swerving, then argmaxing will start duplicating the swerve strategy instead of going straight. In both cases, the argument fails.

In the original ADT paper, the agents are allowed to output distributions over moves.

The fact that we take the limit as epsilon goes to 0 means the evil problem can't be constructed, even if randomization is not allowed. (The proof in the ADT paper doesn't work, but that doesn't mean something like it couldn't possibly work)

It's basically saying "since the two actions A and A′ get equal expected utility in the limit, the total variation distance between a distribution over the two actions, and one of the actions, limits to zero", which is false

You're right, this is an error in the proof, good catch.

Re chicken: The interpretation of the embedder that I meant is "opponent only uses the embedder where it is up against [whatever policy you plugged in]". This embedder does not get knocked down by the reality filter. Let be the embedder. The logical inductor expects to equal the crash/crash utility, and it also expects to equal the crash/crash utility. The expressions and are provably equal, so of course the logical inductor expects them to be the same, and the reality check passes.

The error in your argument is that you are embedding actions rather than agents. The fact that NeverSwerveBot and ADT both provably always take the straight action does not mean the embedder assigns them equal utilities.

Wasn't there a fairness/continuity condition in the original ADT paper that if there were two "agents" that converged to always taking the same action, then the embedder would assign them the same value? (more specifically, if , then ) This would mean that it'd be impossible to have be low while is high, so the argument still goes through.

Although, after this whole line of discussion, I'm realizing that there are enough substantial differences between the original formulation of ADT and the thing I wrote up that I should probably clean up this post a bit and clarify more about what's different in the two formulations. Thanks for that.

Yes, the continuity condition on embedders in the ADT paper would eliminate the embedder I meant. Which means the answer might depend on whether ADT considers discontinuous embedders. (The importance of the continuity condition is that it is used in the optimality proof; the optimality proof can't apply to chicken for this reason).

The most prominent undesirable feature of this is that it's restricted to a finite set of embedders. Optimistic choice fails very badly on an infinite set of embedders, because we can consider an infinite sequence of embedders that are like "pressing the button dispenses 1 utility forever", "pressing the button delivers an electric shock the first time, and then dispenses 1 utility forever"... "pressing the button delivers an electric shock for the first n times, and then dispenses 1 utility forever"... "pressing the button just always delivers an electric shock". Optimistic choice among embedders keeps pressing the button, because, although it keeps getting shocked, there's always an embedder that promises that the agent's fortunes will turn around on the next turn.

Seems like infinite bandit algorithms should work here? Basically just give more complex embedders some regularization penalty.

I got an improved reality-filter that blocks a certain class of environments that lead conjecture 1 to fail, although it isn't enough to deal with the provided chicken example and lead to a proof of conjecture 1. (the subscripts will be suppressed for clarity)

Instead of the reality-filter for being

it is now

This doesn't just check whether reality is recovered on average, it also checks whether all the "plausible conditionals" line up as well. Some of the conditionals may not be well-formed, as there may be conditioning on low-or-zero probability events, but these are then multiplied by a very small number, so no harm is done.

This has the nice property that for all "plausibly chosen embedders" that have a probability sufficiently far away from 0, all embedders and that pass this reality filter have the property that

So all embedders that pass the reality filter will agree on the expected utility of selecting a particular embedder that isn't very unlikely to be selected.

ADT (asymptotic decision theory) was an initial attempt at decision theory with logical inductors before the standard form (that has exploration steps), which is detailed in this post.

I'm confused by this sentence. First, what is the standard form? ADT was definitely invented after logical EDT with exploration (the thing you link to). Second, why do you link to a post on logical EDT and not to the ADT paper?

I think I remember the original ADT paper showing up on agent foundations forum before a writeup on logical EDT with exploration, and my impression of which came first was affected by that. Also, the "this is detailed in this post" was referring to logical EDT for exploration. I'll edit for clarity.

OK, I helped invent ADT so I know it conceptually came after. (I don't think it was "shortly after"; logical EDT was invented very shortly after logical inductors, in early 2016, and ADT was in late 2016). I think you should link to the ADT paper in the intro section so people know what you're talking about.

This does not pay up on counterfactual blackmail.

What's counterfactual blackmail?

EDIT: if you meant counterfactual mugging, I think one way to solve this is to use a low amount of computation power to select which agent to emulate, then use a high amount of computation power to run that agent. Of course, this is somewhat unsatisfying, since there isn't a canonical way of choosing how much less computation power to use.

ADT (asymptotic decision theory, initially detailed in this paper) was a proposed decision theory with logical inductors, developed after EDT with logical inductors and exploration steps. It is a possible candidate for conceptual progress in decision theory, but some basic questions about its performance are still unsettled, and there are several issues with the current implementation of it.

Definitions:

A *market* is a computable function with type $S \to [0,1]$ (where $S$ is the set of sentences of math) that will be denoted as $\mathbb{P}$. We will be considering logical inductors in this setting, and in particular, each finite stage of a logical inductor is a market.

Let $A$ be a finite set of actions. An *agent* is a computable sequence of computable functions of type $\mathbb{P} \to A$. It takes a timestep and a market, and selects an action. Because a logical inductor associates each timestep with a market, specifying the logical inductor and the agent specifies a sequence of actions. Because we will be assuming some fixed logical inductor in the background, we will suppress it in the notation and refer to the action produced by an agent at time $t$ as $a_t$ or $b_t$. Randomization can be handled by using the diagonalization sentences used to define exploration (this acts in such a way that no trader in the inductor is able to reliably predict what the agent does), or it can be handled by having one of the actions be to call a random number generator in the environment.

An *embedder* is a sequence of randomized functions of type $A \to [0,1]$ (denoted by $E$ or $F$, with the function at time $t$ being denoted by $E_t$ or $F_t$). Let $E_t(a_t)$ denote the random variable that corresponds to the environment using a uniform random distribution of bitstrings for its randomness tape. An embedder must have the probability distribution of $E_t(a_t)$ being computably approximable on computable inputs.

An *argmax agent* is an agent that takes a finite set of other agents $\mathcal{A}$, and a single embedder $E$, and outputs $\left(\operatorname{argmax}_{a \in \mathcal{A}} \mathbb{E}_t(E_t(a_t))\right)_t$ on turn $t$. $\mathbb{E}_t$ is defined as it usually is for logical inductors. Notice that, although $E_t$ may be very difficult to compute, putting it inside an expectation allows a logical inductor to output a decent guess as to what it is anyways. The agents all had to be computable in order for the argmax agent to duplicate their behavior at arbitrary turns. There is a dependence on the set $\mathcal{A}$, the embedder $E$, and the logical inductor, but because the logical inductor and $\mathcal{A}$ will be treated as fixed, we will write the time-$t$ action produced by the argmax agent as $am^E_t$.

$U$ will denote the "true environment", which is a sequence of values in $[0,1]$.

Finally, the logical inductor will be required to have "fast feedback". This means that after each turn, the following information will show up in the deductive process at time t+1.

1: (an interval containing) the true value of $U_t$.

The deductive process is the sequence of theorems that the logical inductor sees. The reason that the true utility has to be in an interval, instead of reported directly, is that the true utility may have arbitrarily many digits, which hampers the ability of traders to use that information to judge bets on how $U$ will turn out. This condition is present to ensure that, if $U$ is generated by taking $E$ and feeding it a uniform random bitstring, $\mathbb{E}_t(E_t(ADT_t)) \simeq_t \mathbb{E}_t(U_t)$.

Theorem 1: *If $U$ is generated on each turn by running $E_t(ADT_t)$ with a bitstring sampled uniformly at random, then $\mathbb{E}_t(E_t(ADT_t)) \simeq_t \mathbb{E}_t(U_t)$.*

We will apply the basic trading strategy from Theorem 4.5.9 in the logical induction paper. In this case, if the left-hand side is overpriced by $\epsilon$ infinitely often (by a failure of convergence; the same argument can be applied symmetrically to the right-hand side being overpriced infinitely often), the traders buy a fraction of a share of $U$ and sell a fraction of a share of $E_t(ADT_t)$, and keep doing it until 1 share has been moved in total. This takes care of having $\mathbb{P}$-generable trade magnitudes. According to the law of large numbers, with probability 1, there is a finite time after which every sub-trader's pile of shares has a value within $\frac{\epsilon}{2}$ of 0, so the initial traders in the efficiently emulatable sequence of traders can be clipped off, and all the other sub-traders get $\epsilon$ or more value from selling the high-priced expectation and buying the low-priced one, so the necessary conditions for the $\epsilon$-ROI Lemma are fulfilled and the resulting trader exploits the market.

Finally, we will define dominance. An agent a dominates an agent b on an embedder E if

$\limsup_{t \to \infty} \frac{1}{t} \sum_{i=1}^{t} \left(E_i(b) - E_i(a)\right) \le 0$

This could be thought of as a having sublinear regret relative to b.
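The dominance condition can be illustrated numerically. The following is a minimal sketch, not part of the original post: it assumes we already have finite sequences of realized payoffs for two agents (the names `payoffs_a`, `payoffs_b`, and `average_regret` are hypothetical) and computes the running average that appears inside the limsup. A finite prefix can only suggest the asymptotic behavior; it cannot establish it.

```python
from typing import Sequence


def average_regret(payoffs_a: Sequence[float], payoffs_b: Sequence[float]) -> list[float]:
    """Running average of (payoff of b - payoff of a) for t = 1..T."""
    if len(payoffs_a) != len(payoffs_b):
        raise ValueError("payoff sequences must have equal length")
    averages = []
    total = 0.0
    for t, (pa, pb) in enumerate(zip(payoffs_a, payoffs_b), start=1):
        total += pb - pa
        averages.append(total / t)
    return averages


# Hypothetical example: b always scores 1, while a scores 0 only on
# perfect-square turns, so a's total regret grows like sqrt(t).
T = 10_000
squares = {k * k for k in range(1, 101)}
payoffs_b = [1.0] * T
payoffs_a = [0.0 if i in squares else 1.0 for i in range(1, T + 1)]

regret = average_regret(payoffs_a, payoffs_b)
print(regret[99], regret[-1])  # the average shrinks as t grows
```

In this toy case the average regret keeps shrinking, which is the finite-sample picture of "a dominates b" in the sublinear-regret sense.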

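What's ADT?:

The asymptotic decision theory algorithm works as follows.

Inputs: A sequence of numbers asymptotically decreasing to 0, denoted as $\overline{\delta}$, a finite list of embedders $\mathcal{E}$, a finite list of agents $\mathcal{A}$, and a logical inductor with fast feedback on the true environment, $\overline{\mathbb{P}}$.

$me := \ulcorner ADT(\overline{\delta}, \mathcal{E}, \mathcal{A}, \overline{\mathbb{P}}) \urcorner$

$R^E_t := \mathrm{Ind}_{\delta_t}\left(\left|\mathbb{E}_t(E_t(me_t)) - \mathbb{E}_t(U_t)\right| < \delta_t\right)$

$ADT_t := am^{\operatorname{argmax}_{E \in \mathcal{E}}\left(R^E_t \cdot \mathbb{E}_t(E_t(am^E_t))\right)}_t$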

Ok, so what does this do intuitively? Well, it takes all the embedders, and runs them through a "reality filter" $R^E_t$, which checks whether the embedder, run on the agent itself, replicates what the true reality $U$ actually does, in expectation. For all the embedders that pass the reality filter, it uses the inductor to assess what score argmax gets in that embedder, takes the one that gets the highest score, and then implements argmax for that embedder. In short, it optimistically chooses the embedder where the highest score is attainable (that isn't wrong about reality), and optimizes for that. If it got a good score, that's fine. If it got a lower score than it was expecting, that embedder is less likely to pass the reality filter next time, because $E_t(me_t)$ overshot $U_t$. There isn't a problem (as there is in standard logical inductor decision theory) of systematically underscoring an embedder forever, because since an embedder is a randomized function, it's possible to actually approximate a distribution over what it outputs, so argmax will eventually catch on and start taking (predicted-to-be) optimal actions, and the value of these can also be empirically assessed.
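A single step of this selection rule can be sketched as follows. This is an illustration under loud assumptions, not the actual algorithm: the logical inductor's expectations are replaced by plain numbers supplied by hand, and `soft_indicator` is an assumed piecewise-linear stand-in for the continuous indicator $\mathrm{Ind}_{\delta}$. All names (`EmbedderEstimate`, `reality_filter`, `adt_pick`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EmbedderEstimate:
    name: str
    self_value: float    # stand-in for the inductor's estimate of E_t(me_t)
    argmax_value: float  # stand-in for the inductor's estimate of E_t(am^E_t)


def soft_indicator(gap: float, delta: float) -> float:
    """Assumed continuous version of 'gap < delta': 1 at gap 0, 0 once gap >= delta."""
    return max(0.0, min(1.0, (delta - gap) / delta))


def reality_filter(est: EmbedderEstimate, expected_utility: float, delta: float) -> float:
    """How well the embedder's prediction for the agent itself matches reality."""
    return soft_indicator(abs(est.self_value - expected_utility), delta)


def adt_pick(estimates: Sequence[EmbedderEstimate], expected_utility: float,
             delta: float) -> EmbedderEstimate:
    """Optimistically pick the embedder with the best filtered argmax value."""
    return max(
        estimates,
        key=lambda e: reality_filter(e, expected_utility, delta) * e.argmax_value,
    )


# Hypothetical numbers: the "spoofer" promises more, but mispredicts reality.
estimates = [
    EmbedderEstimate("honest", self_value=0.5, argmax_value=0.6),
    EmbedderEstimate("spoofer", self_value=0.9, argmax_value=1.0),
]
chosen = adt_pick(estimates, expected_utility=0.5, delta=0.1)
print(chosen.name)  # honest
```

The agent would then copy the argmax agent for the chosen embedder. The sketch omits everything that makes the real construction work (the inductor, the shrinking $\delta_t$, and self-reference through $me$).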

Problems with ADT:

The most prominent undesirable feature of this is that it's restricted to a finite set of embedders. Optimistic choice fails very badly on an infinite set of embedders, because we can consider an infinite sequence of embedders that are like "pressing the button dispenses 1 utility forever", "pressing the button delivers an electric shock the first time, and then dispenses 1 utility forever"... "pressing the button delivers an electric shock for the first n times, and then dispenses 1 utility forever"... "pressing the button just always delivers an electric shock". Optimistic choice among embedders keeps pressing the button, because, although it keeps getting shocked, there's always an embedder that promises that the agent's fortunes will turn around on the next turn.
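The button example can be simulated with a toy model. This is only a sketch under simplifying assumptions that are not in the original: a shock is scored as payoff 0, not pressing is given a hypothetical payoff of 0.5, and an embedder is treated as refuted outright once it has promised a reward that did not arrive (rather than being filtered by $R^E_t$). The function names are hypothetical.

```python
def promised_payoff(n: int, presses_so_far: int) -> float:
    """Embedder n: the first n presses shock (payoff 0), every later press pays 1."""
    return 1.0 if presses_so_far >= n else 0.0


def optimistic_agent(turns: int, abstain_payoff: float = 0.5) -> tuple[int, float]:
    """Run optimistic choice against a button that in truth always shocks."""
    presses = 0
    total = 0.0
    for _ in range(turns):
        # Embedders with n < presses already promised a reward that never came,
        # so they are refuted. Among the survivors (n >= presses), the embedder
        # with n == presses promises payoff 1 on the very next press.
        best_promise = promised_payoff(presses, presses)
        if best_promise > abstain_payoff:
            presses += 1
            total += 0.0  # the true environment always shocks
        else:
            total += abstain_payoff
    return presses, total


presses, total = optimistic_agent(50)
print(presses, total)  # 50 0.0
```

However long the run, there is always a surviving embedder that promises a reward on the next press, so the toy agent presses every turn and never does better than 0.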

Also, this optimizes for each choice individually, and does not naturally deal with sequences of choices, which is necessary to handle general environments.

Good Properties of ADT:

A nice feature of this is that exploration steps are not required for good behavior, which is important because counterfactuals are not conditionals.

Another nice feature of this is that it gets ASP (agent simulates predictor) right, which is a surprisingly hard decision theory problem to do well on. When you are rewarded with $10^3$ dollars for picking action 2, and paid $10^6 \cdot \mathbb{E}_t(me_t = 1)$ dollars, the best move is to just take action 1 to win the million dollars, but standard logical inductor decision theory converges to taking action 2, because the predictor isn't powerful enough to predict the rare exploration steps, so the agent will learn that action 2 always gets it more money than action 1, and dial up the probability of taking action 2 until it ends up getting a thousand dollars on each round and missing the larger payoff.

However, because the embedder that represents the true environment is plugging things like "always pick 1" and "always pick 2" into the environment, argmax on that environment will converge to copying the "always pick 1" strategy, so the logical inductor learns that argmax will always pick 1, and then the true embedder claims that $10^6$ dollars are attainable. If ADT learns to use the true embedder, then it will converge to always one-boxing, which is the desired behavior.
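The payoff comparison in ASP can be written out directly. This is a rough sketch with hypothetical names: `predicted_prob_of_one` stands in for the predictor's estimate that the agent takes action 1, and the predictor is assumed to learn a constant policy exactly while missing rare exploration steps entirely.

```python
def asp_payoff(action: int, predicted_prob_of_one: float) -> float:
    """1000 dollars for taking action 2, plus a million times the predicted chance of action 1."""
    bonus = 1_000 if action == 2 else 0
    return bonus + 1_000_000 * predicted_prob_of_one


# Plugging constant policies into the environment: the predictor tracks them.
always_one = asp_payoff(1, predicted_prob_of_one=1.0)
always_two = asp_payoff(2, predicted_prob_of_one=0.0)

# An agent that almost always takes 2 and rarely explores 1: the predictor
# still predicts 2, so the explored action 1 looks worse than action 2.
explore_one = asp_payoff(1, predicted_prob_of_one=0.0)
explore_two = asp_payoff(2, predicted_prob_of_one=0.0)

print(always_one, always_two)    # 1000000.0 1000.0
print(explore_one, explore_two)  # 0.0 1000.0
```

Comparing whole policies favors "always pick 1", while comparing actions within an agent that the predictor already expects to take 2 favors action 2, which is the contrast the two paragraphs above describe.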

This same feature also allows it to win on Parfit's Hitchhiker and XOR Blackmail (I think, 90% probability).

This does not pay up on counterfactual mugging.

The original paper on Asymptotic Decision Theory had *much* more restrictive conditions for good behavior, such as the embedders having to be continuous, all the agents in $\mathcal{A}$ having to converge to a single distribution in the limit, all the embedders in $\mathcal{E}$ having to converge to a single payoff in the limit when fed a convergent agent, and having to use a continuous analogue of argmax. In combination, this meant that most of the games you could define (even "predict this sequence of bits") were outside of the class of problems where optimality is guaranteed.

Why Haven't I Heard of this Before?

Well, for quite a while, we thought it was bad because we thought it crashed into itself in games of chicken, so it got tabled. I'll now go over why the argument in that post is false.

To begin, note that there's a crucial difference between the opponent in the "Spoofer" embedder and in the "Delusion" embedder. In the first case, the only embedder that the opponent is optimizing over is the one with "go-straight bot" (or "swerve bot", depending on what you substitute in). In the second case, the opponent is optimizing over the same three embedders that you are using. ADT with only the "vs. go-straight bot" embedder converges to swerving, and ADT with only the "vs. swerve bot" embedder converges to going straight. So, in the "Spoofer" embedder, the argmax agent will converge to thinking that "go straight" is the best thing to plug in because it gets a straight-swerve outcome.

Assume for the sake of contradiction that the ADT agent (with the 3 embedders listed) converges to going straight. Then on the true environment, it will keep getting straight-straight outcomes, while the "Spoofer" embedder keeps predicting that you'll get straight-swerve outcomes. So, the "Spoofer" embedder eventually fails the reality filter, so the agent will learn to not use it. Both of the remaining embedders advise swerving as getting the best outcome, so the ADT agent converges to swerving, and we have a contradiction.

What went wrong in the original reasoning? As far as I can tell, it originated from accidentally treating "ADT with only one embedder" and "ADT with the three listed embedders" as the same, because "ADT" was used to denote both of them.

Evil Problems and Theorem Failure:

To make things more confusing, the second theorem in the ADT paper, about argmax dominating all agents in $\mathcal{A}$, is wrong, and not in a fixable way. There's an explicit counterexample.

It will be instructive to take a detour to talk about drnickbone's evil problem. Defining a "fair" problem in decision theory is necessary, because you can't just say that a decision theory is the best on all problems. Consider the problem where you are up against an opponent who just really hates some particular decision theory, and penalizes anyone that uses it to select actions. Of course, the decision theory of your choice will fail on this problem. So, instead, we would hope that there's some notion of a "fair" problem, such that we can say that a decision theory attains equal or greater utility than all competitors on all fair problems.

An initial attempt at defining a fair problem was one where all agents that select the same action get the same result. The problem is fully action-determined, and your payoff doesn't depend at all on your ritual of cognition. This is the notion of "fair" used in the original ADT paper.

The Evil problem is a decision theory problem that is completely action-determined, and fair by the old definition of "fair", where (decision theory of your choice, henceforth abbreviated as YDT for "Your Decision Theory") gets systematically lower utility than most other competing decision theories. Consider a variant of Newcomb's problem where there are two boxes, and you may select one of them. Omega's decision process is to predict what box YDT would take, and then put 1000 dollars in the *other* box, and nothing in the box that YDT takes. Also, one of the boxes is lustrous and sparkly and would make a nice addition to your box collection, and you value that box at 10 dollars.

Now, you are like "well, I'm using YDT, and Omega is really accurate, and I left my random number generator at home, so no matter which box I pick, I'll get 0 dollars. Might as well go for the shiny box". And all the other decision theories like CDT and EDT can run through that reasoning and take the blank box, which contains 1000 dollars, and walk away substantially richer than before. Note that any agent that takes the same box gets the same payoff. So it is intuitively unfair, despite being completely action-determined.

This same sort of problem, when put into embedder form, leads to argmax systematically underperforming "take the blank box". The argmax agent will converge to taking the shiny box about 50.5% of the time. This is because, since it is possible to go back and compute what "blank-box" and "shiny-box" would have yielded on turns where you didn't take that box, you keep thinking that they'd get decent payoffs. In expectation, "blank-box" gets 505 dollars, and "shiny-box" gets 505 dollars, while "argmax" gets 5.05 dollars. This is an issue for *all* decision theories that operate by treating the environment as a function, and plugging actions into it. It will always consider the action it didn't take to have a higher utility, and this is actually true, because there are "objective counterfactuals" that can be checked. In particular, condition 1 in my probabilistic tiling post was required specifically to rule out these sorts of shenanigans. Check the discussion on why CDT fails this condition; it's talking about how these sorts of evil problems cause the expected utility of "I take an action" to systematically deviate from the expected utility of the action you actually take.

To return to ADT, since it's treating the environment as a function that it can plug stuff into, it's vulnerable to this exploit.
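The equilibrium in the embedder form of the evil problem can be worked out from the payoffs stated earlier (1000 dollars in the box argmax does not take, 10 dollars for the shiny box itself). The sketch below assumes Omega's prediction matches the box argmax actually takes on each turn, and that argmax settles where the two constant policies look equally good. The function names are hypothetical.

```python
import math


def blank_value(p_shiny: float) -> float:
    """'Always take blank' wins the 1000 dollars whenever argmax took the shiny box."""
    return 1000 * p_shiny


def shiny_value(p_shiny: float) -> float:
    """'Always take shiny' gets the 10 dollar box, plus 1000 whenever argmax took blank."""
    return 10 + 1000 * (1 - p_shiny)


def argmax_value(p_shiny: float) -> float:
    """Argmax's own box is always empty; it only ever collects the shiny bonus."""
    return 10 * p_shiny


# Indifference: 1000 p = 10 + 1000 (1 - p)  =>  p = 1010 / 2000
p = (1000 + 10) / (2 * 1000)

print(p, blank_value(p), shiny_value(p), argmax_value(p))
```

Under these assumptions both constant policies are estimated at roughly 505 dollars per round, while argmax itself collects only about 5 dollars, two orders of magnitude less than either policy it is choosing between.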

So, what went wrong in the original proof that argmax dominates everything in $\mathcal{A}$? It was the (not explicitly listed) step that went from $\lim_{t \to \infty} \mathbb{E}_t(E_t(a_t) - E_t(a'_t)) = 0$ (which is true because $a$ and $a'$ are in a maximal equivalence class) to $\lim_{t \to \infty} \mathbb{E}_t(|a_t, a'_t|) = 0$ (which is a necessary step to get argmax to have the same property). However, just because the utilities are the same doesn't mean the probability distributions are the same! In fact, given the continuous version of argmax in that paper, it's possible to come up with a much more mundane case where it fails that *definitely* isn't an evil problem. Consider the problem where you must make the same move as your opponent, and your probability distribution over moves is the same as your opponent's probability distribution over moves. Let $a$ and $a'$ be the agents that just implement the two constant strategies. $E(a) = E(a') = 1$, but, by the definition of continuous argmax that was given in the paper, argmax would converge to a 50/50 mix of the two strategies since they are of equal value. When this is substituted into both your and your opponent's move slot, it produces an expected utility of 0.5.

What is Fairness?:

These evil problems are a problem for showing that argmax dominates everything in $\mathcal{A}$. While attempting a proof, I came up with a possible alternate definition of "fair", but it's more accurate to call it "regret-free". An embedder $E$ is "regret-free" iff

$\forall F: \liminf_{t \to \infty} \left(\mathbb{E}_t(E_t(am^E_t)) - \mathbb{E}_t(E_t(am^F_t))\right) \ge 0$.

Intuitively, argmax doesn't regret its decision, in the sense of not wishing it was optimizing for some specific other world to get a more fortunate sequence of actions. The agent doesn't wish it was deluded about the world in order to be decorrelated with whatever is punishing its action sequence. One effect of this definition is that it shows that (according to the expectation of the agent), argmax will do better than any particular strategy in the strategy set, because you can consider an environment that rewards that exact strategy being played on every round. Another notable effect is that it rules out the original Evil problem as "unfair", but the modified variant where the action of the agent is substituted into both your action, and Omega's prediction, is fair.

So there's still a regret-free/"fair" environment that can model the Evil problem faithfully, but it says that the proper thing to judge YDT against isn't environments where Omega penalizes YDT, but rather environments where Omega penalizes the decision theory of whoever is picking the boxes. And of course, CDT, EDT, and everything else also gets either 0 or 10 dollars worth of value in this modified problem, and balance is restored.

Sadly, this particular definition of "fair" is inadequate for general use, because it is possible to construct environments where some arbitrary decision theory other than argmax systematically gets lower utility than argmax. This can be fixed by making this definition relative to the agent under consideration, but then you run into the problem of simple agents calling (intuitively fair) games "unfair relative to me". There will be another post about this.

Lack-of-Proof-of-Optimality for ADT:

We might try to go for a theorem like

Conjecture 1: *If all environments in $\mathcal{E}$ are regret-free, then, on all embedders $E$ s.t. $\liminf_{t \to \infty} R^E_t > 0$, $ADT$ dominates $am^E$.*

In short, ADT seems like it will learn to do as well as argmax on all environments that don't get ruled out. This conjecture is still open! I thought I had a proof, and it failed, and then I thought I had a disproof, and it failed, and then I thought I had another disproof, and it turned out that I couldn't show that one of the environments was regret-free. Maybe some extra conditions are required for this conjecture to be a theorem, but I suspect that the environment is actually regret-free and the conjecture is false, pointing to a genuine hole in ADT.

I'll describe the first and last of these failed attempts here, in the hopes that they will provide help on how to conclusively prove or disprove the conjecture.

Failed Attempt 1:

To begin with, if the true environment $E$ is in $\mathcal{E}$, and *one* other environment $F$ is "clearly in the lead" (i.e., $\forall F' \in \mathcal{E}, F' \neq F: R^F_t \cdot \mathbb{E}_t(F_t(am^F_t)) \ge R^{F'}_t \cdot \mathbb{E}_t(F'_t(am^{F'}_t)) + \varepsilon$ for some fixed $\varepsilon$), and this opportunity recurs infinitely often, it is possible to money-pump this. In particular, since $R^E_t$ (the reality filter for the true environment) converges to 1, $\mathbb{E}_t(F_t(am^F_t)) \ge \mathbb{E}_t(E_t(am^E_t)) + \varepsilon$. The money pump works by buying one share in $E_t(am^F_t)$, selling one share in $F_t(am^F_t)$, selling one share in $E_t(ADT_t)$, and buying one share in $F_t(ADT_t)$. Because the latter two have prices very similar to each other (because both $E$ and $F$ pass the reality filter), their prices cancel out, yielding $\varepsilon$ dollars upfront from the first buy-sell pair, using the fact that $E$ is a regret-free environment, so $E_t(am^F_t)$ is priced at or below $E_t(am^E_t)$ in the limit. Now, because the embedder $F$ will actually be picked, all the purchased and sold shares have value that cancels out to 0. (Well, actually, it's a bit more complicated than that: you need to add an additional condition that the information about what ADT copies shows up in the deductive process in a timely manner, because logical inductors judge the value of a trader by their worst-case value.) Anyways, this pumps free money from each time where *one* particular embedder is "in the lead" above the true environment.

However, this doesn't rule out cases where multiple embedders that aren't the true environment are "in the lead". Intuitively, it's possible to have a pileup where multiple embedders have approximately the same estimate of their argmax value, and estimate of what ADT actually does in them. There's another case where the true environment is "in the lead" along with several others. The obvious fix to this is to have the trader buy and sell conditional contracts, so the money-pump goes through for whichever embedder is actually selected, and all the others cancel out to 0 value. The point where this fails is that it wouldn't necessarily be buying the shares as originally stated. In order to get the $F_t(am^F_t)$ share, it would be purchasing the share at a price of $\mathbb{E}_t(F_t(am^F_t) \mid ADT_t = am^F_t)$, which may not equal $\mathbb{E}_t(F_t(am^F_t))$. Attempts to construct more sophisticated money-pumps all met with failure due to this specific issue, where regret-free environments may say that argmax does worse when ADT copies it.

Failed Attempt 2:

We will give a set of environments, and see that two of the embedders will converge to passing the reality filter, but ADT dominates neither of the argmax agents (however, one of the environments is not known to be regret-free).

In particular, let the true environment be a game of chicken against the same ADT agent. $\mathcal{A}$ will consist of the three actions "swerve", "straight", and "consult random number generator in environment to swerve 90% of the time" (augmenting $\mathcal{A}$ to allow consulting the random number generator with different randomization probabilities doesn't change the result). $\mathcal{E}$ will consist of the two environments $E$ (which substitutes the action into both your action slot and the opponent's action slot), and $F$ (which substitutes the action into your action slot, while the opponent is you). The utility function is 1 for going straight while your opponent swerves, 0 for crashing, and 0.8 for swerving.

To begin with, $\mathbb{E}_t(E(ADT_t)) \simeq_t \mathbb{E}_t(F(ADT_t))$, because both environments are identical when you plug in the agent itself, so both of them pass the reality filter. Argmax on the embedder $E$ converges to taking the action "swerve 90% of the time".
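The claim about argmax on the first embedder can be checked with a small calculation. This sketch assumes that when a randomizing action is substituted into both slots, the two players draw independently from the random number generator; the names `chicken_value` and `ACTIONS` are hypothetical.

```python
def chicken_value(p_me_swerve: float, p_opp_swerve: float) -> float:
    """Expected payoff: 0.8 for swerving, 1 for straight vs. swerve, 0 for a crash."""
    return p_me_swerve * 0.8 + (1 - p_me_swerve) * p_opp_swerve * 1.0


# Each action is summarized by its probability of swerving.
ACTIONS = {"swerve": 1.0, "straight": 0.0, "swerve 90% of the time": 0.9}

# Embedder E substitutes the same action into both slots.
values_in_E = {name: chicken_value(p, p) for name, p in ACTIONS.items()}
best_in_E = max(values_in_E, key=values_in_E.get)

print(values_in_E)
print(best_in_E)  # swerve 90% of the time
```

Under the independence assumption the three actions are worth about 0.8, 0, and 0.81 respectively, so the randomizing action is the one argmax copies, matching the 0.81 figure used in the argument that follows.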

Assume that ADT learns to use the embedder E, predictably, infinitely often (that is, infinitely often, the market assigns a probability very near 1 of using the embedder E). Then, on this sequence, it will converge to swerving 90% of the time, and in that case, ADT will learn to always use the embedder F in response, and go straight, because this passes the reality filter and offers a payoff of 0.9, which is greater than the payoff of 0.81 that E offers. So, we have a contradiction.

Assume that ADT learns to use the embedder $F$, predictably, infinitely often. In $F$, going straight pays the probability that the opponent (ADT itself) swerves, while swerving pays 0.8. So, if the probability ADT assigns to itself swerving is $> 0.8 + \epsilon$, it goes straight, and if the probability ADT assigns to itself swerving is $< 0.8 - \epsilon$, it swerves. So the probability ADT assigns to itself swerving on that subsequence will converge to 0.8, which means that $\mathbb{E}_t(F(am^F_t)) \simeq_t 0.8$, and because $E$ promises a greater utility, it will learn to use the embedder $E$ in those cases. So again, we have a contradiction.

So, according to the probabilities of the logical inductor itself, it is unpredictable what ADT will do, and by calibration, this applies to what ADT does in reality. Since differences in $\mathbb{E}_t(F(am^F_t))$ and $\mathbb{E}_t(E(am^E_t))$ allow predicting which embedder ADT will pick, the two must converge to each other, and since the latter converges to 0.81, the former must as well.

Now (I'm moderately sure it's the only solution, but don't have a proof), there is a stable solution where the agent doesn't know which embedder it picks, but with 90% probability it picks E, and with 10% probability it picks F (and argmax always picks "go straight", so it runs into itself), for an expected utility of 0.729, while both embedders claim that argmax for them specifically gets an expected utility of 0.81. Because it's unpredictable which embedder it will go with, the expected value of going straight (in embedder F) is 0.81, the expected value of swerving is 0.8, and the expected value of randomizing is 0.801. So argmax converges to always going straight, for an expected utility (in embedder F) of 0.81.
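The numbers in this candidate solution can be reproduced with a few lines of arithmetic. This is a sketch that simply follows the reasoning in the paragraph above: the opponent is assumed to swerve with probability 0.9 × 0.9 from the inductor's point of view, and the realized utility assumes that picking the second embedder always ends in a crash. All names are hypothetical.

```python
P_PICK_E = 0.9          # probability of picking the first embedder
P_SWERVE_GIVEN_E = 0.9  # argmax for that embedder swerves 90% of the time

# Probability the opponent (ADT itself) swerves, as seen by the inductor.
p_opp_swerve = P_PICK_E * P_SWERVE_GIVEN_E


def value_in_F(p_me_swerve: float) -> float:
    """Second embedder: my action goes in my slot, the opponent is ADT itself."""
    return p_me_swerve * 0.8 + (1 - p_me_swerve) * p_opp_swerve


straight = value_in_F(0.0)
swerve = value_in_F(1.0)
randomize = value_in_F(0.9)

# Realized utility under the post's reasoning: 0.81 when the first embedder is
# picked, and 0 when the second is picked (both copies go straight and crash).
realized = P_PICK_E * 0.81 + (1 - P_PICK_E) * 0.0

print(straight, swerve, randomize, realized)
```

These come out to roughly 0.81, 0.8, 0.801, and 0.729, so going straight is the best response inside the second embedder even though the realized utility falls short of the 0.81 that both embedders promise.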