# Ω 7

Decision TheoryLogical Induction
Frontpage
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(asymptotic decision theory, initially detailed in this paper) was a proposed decision theory with logical inductors, developed after EDT with logical inductors and exploration steps. It is a possible candidate for conceptual progress in decision theory, but some basic questions about its performance are still unsettled, and there are several issues with the current implementation of it.

## Definitions:

A market is a computable function with type (where is the set of sentences of math) that will be denoted as . We will be considering logical inductors in this setting, and in particular, each finite stage of a logical inductor is a market.

Let be a finite set of actions. An agent is a computable sequence of computable functions of type . It takes a timestep and a probability distribution, and selects a distribution over actions. Because logical inductors associate each timestep with an action, specifying the logical inductor and the agent specifies a sequence of actions. Because we will be assuming some fixed logical inductor in the background, we will suppress it in the notation and refer to the action produced by an agent at time t as or . Randomization can be handled by using the diagonalization sentences used to define exploration (this acts in such a way that no trader in the inductor is able to reliably predict what the agent does), or it can be handled by having one of the actions be to call a random number generator in the environment.

An embedder is a sequence of randomized functions of type (denoted by or , with the function at time t being denoted by or ). Let denote the random variable that corresponds to the environment using a uniform random distribution of bitstrings for its randomness tape. An embedder must have the probability distribution of being computably approximable on computable inputs.

An argmax agent is an agent that takes a finite set of other agents , and a single embedder , and outputs on turn . is defined as it usually is for logical inductors. Notice that, although may be very difficult to compute, putting it inside an expectation allows a logical inductor to output a decent guess as to what it is anyways. The agents all had to be computable in order for the argmax agent to duplicate their behavior at arbitrary turns. There is a dependence on the set , the embedder , and the logical inductor, but because the logical inductor and will be treated as fixed, we will write the time-t action produced by the argmax agent as .

will denote the "true environment", which is a sequence of values in .

Finally, the logical inductor will be required to have "fast feedback". This means that after each turn, the following information will show up in the deductive process at time .

1: (an interval containing) the true value of .

The deductive process is the sequence of theorems that the logical inductor sees. The reason that the true utility has to be in an interval, instead of reported directly, is because the true utility may have arbitrarily many digits, which hampers the ability of traders to use that information to judge bets on how will turn out. This condition is present to ensure that, if is generated by taking and feeding it a uniform random bitstring, .

Theorem 1: If is generated on each turn by running with a bitstring sampled uniformly at random, then .

We will apply the basic trading strategy from Theorem 4.5.9 in the logical induction paper. In this case, if the left-hand side is overpriced by infinitely often (by a failure of convergence, and the same argument can be applied symmetrically to the right-hand side being overpriced infinitely often), the traders buy a fraction of a share of and sell a fraction of a share of , and keep doing it until 1 share has been moved in total. This takes care of having -generable trade magnitudes. According to the law of large numbers, with probability 1, there is a finite time after which all of the sub-traders have their pile of shares have a value within of , so the initial traders in the efficiently emulatable sequence of traders can be clipped off, and all the other sub-traders get or more value from selling the high-priced expectation and buying the low-priced one, so the necessary conditions for the -ROI Lemma are fulfilled and the resulting trader exploits the market.

Finally, we will define dominance. An agent dominates an agent on an embedder if

This could be thought of as having sublinear regret relative to .

The asymptotic decision theory algorithm works as follows.

Inputs: A sequence of numbers asymptotically decreasing to , denoted as , a finite list of embedders , a finite list of agents , and a logical inductor with fast feedback on the true environment,

Ok, so what does this do intuitively? Well, it takes all the embedders, and runs them through a "reality filter" which checks whether the embedder, run on the agent itself, replicates what the true reality actually does, in expectation. For all the embedders that pass the reality filter, it uses the inductor to assess what score argmax gets in that embedder, takes the one that gets the highest score, and then implements argmax for that embedder. In short, it optimistically chooses the embedder where the highest score is attainable (that isn't wrong about reality), and optimizes for that. If it got a good score, that's fine. If it got a lower score than it was expecting, that embedder is less likely to pass the reality filter next time, because undershot . There isn't a problem (as there is in standard logical inductor decision theory) of systematically underscoring an embedder forever, because since an embedder is a randomized function, it's possible to actually approximate a distribution over what it outputs, so argmax will eventually catch on and start taking (predicted-to-be) optimal actions, and the value of these can also be empirically assessed.

The most prominent undesirable feature of this is that it's restricted to a finite set of embedders. Optimistic choice fails very badly on an infinite set of embedders, because we can consider an infinite sequence of embedders that are like "pressing the button dispenses 1 utility forever", "pressing the button delivers an electric shock the first time, and then dispenses 1 utility forever"... "pressing the button delivers an electric shock for the first n times, and then dispenses 1 utility forever"... "pressing the button just always delivers an electric shock". Optimistic choice among embedders keeps pressing the button, because, although it keeps getting shocked, there's always an embedder that promises that the agent's fortunes will turn around on the next turn.

Also, this optimizes for each choice individually, and does not naturally deal with sequences of choices, which is necessary to handle general environments.

A nice feature of this is that exploration steps are not required for good behavior, which is important because counterfactuals are not conditionals.

Another nice feature of this is that it gets ASP (agent simulates predictor) right, which is a surprisingly hard decision theory problem to do well on. When you are rewarded with dollars for picking action 2, and paid dollars, the best move is to just take action 1 to win the million dollars, but standard logical inductor decision theory converges to taking action 2, because the predictor isn't powerful enough to predict the rare exploration steps, so the agent will learn that action 2 always gets it more money than action 1, and dial up the probability of taking action 2 until it ends up getting a thousand dollars on each round and missing the larger payoff.

However, because the embedder that represents the true environment is plugging things like "always pick 1" and "always pick 2" into the environment, argmax on that environment will converge to copying the "always pick 1" strategy, so the logical inductor learns that argmax will always pick 1, and then the true embedder claims that dollars are attainable. If learns to use the true embedder, then it will converge to always one-boxing, which is the desired behavior.

This same feature also allows it to win on Parfit's Hitchiker and XOR Blackmail (I think, 90% probability)

This does not pay up on counterfactual mugging.

The original paper on Asymptotic Decision Theory had much more restrictive restrictions for good behavior, such as the embedders having to be continuous, all the agents in having to converge to a single distribution in the limit, all the embedders in having to converge to a single payoff in the limit when fed a convergent agent, and having to use a continuous analogue of argmax, and in combination, this meant that most of the games you could define (even "predict this sequence of bits") were outside of the class of problems where optimality is guaranteed.

## Why Haven't I Heard of this Before?

Well, for quite a while, we thought it was bad because we thought it crashed into itself in games of chicken, so it got tabled for a while. I'll now go over why the argument in that post is false.

To begin, note that there's a crucial difference between the opponent in the "Spoofer" embedder and in the "Delusion" embedder. In the first case, the only embedder that the opponent is optimizing over is the one with "go-straight bot" (or "swerve bot", depending on what you substitute in). In the second case, the opponent is optimizing over the same three embedders that you are using. with only the "vs. go-straight bot" embedder converges to swerving, and with only the "vs. swerve bot" embedder converges to going straight. So, in the "Spoofer" embedder, the argmax agent will converge to thinking that "go straight" is the best thing to plug in because it gets a straight-swerve outcome.

Assume for the sake of contradiction that the agent (with the 3 embedders listed) converges to going straight. Then on the true environment, it will keep getting straight-straight outcomes, while the "Spoofer" embedder keeps predicting that you'll get straight-swerve outcomes. So, the "Spoofer" embedder eventually fails the reality filter, so the agent will learn to not use it. Both of the remaining embedders advise swerving as getting the best outcome, so the agent converges to swerving, and we have a contradiction.

What went wrong in the original reasoning? As far as I can tell, it originated from accidentally treating " with only one embedder" and " with the three listed embedders" as the same, because "" was used to denote both of them.

## Evil Problems and Theorem Failure:

To make things more confusing, the second theorem in the ADT paper , about argmax dominating all agents in , is wrong, and not in a fixable way. There's an explicit counterexample.

It will be instructive to take a detour to talk about drnickbone's evil problem. Defining a "fair" problem in decision theory is necessary, because you can't just say that a decision theory is the best on all problems. Consider the problem where you are up against an opponent who just really hates some particular decision theory, and penalizes anyone that uses it to select actions. Of course, the decision theory of your choice will fail on this problem. So, instead, we would hope that there's some notion of a "fair" problem, such that we can say that a decision theory attains equal or greater utility than all competitors on all fair problems.

An initial attempt at defining a fair problem was one where all agents that select the same action get the same result. The problem is fully action-determined, and your payoff doesn't depend at all on your ritual of cognition. This is the notion of "fair" used in the original paper.

The Evil problem is a decision theory problem that is completely action-determined, and fair by the old definition of "fair" where (decision theory of your choice, heretofore abbreviated as for "Your Decision Theory") gets systematically lower utility than most other competing decision theories. Consider a variant of Newcomb's problem where there are two boxes, and you may select one of them. Omega's decision process is to predict what box would take, and then put 1000 dollars in the other box, and nothing in the box that takes. Also, one of the boxes is lustrous and sparkly and would make a nice addition to your box collection, and you value that box at 10 dollars.

Now, you are like "well, I'm using , and Omega is really accurate, and I left my random number generator at home, so no matter which box I pick, I'll get 0 dollars. Might as well go for the shiny box". And all the other decision theories like and can run through that reasoning and take the blank box, which contains 1000 dollars, and walk away substantially richer than before. Note that any agent that takes the same box gets the same payoff. So it is intuitively unfair, despite being completely action-determined.

This same sort of problem, when put into embedder form, leads to argmax systematically underperforming "take the blank box". The argmax agent will converge to taking the shiny box about of the time. This is because, since it is possible to go back and compute what "blank-box" and "shiny-box" would have yielded on turns where you didn't take that box, you keep thinking that they'd get decent payoffs. In expectation, "blank-box" gets 502.5 dollars, and "shiny-box" gets 502.5 dollars, while "argmax" gets 5.025 dollars. This is an issue for all decision theories that operate by treating the environment as a function, and plugging actions into it. It will always consider the action it didn't take to have a higher utility, and this is actually true, because there are "objective counterfactuals" that can be checked. In particular, condition 1 in my probabilistic tiling post was required specifically to rule out these sorts of shenanigans. Check the discussion on why fails this condition, it's talking about how these sorts of evil problems cause the expected utility of "I take an action" to systematically deviate from the expected utility of the action you actually take.

To return back to since it's treating the environment as a function that it can plug stuff into, it's vulnerable to this exploit.

So, what went wrong in the original proof that argmax dominates everything in ? It was the (not explicitly listed) step that went from (which is true because and are in a maximal equivalence class) to (which is a necessary step to get argmax to have the same property). However, just because the utilities are the same doesn't mean the probability distributions are the same!! In fact, given the continuous version of argmax in that paper, it's possible to come up with a much more mundane case where it fails that definitely isn't an evil problem. Consider the problem where you must make the same move as your opponent, and your probability distribution over moves is the same as your opponent's probability distribution over moves. Let and be the agents that just implement the two constant strategies. , but, by the definition of continuous argmax that was given in the paper, argmax would converge to a 50/50 mix of the two strategies since they are of equal value. When this is substituted into both your and your opponent's moveslot, it produces an expected utility of .

## What is Fairness?:

These evil problems are a problem for showing that argmax dominates everything in . While attempting a proof, I came up with a possible alternate definition of "fair", but it's more accurate to call it "regret-free". An embedder is "regret-free" iff

.

Intuitively, argmax doesn't regret its decision, in the sense of not wishing it was optimizing for some specific other world to get a more fortunate sequence of actions. The agent doesn't wish it was deluded about the world in order to be decorrelated with whatever is punishing its action sequence. One effect of this definition is that it shows that (according to the expectation of the agent), argmax will do better than any particular strategy in the strategy set, because you can consider an environment that rewards that exact strategy being played on every round. Another notable effect is that it rules out the original Evil problem as "unfair", but the modified variant where the action of the agent is substituted into both your action, and Omega's prediction, is fair. So there's still a regret-free/"fair" environment that can model the Evil problem faithfully, but it says that the proper thing to judge against isn't environments where Omega penalizes , but rather environments where Omega penalizes the decision theory of whoever is picking the boxes. And of course, , , and everything else also gets either 0 or 10 dollars worth of value in this modified problem, and balance is restored.

Sadly, this particular definition of "fair" is inadequate for general use, because it is possible to construct environments where some arbitrary decision theory other than argmax systematically gets lower utility than argmax. This can be fixed by making this definition relative to the agent under consideration, but then you run into the problem of simple agents calling (intuitively fair) games "unfair relative to me". There will be another post about this.

We might try to go for a theorem like

Conjecture 1: If all environments in are regret-free, then, on all embedders s.t. , dominates .

In short, seems like it will learn to do as well as argmax on all environments that don't get ruled out. This conjecture is still open! I thought I had a proof, and it failed, and then I thought I had a disproof, and it failed, and then I thought I had another disproof, and it turned out that I couldn't show that one of the environments was regret-free. Maybe some extra conditions are required for this conjecture to be a theorem, but I suspect that the environment is actually regret-free and the conjecture is false, pointing to a genuine hole in .

I'll describe the first and last of these failed attempts here, in the hopes that they will provide help on how to conclusively prove or disprove the conjecture.

Failed Attempt 1:

To begin with, if the true environment is in , and one other environment is "clearly in the lead" (ie, for some fixed ), and this opportunity recurs infinitely often, it is possible to money-pump this. In particular, since (the reality filter for the true environment) converges to , . The money pump works by buying one share in , selling one share in , selling one share in , and buying one share in . Because both of the latter have prices very similar to each other (because both and pass the reality-filter), their price cancels out, yielding dollars upfront from the first buy-sell pair, along with the fact that is a regret-free environment, so is priced at or below in the limit. Now, because the embedder will be actually picked, all the purchased and sold shares have value that cancels out to . (well, actually, it's a bit more complicated than that, you need to add an additional condition that the information about what copies shows up in the deductive process in a timely manner, because logical inductors judge the value of a trader by their worst-case value.) Anyways, this pumps free money from each time where one particular embedder is "in the lead" above the true environment.

However, this doesn't rule out cases where multiple embedders that aren't the true environment are "in the lead". Intuitively, it's possible to have a pileup where multiple embedders have approximately the same estimate of their argmax value, and estimate of what actually does in them. There's another case where the true environment is "in the lead" along with several others. The obvious fix to this is to have the trader buy and sell conditional contracts, so the money-pump goes through for whichever embedder is actually selected, and all the others cancel out to value. The point where this fails is that it wouldn't necessarily be buying the shares as originally stated. In order to get the share, it would be purchasing the share at a price of which may not equal . Attempts to construct more sophisticated money-pumps all met with failure due to this specific issue where regret-free environments may say that argmax does worse when copies it.

Failed Attempt 2:

We will give a set of environments, and see that two of the embedders will converge to passing the reality filter, but dominates neither of the argmax agents (however, one of the environments is not known to be regret-free.)

In particular, let the true environment be a game of chicken against the same agent. will consist of the three actions "swerve", "straight", and "consult random number generator in environment to swerve of the time" (augmenting to allow consulting the random number generator with different randomization probabilities doesn't change the result.) will consist of the two environments (which substitutes the action into both your action slot and the opponent's action slot), and (which substutes the action in your action slot, and the opponent is you. The utility function is for going straight while your opponent swerves, for crashing, and for swerving.

To begin with, , because both environments are identical when you plug in the agent itself, so both of them pass the reality filter. Argmax
on the embedder converges to taking the action "swerve of the time".

Assume that learns to use the embedder , predictably, infinitely often (that is, infinitely often, the market assigns a probability very near 1 of using the embedder ). Then, on this sequence, it will converge to swerving of the time, and in that case, will learn to always use the embedder in response, and go straight, because this passes the reality filter and offers a payoff of , which is greater than the payoff of that offers. So, we have a contradiction.

Assume that learns to use the embedder , predictably, infinitely often. Then, if the probability assigns to itself going straight is , it goes straight, if the probability assigns to itself going straight is , it swerves. So the probability assigns to itself going straight on that subsequence will converge to , which means that , and because promises a greater utility, it will learn to use the embedder in those cases. So again, we have a contradiction.

So, according to the probabilities of the logical inductor itself, it is unpredictable what will do, and by calibration, this applies to what does in reality. Since differences in and allow predicting which embedder will pick, the two must converge to each other, and since the latter converges to , the former must as well.

Now (I'm moderately sure it's the only solution, but don't have a proof), there is a stable solution where the agent doesn't know which embedder it picks, but with probability it picks , and with probability it picks (and argmax always picks "go straight", so it runs into itself), for an expected utility of , while both embedders claim that argmax for them specifically gets an expected utility of . Because it's unpredictable which embedder it will go with, the expected value of going straight (in embedder ) is , the expected value of swerving is , and the expected value of randomizing is . So argmax converges to always going straight, for an expected utility (in embedder ) of .