Geometric Exploration, Arithmetic Exploitation

Scott Garrabrant

This post is going to mostly be propaganda for Thompson sampling. However, the presentation is quite different from the standard presentation. I will be working within a toy model, but I think some of the lessons will generalize. I end with some discussion of fairness and exploitation.

A Exploration/Exploitation Toy Model

Imagine you are interacting with the world over time. We are going to make a couple substantial assumptions. Firstly, we assume that you know what you are trying to optimize. Secondly, we assume that you get high quality feedback on the thing you are trying to optimize. In the spirit of the online learning community, we will be referring to your goal as "reward" rather than "utility."

On day , you choose an action, $a_{t} \in A$ , and then the environment responds with an observation $e_{t} \in E$ . Both you and the environment may respond randomly, and may react to the entire history of what has happened. (If you want the environment to act first, you could just make $a_{0}$ not matter.) We assume we have some fixed bounded reward function $r : E \to R$ , that you know. (If you want reward to be more of a function of the history, you can just include more history information in $e_{t}$ .)

We are going to be imagining agents that are myopically choosing $a_{t}$ , so as to maximize $r (e_{t})$ . The choice to do this, rather than e.g. maximizing the total reward of the next $m$ time steps, is mostly for simplicity. However, the choice not to have preferences that are over the entire future is basically the assumption that you get high quality feedback on what we are trying to optimize, which is a substantial philosophical assumption.

So, the environment can be thought of as a function, that takes in a history, $(a_{0} e_{0}) (a_{1} e_{1}) \dots (a_{t - 1} e_{t - 1}) a_{t}$ , and returns a probability distribution on $e_{t}$ which we then sample from. If the agent knew what the environment was, the agent would just choose the $a_{t}$ each round which maximizes the the expectation of $r (e_{t})$ . However, we want to model an agent that is uncertain about what environment it is interacting with, and learning over time.

Let us say that the agent has some probability distribution over possible environments. The set $H$ (for hypotheses) of possible environments is the set of functions that take in a string of the form $(a_{0} e_{0}) (a_{1} e_{1}) \dots (a_{t - 1} e_{t - 1}) a_{t}$ , and output a distribution on $E$ . We are given a distribution $μ$ on $H$ . I will discuss three different decision procedures to get an action out of this distribution $μ$ , which I will call AIXI, plurality, and Thompson sampling.

AIXI

The thing I am calling the AIXI decision procedure (not to be confused with actual AIXI proposal, which among other things assumes that $μ$ is the universal prior) works as follows.

$a_{t} (O) = {argmax}_{p \in Δ A} E_{h \sim (μ | O)} E_{a_{t} \sim p} E_{e_{t} \sim h (O, a_{t})} r (e_{t})$ ,

where $O = a_{0} e_{0} \dots a_{t - 1} e_{t - 1}$ represents the observations on previous days, and $(μ | O)$ represents the posterior distribution on hypotheses you get after conditioning on $O$ .

The above definition is basically just saying "maximize the expected reward in the next round", but is separating the expectation up into three parts. The inner most expectation is the expected reward of a given hypothesis and an action. The middle expectation is the expectation over what action the agent ends up taking. The outer expectation is the expectation over various different hypotheses about what kind of environment the agent is interacting with. In the outer expectation, the agent is conditioning on all of its observations so far, so the probability of each hypothesis is scaled by the probability that hypothesis would have assigned to the past environmental behavior. (The agent is not specifically only updating for the outer expectation, it just does not have any effect on the inner two expectations.)

You may notice there are two sources of uncertainty over the environment. There is the uncertainty in $μ$ over various hypotheses, and there is also uncertainty within each hypothesis coming from the fact that the hypotheses are themselves probabilistic. We could combine these into one, and just have all hypotheses be deterministic, (which is what the original AIXI does). This would have no effect on the AIXI proposal, but it would have an effect on the next two proposals.

Further, you may notice, that there is no incentive to randomize here, so we could get rid of another source of uncertainty by only considering deterministic actions. Again, this will have no effect here, but will have an effect in our final proposal.

Plurality

The thing I am calling the plurality decision procedure works as follows. Let $O$ and $(μ | O)$ be defined as above. Given a hypothesis $h \in H$ , let $m (h, O) = {argmax}_{a \in A} E_{e_{t} \sim h (O, a)} (r (e_{t}))$ . In other words, $m (h, O)$ is the action that $h$ believes gives the highest expected utility, given observation $O$ .

$a_{t} (O) = {argmax}_{p \in Δ A} E_{h \sim (μ | O)} P_{a_{t} \sim p} (a_{t} = m (h, O))$ .

This can be thought of as putting your choice of action up to a vote. Each hypothesis chooses its favorite action. The expectation and the probability in the above formula combine into one big probability over hypotheses and actions, so you are essentially choosing so as to maximize the probability that a randomly chosen hypothesis endorses your randomly chosen action as the best action.

Like last time, there is no incentive to randomize, so we can actually view this as choosing an action that maximizes the probability that a randomly chosen hypothesis thinks that is the best action.

Unlike with the AIXI proposal, it now does matter where we draw the line between uncertainty within the hypothesis, and uncertainty across different hypotheses, because uncertainty within hypotheses essentially forms voting blocs.

The main benefit of this proposal over the last, (more elegant) proposal, is that it has some resistance to Pascal's mugging situations, because a hypothesis still only gets to vote proportional to its probability, regardless of how much utility differential it promises you.

Thompson Sampling

Let $O$ , $(μ | O)$ , and $m (h, O)$ be defined as above. The thing I am calling the Thompson sampling proposal works as follows:

$a_{t} (O) = {argmax}_{p \in Δ A} G_{h \sim (μ | O)} P_{a_{t} \sim p} (a_{t} = m (h, O))$ ,

where $G$ is the geometric expectation, and is equivalent to the exponential of the expectation of the logarithm. (The only difference between this and the plurality proposal is replacing one arithmetic expectation with a geometric expectation.)

In this proposal, there is incentive to randomize. Indeed, you will end up choosing each action with probability equal to the probability that a randomly chosen hypothesis thinks that that action is the best action (modulo confusing stuff related to ties, which we can ignore). If you have been following my recent posts, you should not be surprised that proportional representation comes out of taking a geometric expectation.

In a sentence, Thompson sampling can be thought of as "Geometrically maximize, with respect to your posterior on hypotheses, the probability that you choose the best action."

Once again, this is not the standard presentation of Thompson sampling. Indeed, Thompson sampling is not normally presented in a way involving a geometric expectation (or even an expected logarithm) at all. It is instead presented as probability matching, but the probability matching can be reformulated as geometric maximization.

Density Zero Exploration

Let us assume the true hypothesis $h_{t r u e}$ is given positive probability in $μ$ . We would like it to be the case that we are able to learn this fact, and act accordingly. In particular, we would like it if the limit as $t$ goes to infinity of the probability that $a_{t} = m (h_{t r u e}, O)$ converges to 1. Thompson sampling has this property. AIXI and plurality do not.

To see this, imagine that there is a hypothesis, $h_{w o l v e s}$ that is almost true, but thinks that the action $a$ will cause you to get eaten by wolves, and thus get very low reward.

AIXI and plurality will, if their priors assign enough probability to the $h_{w o l v e s}$ , never choose $a$ , and never update, because $h_{w o l v e s}$ will never make any verifiably wrong predictions. If the true hypothesis would have you take action $a$ some of the time, you will never learn this, and never take action $a$ .

The standard fix for this is $ε$ exploration. If you choose a random action with probability $ε$ , and otherwise use AIXI or plurality, you will successfully learn there are no wolves, and take the right action (with probability $1 - ε$ ). But Thompson sampling takes the right action with probability converging to 1!

You might think you can just have your exploration change over time and send $ε_{t}$ to 0 slowly, like by taking $ε = 1 / t$ . But nope, this doesn't work. If $h_{w o l v e s}$ only says you will get eaten by wolves on days that are a power of 2, it will with positive probability never make a wrong prediction, and still control your action (using AIXI or plurality) some of the time forever. Importantly, it isn't just controlling your action at random times. Maybe the days that are a power of 2 are the only ones that really matter.

It is actually hard to explore enough to not believe $h_{w o l v e s}$ without exploring with positive density. (By positive density, I mean exploring such that the limit as $t$ goes to infinity of the proportion of previous days you explored does not go to 0.) However, Thompson sampling pulls this off, it part by essentially listening to the hypotheses about when is the best time to explore, and not exploring when all the hypotheses agree.

Thompson sampling is exploring just enough to learn, which is to say it is asymptotically exploring almost never! AIXI and plurality are not exploring at all.

(As a quick sketch of an argument for why Thompson sampling has the right behavior here, observe that every time you randomly take the wrong action, every hypothesis that gave that action higher expected utility than the best action will (on average) scale down its probability relative to $h_{t r u e}$ , If the best action gives an amount of reward bounded away from 0 more than all other actions, we can bound how much measure each hypothesis that disagrees with $h_{t r u e}$ must lose. Any hypothesis that disagrees with $h_{t r u e}$ on what is the best action infinitely often will lose all its relative measure.)

The AM-GM Boundary

In Thompson sampling, we have two levels of uncertainty over the environment. There is uncertainty between hypotheses, and there is uncertainty within hypotheses. The uncertainty within hypotheses is combined arithmetically (with $E$ ) to select a favorite action (distribution) for each hypothesis, and then the uncertainty between hypotheses is combined geometrically (with $G$ ) to select a favorite action distribution overall. This structure is actually important for exploring correctly here.

Similarly, in Nash bargaining, uncertainty within individuals is resolved arithmetically (to get each individual's expected utilities), and uncertainty between individuals is resolved geometrically.

I call this distinction the AM-GM boundary (arithmetic mean/geometric mean boundary). There is a large space of stuff, and that stuff is combined using a geometric mean by default, but there are small clusters in which the stuff is combined using an arithmetic mean.

$G_{across clusters} E_{within each cluster} stuff$

There is a sort of boundary around the arithmetic mean clusters, where the outer geometric mean treats them as single units.

If you think of $G$ as enforcing proportional representation or fairness, then the arithmetic clusters are places where we say, "There is one person there. Its subparts are allowed to exploit each other."

Exploiting here is referring to maximizing the total quantity of stuff with no regard to fairness or egalitarianism, or whether everyone in the cluster gets some.

The AIXI proposal above illustrates the failure mode of making the epistemic arithmetic clusters too big. There is one big arithmetic cluster in that proposal. The concept of a utility monster instrumentally demonstrates the same thing.

On the other hand, if you make your arithmetic clusters too small, you could end up taking actions in proportion to their utilities, and effectively never maximizing at all.

I think where you draw this boundary depends on context. There are places where I endorse my subagents exploiting each other, and other places where I do not. Similarly for my family, my larger tribe, my country, my species, and my Everett branch. There is a hard question here related to personal autonomy. I am only trying to give a language to talk about what is going on.

Similarly to talking about expanding circles of moral concern, we can talk about expanding circles of exploitation. Maybe eventually we want our circle of exploitation to be large. Maybe we don't. Maybe even if we want it to be large eventually, we want to keep it small for now while we are still exploring and learning.

Super interesting series! Your second post actually gave me some insights on a problem that was a big part of my phd, but it will take me some time to think through. So here is a ~~simpler~~ unrelated comment.

Thompson sampling seems to be too perfectionistic to me: it wants to take an action that is optimal in some world, rather than one that is merely great in all worlds. For example, suppose you have a suitcase with a 6 digit lock. In each turn, you can make a guess at the correct combination. If you guess correctly, you get 10 utils, if you guess wrong you get 1 util. If you don’t guess but instead pass, you get 9 utils.

(The model is supposed to be greedy and ignore future advantages so information revealed from a wrong guess shouldn’t matter. If this bothers you, you can imaging that when you “pass”, you still get to make a guess at the correct combination, and you learn if it is correct, but you get 9 utils no matter if your guess is correct or not.)

Each possible combination is considered a hypothesis so each hypothesis is deterministic. Thompson sampling would never pass, because that is not the optimal thing to do for any particular hypothesis (although it is the optimal thing to do when you don’t know which hypothesis is true). Instead it would keep guessing combinations, because each combination is optimal is some hypothesis.

This is not an issue of AM vs. GM: the same thing happens in the plurality decision procedure when we only use AM. It is an issue about what we take argmax over. If we consider the combination to be due to randomness within a single hypothesis, the decision procedures will correctly choose to pass until the correct combination has been revealed.

Possibly related question: Is there any reason the AM-GM boundary is at the same level as where we take argmax in the definition of m? Or could we have an arithmetic expectation outside of argmax (hence outside of m) but inside the geometric expectation? Or even have a geometric expectation inside of argmax in m but possibly over the arithmetic expectation?

Yeah, the Thompson sampling and Nash bargaining are different in that the Thompson sampling proposal has two argmaxes, where as Nash bargaining only has one. There are really two things being brought it with Thompson sampling, and Plurality is what you get if you only add the inner argmax, and something like Nash bargaining is what you get if you only add the geometric part. There is no reason you have to add the two things at the same place. All I know is Thompson sampling has some pretty nice asymptotic guarantees.

You could just Nash bargain between your hypotheses directly, but then you are dependent on where the 0 point is. One nice thing about Thompson sampling is that it gives you a semi-principled place to put the 0, because the inner argmax means we convert everything to probabilities.

Thanks for your posts, Scott! This has been super interesting to follow.

Figuring out where to set the AM-GM boundary strikes me as maybe the key consideration wrt whether I should use GM -- otherwise I don't know how to use it in practical situations, plus it just makes GM feel inelegant.

From your VNM-rationality post, it seems like one way to think about the boundary is commensurability. You use AM within clusters whose members are willing to sacrifice for each other (are willing to make Kaldor-Hicks improvements, and have some common currency s.t. "K-H improvement" is well-defined; or, in another framing, have a meaningfully shared utility function) . Maybe that's roughly the right notion to start with? But then it feels strange to me to not consider things commensurate across epistemic viewpoints, especially if those views are contained in a single person (though GM-ing across internal drives does seem plausible to me).

I'd love to see you (or someone else) explore this idea more, and share hot takes about how to pin down the questions you allude to in the AM-GM boundary section of this post: where to set this boundary, examples of where you personally would set it in different cases, and what desiderata we should have for boundary-setting eventually. (It feels plausible to me that having maximally large clusters is in some important sense the right thing to aim for).

Wow, I came here to say literally the same thing about commensurability: that perhaps AM is for what's commensurable, and GM is for what's incommensurable.

Though, one note is that to me it actually seems fine to consider different epistemic viewpoints as incommensurate. These might be like different islands of low K-complexity, that each get some nice traction on the world but in very different ways, and where the path between them goes through inaccessibly-high K-complexity territory.

Great post!

We in universal AI already have a name for the generalization of AIXI that may use a different history distribution: AI\mu. Since you intend \mu to be a distribution over hypotheses, NOT the true environment as Marcus Hutter usually uses \mu, I would also preferably replace \mu -> \nu. This may seem pedantic but it's kind of triggering to use AIXI to refer to AI\nu, because the "XI" part of "AIXI" is (sometimes) intended to translate as \xi, Hutter's chronological version of the universal distribution.