This is mostly my attempt to understand Scott Garrabrant's Geometric Rationality series. He hasn't written much about it here recently (maybe he's off starting video game companies to save the world). I find the ideas very intriguing, but I also found his explanation from 3 years ago hard to follow. I try to simplify and explain things in my own words here, hoping that you all can correct me where I'm wrong.
The approach I take is to build up to a geometrically rational agent from small pieces, introducing some of the necessary theory as I go. This is a much more mechanical presentation than Garrabrant's philosophical series, but I find this kind of thing easier to follow. After building up to the idea of a geometrically rational agent, I'll follow Garrabrant in showing how this can be used to derive Bayes' rule, Thompson sampling, and Kelly betting.
In the process of unpacking what I think Garrabrant is pointing at with Geometric Rationality, I found a lot of open questions. I took a stab at answering some of them, but be warned that there are a number of unanswered questions here. If you know the answers, please fill me in.
Let's start by talking about what an agent is. This is a bit harder than it used to be, because everyone is all excited about LLM agents, which I think obscures some of the key features of an agent.
An agent is something with a goal. It observes the world, and then takes actions in order to manifest that goal in the world. The goal is often described in terms of maximizing some reward signal, though it doesn't need to be.
You are an agent. A thermostat is an agent. An LLM-agent might (or might not) be an agent, depending on how it was coded.
We can describe an agent's process in a few steps:

1. Observe: take in information about the world through its sensors.
2. Orient: update its model of the world with those observations.
3. Decide: choose the action (or action distribution) that best serves its goal.
4. Act: carry out the chosen action in the world.
An agent does these things repeatedly, in an OODA loop. There are other models for this kind of loop (Sense-Plan-Act, etc.), but I like this one.
There's a lot to be said about Observing the world. Sensor design and selection is a very interesting field. Similarly, Acting can be quite involved and technical (just look up recent research on grasping). We won't focus on Observing and Acting, as agent design per se often takes those for granted. Instead, we'll focus on Orienting and Deciding.
One of the most common ways to model the Orient and Decide stages of an agent is with an Expected Utility Maximizer. This is how it works:

1. Update your world model with your latest observations (typically via Bayes' rule).
2. For each action you could take, use the world model to predict the probability of each resulting world state.
3. Score each action by the expected utility of those world states.
4. Select the action with the highest expected utility.
Steps 1 and 2 there are the orient phase, and steps 3 and 4 are the decide phase. Generally two things are required in order to be able to do those:

- a world model: a probability distribution over world states, conditioned on observations and actions
- a utility function: a score for how much the agent likes each world state
If you wanted to represent this in equations, you would write it as follows:

$$\pi^* = \operatorname*{argmax}_{\pi \in \Delta(A)} \; \mathbb{E}_{w \sim P(w \mid o, \pi)}\big[U(w)\big]$$
Here's what these symbols mean:

- $\pi$ is a probability distribution over actions (a strategy), and $\pi^*$ is the one the agent selects.
- $\Delta(A)$ is the probability simplex over the set of pure actions $A$ (more on this below).
- $o$ is the agent's observations.
- $P(w \mid o, \pi)$ is the agent's world model: how likely world state $w$ is, given its observations and its chosen strategy.
- $U(w)$ is the agent's utility function, scoring how much it likes world state $w$.
There are various ways to simplify this (like smooshing the world-model and utility function into one combined thing), but we'll leave it as is. Any agent that orients and decides like this, we'll call an EUM agent.
This model works quite well, and is used all over the place in reinforcement learning, gambling, robotics, etc.
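To make the orient/decide loop concrete, here's a minimal sketch in Python. Everything in it (the toy actions, the `world_model` and `utility` functions) is an illustrative assumption, not something from Garrabrant's series.

```python
# Minimal sketch of an EUM agent's orient/decide steps.
# The actions, world model, and utility values below are made up for illustration.

def expected_utility(action, world_model, utility):
    """world_model(action) returns {world_state: probability}; weight each
    state's utility by how likely the world model says it is."""
    return sum(prob * utility(state) for state, prob in world_model(action).items())

def eum_choose(actions, world_model, utility):
    """Decide: pick the pure action with the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, world_model, utility))

def world_model(action):
    # Toy world model: the "risky" action wins more often in this example.
    return {"win": 0.3, "lose": 0.7} if action == "risky" else {"win": 0.1, "lose": 0.9}

utility = lambda state: 10.0 if state == "win" else 1.0
print(eum_choose(["risky", "safe"], world_model, utility))  # -> "risky" (EU 3.7 vs 1.9)
```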
Let's pause here to say a bit more about what actions the EUM agent is going to look at. The definition we gave above had $\operatorname{argmax}_{\pi \in \Delta(A)}$. That means we select the action distribution that provides the highest value for everything that follows. The actions that we consider are all drawn from $\Delta(A)$, which is the probability simplex of $A$. In other words, it's the set of all probability distributions over all the pure actions in $A$.
As an example, if the only actions available to the agent were turn-left and turn-right, then the action space would be $\Delta(A) = \{(p_{\text{left}}, p_{\text{right}}) : p_{\text{left}} + p_{\text{right}} = 1,\ p_{\text{left}}, p_{\text{right}} \geq 0\}$, where $p_{\text{left}}$ is the probability of selecting action turn-left, and $p_{\text{right}}$ is the probability of selecting action turn-right.
A pure strategy is one in which the probability distribution chosen is 1 for a single action and 0 for all other actions. There are two pure strategies for the turn-left/turn-right options.
EUM agents are often described as selecting actions over just the set of pure actions $A$, and not the probability simplex $\Delta(A)$. This is because EUM agents (when not playing with/against other EUM agents) will always select a pure strategy or think some pure strategy is as good as the best mixed strategy. For an EUM agent, pure strategies are always good enough.
To see this, note that the expected value of a mixed strategy ends up being a weighted sum of the expected values of the pure strategies in the mixture.
There exists a pure strategy in this mixture that either:

- has an expected value equal to that of every other pure strategy in the mixture, or
- has an expected value strictly greater than that of at least one other pure strategy in the mixture.

If we weight that high EV pure strategy with a higher probability in the mixture, the EV of the mixture will:

- stay the same (in the first case), or
- increase (in the second case).
This means we can maximize an EUM agent's expected score by squeezing all of its action probability into the highest scoring pure action.
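Written out, the whole argument is one line. For a mixed strategy $\pi$ over pure actions $a \in A$:

$$\mathbb{E}_{a \sim \pi}\big[\mathbb{E}[U \mid a]\big] = \sum_{a \in A} \pi(a)\, \mathbb{E}[U \mid a] \;\leq\; \max_{a \in A} \mathbb{E}[U \mid a],$$

with equality when $\pi$ puts all of its probability on a best pure action.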
I added the probability simplex to the EUM definition because we'll use it a lot later, and it will be convenient to define all the agents as optimizing over the same action space.
Now imagine there are two agents. These agents each have their own utility function and their own world model. These agents would like to join forces so that they can accomplish more together.
A standard way for two EUM agents to work together is Nash bargaining.
In Nash bargaining, each agent follows the same process:

1. Work out your BATNA: the expected utility of the best action you could take if the two of you don't cooperate.
2. For each joint option on the table, compute your delta: the expected utility of that option minus your BATNA.
3. Throw out any option that gives either agent a negative delta.
4. Of the remaining options, jointly select the one that maximizes the product of the two agents' deltas.
As an equation, this would be:

$$\pi^* = \operatorname*{argmax}_{\pi \in \Delta(A)} \big(\mathbb{E}_\pi[U_1] - B_1\big)\big(\mathbb{E}_\pi[U_2] - B_2\big)$$
The new symbols here are:

- $U_1$ and $U_2$, the two agents' utility functions.
- $B_1$ and $B_2$, the utility each agent expects from its BATNA (what it gets if the agents don't cooperate).
I left any conditioning on observations implicit for this equation, but that's just to keep it easy to read.
One terminology note here: the product is very similar to a geometric mean. To find the geometric mean of two values, you multiply them and then take the square root of the result. Since the square root is monotonically increasing (for positive inputs), it doesn't change an argmax output at all. The product terms used in Nash bargaining are a key reason that we're driving towards a geometric theory of decision making.
Garrabrant has a philosophical justification for Nash bargaining: "When Nash Bargaining, you are really just geometrically maximizing expected utility with respect to your uncertainty about your identity." For him, Nash bargaining is very fair, and provides an ethical foundation for cooperation. He has a large discussion on how to adequately disburse utility when behind the veil of ignorance.
This is all well and good, but what are we to make of this for agent design? Our agent knows who it is; why would it not dispense with Nash bargaining and do what is better for itself? The answer may be in the need for agents to willingly help each other. Agent 1 gets more with agent 2's help than without it, so it needs to offer agent 2 enough to secure that help.
Additionally, we may think of ourselves as the creators of numerous future agents. These future agents are all in some sense our descendants, and we want them to be able to work together. Perhaps we want to encourage fairness between them specifically because we care for them equally.
There's another very nice feature of Nash Bargaining: it lets the agents access mixed strategies. Recall that EUM agents always accept a pure strategy as being at least as good as any mixed strategy. That's not true for Nash bargaining.
To see that Nash bargaining can prefer mixed to pure strategies, notice that it is maximizing a product and not a sum. The product is between a weighted sum of one agent's utilities and that of the other. In other words, assuming 0 BATNAs, the objective is $\big(\sum_a \pi(a)\,U_1(a)\big)\big(\sum_a \pi(a)\,U_2(a)\big)$. Expanding this gives terms that are products of individual utilities, and so can be much higher than the pure utilities themselves. It's the higher order nature of the objective function that means mixed strategies may be preferred by the agents acting together.
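Here's a tiny numeric check of that claim (the utilities are made up, and BATNAs are assumed to be 0): each agent cares about only one of two actions, so every pure strategy gives a Nash product of 0, while the 50/50 mix does strictly better.

```python
# Toy check that Nash bargaining (with 0 BATNAs) can strictly prefer a mixed strategy.
import numpy as np

u1 = {"x": 1.0, "y": 0.0}   # agent 1 only values action "x" (illustrative numbers)
u2 = {"x": 0.0, "y": 1.0}   # agent 2 only values action "y"

best_p, best_product = None, -1.0
for p in np.linspace(0.0, 1.0, 101):       # p = probability of choosing "x"
    eu1 = p * u1["x"] + (1 - p) * u1["y"]  # agent 1's expected utility under the mix
    eu2 = p * u2["x"] + (1 - p) * u2["y"]  # agent 2's expected utility under the mix
    if eu1 * eu2 > best_product:           # Nash product, BATNAs assumed 0
        best_p, best_product = p, eu1 * eu2

print(best_p, best_product)  # -> 0.5 and 0.25; both pure strategies score 0
```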
Garrabrant phrases this as Nash bargaining allowing agents to agree on a policy, instead of an action. I originally found that confusing, until I realized that his term policy just means a probability distribution over actions (which could include acting with a pure strategy). There's probably more to be said about institution design to assist agents in coordinating, but for now we'll focus on the simple case of mixed strategies. These mixed strategies are crucial to the benefits of Geometric Rationality. We will see later how a mixed strategy protects groups of agents from being Dutch booked.
Nash bargaining doesn't depend on the units each agent uses for calculating utility. This is an important point for Garrabrant, so let's linger on it. How are we to deal with the differing units when we're taking the product of them?
Garrabrant talks about this extensively in Utilitarianism Meets Egalitarianism. The problem isn't completely intractable. We are assuming that one utility can be linearly transformed into the other. This means that we can turn a value in one agent's utility system into a value in the other agent's utility system just by running it through an equation like $U_2 = a\,U_1 + b$.
The problem, of course, is that we don't know what $a$ and $b$ actually are. We can't look at the utility values that an agent assigns to world outcomes to figure it out either. We know we can map utility values between unit systems, but the mapping from world-states to utility system is not the same at all. You like pizza a lot, I like sushi a lot. Just because our "a lot"s are comparable doesn't mean that telling someone I like sushi a lot gives them any info about whether you like pizza. Similarly, knowing what we both like doesn't give enough info to compute the transformation between utilities.
We don't really need the scale factor $a$ (as long as it's positive), since we're taking an argmax. We can just do the multiplication to get a result, then find the action that maximizes that result. Since we don't care about the real maximized utility value (just the action that achieves it), it doesn't matter if the units on our utility are a complicated mess.
As an intuition pump, imagine Alice gets paid in dollars and Bob gets paid in Yen. The outcome unit is dollars*yen, but that doesn't matter. More is more in both unit systems, and we don't need to convert yen to dollars to figure out what the value maximizing outcome is.
Garrabrant identifies the parameter $b$ (the zero-point) as being the most important for making utility comparisons. With dollars and yen, 0 is the same in both systems. With utilities, your +5 could be my -2. If we multiply those, the result is negative. Garrabrant talks a bit about this, but even in his sequence he notes that it's a problem.
The way Nash Bargaining gets around this problem is comparative in nature. Instead of using the utilities each person has for the next state, it uses the delta of utility each agent has between their BATNA's state and the cooperative state. Then improvements are always positive (even if the improvement is from terrible to just bad).
Nash bargaining explicitly requires agents to have a BATNA. That BATNA sets the "0 point" of the utilities they're trying to maximize. Every alternative the agents consider during the choice phase is compared against this BATNA, by subtracting the BATNA utility from the new utility. These deltas are then used in the Nash bargaining product to determine action choice.
What happens if two agents both would have a negative delta? Some choice gives them both less utility than their BATNA. The product of two negatives is positive, so an argmax may select it. This is why the Nash bargaining explanation above made such a big deal about filtering out options that an agent saw as negative delta utility. Without that filtering stage, our argmax wouldn't work. Formally, this means we're looking at a constrained optimization problem, with the constraint given by the requirement that outcomes need to have positive delta utility for each agent.
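Here's a small sketch of that constrained search over a handful of pure options (the option names, utilities, and BATNA values are made up for illustration, and mixed strategies are left out to keep it short): compute each agent's delta over its BATNA, filter out anything that leaves either agent worse off, then take the argmax of the product of deltas.

```python
# Toy Nash bargaining with explicit BATNA deltas and the filtering step.
options = {                      # option -> (agent 1 utility, agent 2 utility), illustrative
    "cooperate_plan_a": (5.0, 4.0),
    "cooperate_plan_b": (2.0, 9.0),
    "bad_for_both": (0.5, 0.5),
}
batna = (1.0, 2.0)               # utility each agent gets by walking away

def nash_score(utils):
    d1, d2 = utils[0] - batna[0], utils[1] - batna[1]  # deltas over BATNA
    return None if d1 < 0 or d2 < 0 else d1 * d2       # filter, then take the product

feasible = {name: s for name, u in options.items() if (s := nash_score(u)) is not None}
print(max(feasible, key=feasible.get))  # -> "cooperate_plan_a": deltas (4, 2), product 8
```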
This is concerning. We're trying to build up from some foundational pieces to a larger theory of decision making. Nash bargaining side steps what seems like a large issue by introducing comparison points and option filtering. Even if we're willing to accept the added complexity of this scheme, we may worry that future types of problem wouldn't be amenable to it.
Garrabrant kind of waves this away by saying that GR works well for "natural features" of the world like dollars or coconuts. In other words, we're assuming utility can't go negative. I don't feel that this works well, because even if we constrain ourselves to only natural features that can't go negative, we can still have our stash go down. In Garrabrant's defense, this is explicit in his article on the geometric expectation and is called out as a major limitation in his last post.
Much of Garrabrant's discussion is not about iterated decision making, so he can get away with handwaving it. If (as he starts out suggesting) you're going behind a veil of ignorance and deciding who will get what, the veil provides a natural 0 point of nobody getting anything.
Our agent models are a bit more complicated, since we're assuming the agents act through time in a world that is continually changing (similar to a POMDP). Each agent has to do something, even if that thing isn't to cooperate with the other agent. We have sort of artificially brought the agents together to perform this Nash bargaining procedure, but they're always free to ignore the result and do something that's better for them.
This opens up a lot of questions. What if the agents form a contract? What if they use threats against each other? What if they actually fight over the course of an iterated game? I'm very curious about how these situations will go, but it's a bit outside the scope of this article. Let's continue with the assumption that the agents do manage to find a positive sum way of collaborating.
Extending the above argument to more than two collaborators seems straightforward, but there are a few thorny issues to address.
We can naively extend Nash bargaining to teams of agents that are all collaborating simply by including all agents in the product. This makes the definition of the collective Nash bargaining team:

$$\pi^* = \operatorname*{argmax}_{\pi \in \Delta(A)} \prod_i \big(\mathbb{E}_\pi[U_i] - B_i\big)$$
This is identical to the two party Nash bargaining solution if there are two agents.
The issue here again comes down to the 0 points of each agent's utility function. In the two party case, the BATNA is obvious because agent 1 can either work with agent 2 or not. With more than 2 agents, you run into cases where 1 & 2 want to work together, but 3 doesn't. The more agents you combine, the bigger this issue gets.
There could be fairly extreme faction building within groups of agents on the team. In the extreme case, the team splits into several teams along faction lines. This seems like one of the areas that needs a lot more investigation. To move forward, we can make a simplifying assumption that if all agents don't agree, then they each go forward with no collaboration at all. Under this assumption, the BATNA for each agent is again based only on its own individual action, and not on an assumption of faction-building.
There's another wrinkle we should address here: what if some agents get more votes than other agents in what the team does? This is common in e.g. corporations where some people can hold more voting shares than others. It can happen in politics where states have proportional representation within the federal government.
Up to this point we've been modeling the Nash bargaining solution as giving equal weight to all agents. If we want to weight agents differently, this is actually quite easy. We take each factor in the utility product to the power of its agent's weight.
In equations, this is:

$$\pi^* = \operatorname*{argmax}_{\pi \in \Delta(A)} \prod_i \big(\mathbb{E}_\pi[U_i] - B_i\big)^{w_i}$$
With a uniform weighting of agents, we have $w_i = \frac{1}{n}$ for all $i$. We can choose whatever distribution we want for these weights, and Garrabrant's series spends a lot of time arguing that this selection has philosophical underpinnings (such as representing how likely you are to be a certain person if you're behind the veil of ignorance).
In cases like corporate share ownership where some agents get more votes, but votes come in discrete units, you could represent the Nash bargaining product as having one factor for each vote. Since some agents get multiple votes, their preference gets repeated for each vote that they have. This can be represented as taking their single preference to a power, where the exponent is their number of votes. The argmax will produce the same result even if we take the entire function to some other power, so instead of taking each agent's scoring function to the power of its vote number we can take it to the proportion of its votes in the vote total.
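The equivalence that last step relies on is just that raising the whole objective to a positive power doesn't move the argmax. With $v_i$ votes for agent $i$ and $V = \sum_i v_i$ total votes:

$$\operatorname*{argmax}_{\pi} \prod_i \big(\mathbb{E}_\pi[U_i] - B_i\big)^{v_i} \;=\; \operatorname*{argmax}_{\pi} \left[\prod_i \big(\mathbb{E}_\pi[U_i] - B_i\big)^{v_i}\right]^{1/V} \;=\; \operatorname*{argmax}_{\pi} \prod_i \big(\mathbb{E}_\pi[U_i] - B_i\big)^{v_i/V}$$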
Now we're looking at full geometric expectation. Finally we've come to the "geometric" part of Garrabrant's geometric rationality.
What's more, this new team acts like a single agent. Remember back to our original definition: an agent is something with a goal, which observes the world and then takes actions in order to manifest that goal in the world. The Nash bargaining team does all of that.
This is not an EUM agent though. It doesn't optimize in the same way as an EUM agent, and will choose different actions than an equivalent agent that somehow was given its same goal. I'll call agents that act in this way "Geometrically Rational" agents, or just GR agents.
This could be represented as an EUM agent if we took the log of the scoring function. That would turn our products into sums and our exponents into products. We could solve this EUM agent over log-value and get the same answer as our GR agent, but the interpretation becomes trickier. As Garrabrant says, this formulation looks at optimizing log of value. The question of why you would want to optimize log-dollars (or log of any other thing you value) is a confusing one without a bargaining perspective.
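Concretely, the log transform turns the weighted Nash product into a weighted sum of logs:

$$\log \prod_i \big(\mathbb{E}_\pi[U_i] - B_i\big)^{w_i} = \sum_i w_i \log\big(\mathbb{E}_\pi[U_i] - B_i\big)$$

Same argmax, but the thing being summed is now the log of each sub-agent's gain, which is where the "maximize log-dollars" framing comes from.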
As mentioned, the GR agent is not an EUM agent. Many would consider this a problem, as EUM agents are the only kind of agent that fully satisfies the von Neumann Morgenstern axioms. These are:

- Completeness: the agent has a preference (possibly indifference) between any two lotteries.
- Transitivity: if it prefers A to B and B to C, then it prefers A to C.
- Continuity: if it prefers A to B to C, there is some probability mix of A and C that it finds exactly as good as B.
- Independence: mixing A and B each with the same probability of some third lottery C doesn't change which of the two it prefers.
Since the GR agent is not an EUM agent, it must violate at least one of these axioms. Garrabrant wants to do away with the Axiom of Independence.
At the risk of belaboring the obvious, there's no way we can prove that getting rid of Independence is the right thing to do. von Neumann and Morgenstern take it as an axiom, and just assume that it's worth doing. Garrabrant doesn't think so.
EUM agents think taking certain actions are best (such as betting it all on the highest EV outcome). GR agents think taking other actions are best (such as Kelly betting, as we'll soon see). It's sometimes fun to watch people argue about expectation maximization vs Kelly betting, because often each side of the argument thinks that their side is just obviously right. Garrabrant does the best I've seen at describing the issue here.
In the end, we just have to look at the two types of agent and see what they would do. Then we can pick the one that seems most consistent with wisdom, from our own personal perspective.
von Neumann and Morgenstern included the axiom of independence because it enforces consistency. I am kind of analogizing this as a Markovian property. Once you see what decision you're presented with (you have a specific world-model conditioned on your observations), you'll always make the same decision. It doesn't matter how you got there.
Dutch book arguments demonstrate that if you decide in a Markovian way, you need a special kind of consistency in order to avoid being taken advantage of. Here's how Dutch book arguments work:

1. Assume the agent prefers outcome A to outcome B when choosing between them directly.
2. Assume the agent also prefers a lottery between B and C over the same lottery with B swapped out for A (a violation of Independence).
3. A bookie sells the agent the B/C lottery, which it prefers.
4. If the lottery resolves to C, nothing further happens. If it resolves to B, the bookie offers to swap B for A in exchange for a small fee, which (by its direct preference) the agent pays.
5. The agent has now paid a fee to end up holding what amounts to the A/C lottery, which it could have had for free and which it ranked below the lottery it started with. Repeat, and the agent bleeds money.
An EUM agent would never encounter this scenario, because the first two assumptions are incompatible with the axiom of Independence. According to vNM, in order to avoid being taken advantage of by a Dutch book, agents must update their world models using Bayesian probability and assign utilities using vNM's axioms.
A GR agent may encounter this scenario. When the GR agent sees the lottery resolve such that outcome B occurs (not outcome C), it then gets the choice to switch to A. The agent has to track which mixed strategy it's executing to avoid being Dutch booked, instead of treating each decision point in isolation.
According to Garrabrant, a GR agent may encounter this scenario but would decline to change their choice when the bookie offers the option.
In arguing against the axiom of independence, Garrabrant presents a parable. A married couple is deciding where to move. The agent is the married couple in this case, and not either spouse. The agent would prefer moving to Boston over Atlanta, which matches the husband's preferences but not the wife's. Given some probability of moving to SF, the married couple agent would change its preference from Boston to Atlanta. This is because SF satisfies the husband's utility more than Boston, so in fairness they weight Atlanta higher than Boston to give more utility to the wife in expectation. They are distributing the expected utility gain of the SF option between the two of them by rebalancing their agreed strategy for choosing between already existing options. The rebalancing is necessary because there is a disagreement that is internal to the gestalt "married couple" agent.
To me, this is wisdom.
An EUM agent doesn't change its choice under a Dutch book because it's indifferent to the new information (by the axiom). A GR agent doesn't change its choice because it is attempting to be fair to its sub-parts.
This also explains why a GR agent would change preferences if given a lottery with an "irrelevant" outcome. That irrelevant outcome can have its additional utility distributed among sub-parts in a non-uniform (more fair) way.
Unfortunately, this seems either intractable or non-Markovian. Discovering new options can swap preferences between already known options for a GR agent. Garrabrant suggests that some kind of updateless decision theory might save this. I don't know about that, so let's just assume that ditching the axiom of independence is fine and see what happens.
Now that we have a new kind of agent, we can ask how it would act alongside other agents. If we wanted this agent to act alongside an EUM agent, the natural way is to just fold that EUM agent into the team alongside the other sub-agents. But what if we want a GR agent to collaborate with another GR agent?
Agents Bob, Brenda, and Bernice team up and start society B, which acts as a GR agent. Agents Charlie, Clarisse, and Chester team up and start society C, which also acts as a GR agent. Each of these societies satisfies the needs of its members through Geometric Rationality as above.
After some time, societies B and C decide to collaborate on a project. How should those societies aggregate their high level decisions and knowledge? Will that still serve their individual member agents? Let's try Nash bargaining again. It worked well last time.
In this case, the societies don't have a utility function. Remember that we're violating one of the axioms needed in order to construct a utility function. For EUM agents, the utility function is what we take the argmax over to decide on an action. For GR agents like these societies, we'll consider the function we're argmax'ing over to be our scoring function. This scoring function will be used in the Nash bargaining construction.
For the GR agent of society B, consider its scoring function to be:

$$S_B(\pi) = \prod_{i \in B} \big(\mathbb{E}_\pi[U_i] - b_i\big)$$

where $U_i$ and $b_i$ are sub-agent $i$'s utility function and BATNA.
There's an important wrinkle here that we should iron out. The above scoring function is for the individual GR society, and the BATNAs in it are for each sub-agent to go its own way instead of collaborating with society B.
Since we have two GR societies, and we're Nash bargaining, we should choose actions that maximize the product of our two scoring functions less their BATNAs:

$$\pi^* = \operatorname*{argmax}_{\pi \in \Delta(A)} \big(S_B(\pi) - D_B\big)\big(S_C(\pi) - D_C\big)$$

When we do this, the BATNAs we're looking at ($D_B$ and $D_C$) are society level BATNAs, not sub-agent level BATNAs.
The action space here represents the joint actions of both societies B and C. Since societies act through the individual actions of their agents, the actual action space is given by the joint actions for all agents in both societies.
The society level BATNAs are not utility values, since they're not from a utility function. Instead, they are the result of a product like $D_B = \prod_{i \in B} \big(u_i^{B} - b_i\big)$. In this case, $b_i$ represents agent $i$'s utility given its best action if it separated from society B. The value $u_i^{B}$ represents agent $i$'s utility under society B's best joint action if it stayed in society B, but society B did not collaborate with society C. The values $D_B$ and $D_C$ are just constants from the perspective of the argmax.
To simplify this, let's multiply it out. We'll reindex the products at the same time:

$$\big(S_B(\pi) - D_B\big)\big(S_C(\pi) - D_C\big) = \prod_{i \in B \cup C}\big(\mathbb{E}_\pi[U_i] - b_i\big) \;-\; D_C\,S_B(\pi) \;-\; D_B\,S_C(\pi) \;+\; D_B D_C$$

The first term, with its single reindexed product, is exactly what we'd get if all the sub-agents from both societies Nash bargained together directly.
The constant term $D_B D_C$ doesn't impact the argmax at all, so we can disregard it immediately. The other two cross terms involving society-level BATNAs make things pretty complex. If those society level BATNAs were 0, this would be much simpler. Unfortunately, the only way to make them 0 is if at least one agent from each society would be better served by not being in its society in the first place.
This more complicated result shouldn't be too surprising to us, since we already had to wave away certain factionalism assumptions when we were talking about forming our multi-agent team. If two teams collaborate, they're bringing that factionalism back in at the outset.
We saw above that Nash bargaining between societies of agents does not reproduce the math of those agents all Nash bargaining amongst each other. I think this is not what Garrabrant hoped for. He argues in a few places that GR is scale free. He seems very hopeful that any number of agents can join up in coalitions, and combining their goals proceeds using the same rules. He doesn't want it to matter how you aggregate their goals.
Why would an individual EUM agent care about whether it was part of a coalition or a single large team? Given unlimited compute and thinking time, maybe it wouldn't. Agents don't have unlimited resources, so institutions may form that would approximate GR teams. If so, these institutions may function optimally with a specific number of sub-agents. This could lead to partitioning of EUMs into teams. These teams would develop internal institutions to reduce transaction costs, which would result in cross terms in value scoring when teams collaborate.
Some of Garrabrant's writings seem to be driving towards some way of resolving gerrymandering. If only we could use Nash bargaining to let people combine preferences, we would find a new voting scheme that makes gerrymandering a thing of the past. While I personally think gerrymandering can be fairly destructive, thinking through GR has made me realize the point of it: different neighborhoods may want representation of their own specific needs. Gerrymandering can act as an institutionalizing force that (steelmanning here) is supposed to reduce transaction costs for a given neighborhood to represent itself. In practice gerrymandering doesn't seem to do this, but if we assume we could magically replace it with more natural coalitions then the cross-terms in the Nash bargaining would persist and would change optimal actions from what they would be under an all-inclusive single team.
Up to this point, we have built up our teams out of Expected Utility Maximizers. It would be very nice if we could somehow replace that expected utility maximization with another geometric maximization. Can we do this?
Richard Ngo is working towards something like this, and explicitly references the Geometric Rationality work. He refers to Minsky's old book Society of Mind, which makes the very natural (for today) argument that people are made up of sub-agents. This also resonates with the Internal Family Systems approach to therapy.
I don't know what humans are made of, but let's try to address the idea of making GR agents from things that aren't EUM agents. This isn't quite what Garrabrant does. He seems to approve of a union of GR agents and EUM agents, and talks about the usefulness of the arithmetic-mean/geometric-mean boundary. He describes the smaller EUM agents as being places where fairness isn't required. Equivalently, within an EUM agent, sub-agents are allowed to exploit each other.
If I think about a human mind as a society of sub-agents, I admit that they all survive if one survives (under nominal conditions). From that perspective, there's a forced utility sharing amongst them all. This could imply that these sub-agents exploiting each other is fine, because there's a floor to the degradation in utility any one of them can experience. On the other hand, via introspection I know that some parts of me value certain things in the world more highly than my own survival. Perhaps those parts aren't necessarily on board with Garrabrant's drawing of the AM/GM boundary.
Whatever the case for a human, let's try to make an artificial agent that's GR all the way down. We will do away with the AM/GM boundary in theory, if nowhere else.
At some point, we need some base agent. Something that is not a GR agent built out of other agents. How can we get a GR agent that has no sub-agents?
Possibly the simplest sub-agent we can look at is something that just looks at the world and counts things in it. Its scoring function can just output the count:

$$S(w) = \text{the number of the feature (dollars, bananas, coconuts) present in world state } w$$
These natural features of the world, as Garrabrant would call them, form a straightforward way to prefer worlds to each other. They have natural 0 points and natural scales. If two natural feature scoring functions are over the same "thing" in the world, it is easy to transform between their units.
An objection arises: what about cases where more isn't always better? What about set point regulators? For example:

- a thermostat that wants the room at its set temperature, not as hot as possible
- whatever keeps your body temperature near 98.6 degrees F, rather than maximizing it
Such set point agents seem required for a lot of the ways that people and other agents work (remember the thermostat was one of the first examples of an agent in this article).
Consider the task of regulating a human's temperature. Model it as an agent with the goal of keeping the temperature at 98.6 degrees F. We will construct a GR agent that matches this set-point goal from two natural feature utility functions.
Let score 1 be $S_1(T) = T$ and score 2 be $S_2(T) = 197.2 - T$. Perhaps score 1 measures something like robustness against fungal infections, and score 2 measures something like avoiding heat stroke.
The product of these scores gives a parabolic scoring function with a maximum given by temperature of 98.6. This has a quadratic penalty for deviations instead of linear as above, but I think it works for the idea of a set point regulator.
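A quick numeric sketch of the construction, using the two assumed linear scores from above:

```python
# A set-point "regulator" built from the product of two linear natural-feature scores.
# S1 rewards higher temperature, S2 rewards lower temperature; the product is a
# downward parabola that peaks halfway between the two zero points.
import numpy as np

s1 = lambda temp: temp            # e.g. robustness against fungal infection
s2 = lambda temp: 197.2 - temp    # e.g. avoiding heat stroke

temps = np.linspace(0.0, 197.2, 19721)
scores = s1(temps) * s2(temps)    # Nash-style product of the two scores
print(temps[np.argmax(scores)])   # -> 98.6, the set point
```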
The problem with this construction is that one of the utility functions has a negative slope, and its score will go negative for temperature values that are high enough. If we were to use our GR agent design to Nash bargain between these scoring rules, one of the agents would decline to cooperate in this scenario. This would cause the grouping of these two agents to fail to regulate for temperatures outside of 0 to 197.2.
For a set point regulator of a human body's temperature, this seems fine. The human won't be alive at the temperature where this breaks down anyway. In reality, whatever controls the set point for human temperature probably depends on many factors, not on just two. For a theory of decision making grounded in a single kind of agent, this is still a limitation that would be better avoided.
This is not the most parsimonious way of grounding our agents in something real, but it seems to unblock us in carefully constructed regions of value-space. It still remains unclear what world model or action space such an agent would be using. Does one of the agents control the shiver response and the other control the sweat response?
I would want to see a more thorough exploration of this area of Geometric Rationality.
The idea of grounding GR agents in something that's not an EUM is my own, not Garrabrant's. Garrabrant would argue against doing this, I think. He has a post on drawing the boundary between arithmetic mean maximization and geometric mean maximization, where he says "if you make your arithmetic clusters too small, you could end up taking actions in proportion to their utilities, and effectively never maximizing at all."
Imagine a GR agent that Nash bargains between 10 different natural feature agents. Imagine that each natural feature agent cares about only one thing (number of dollars, number of bananas, number of pearls, etc.) and the natural features don't overlap with each other. If the available actions could only result in getting one resource at a time, they may choose a mixed strategy that gives each act the same probability. Outcomes are simply distributed via probability, with no outcome being maximized.
It seems Garrabrant wants to deal with this by just having EUM agents in certain places, but to be honest I would like a theory with fewer moving parts. I'm not certain it's possible, but I do think it's worth exploring more. I'd be interested in exploring what behavior comes from competing GR agents rather than cooperating ones, and seeing whether some of those behaviors give similar maximization performance.
After all of that work, we have reached the point where we can describe:

- a single EUM agent, which orients with a world model and decides by maximizing expected utility,
- a team of such agents that coordinates through (possibly weighted) Nash bargaining, and
- a Geometrically Rational agent built that way all the way down, grounded in simple natural-feature sub-agents.
What does all of this get us? Hopefully something useful after slogging through all that.
Geometric maximization produces Bayes' rule when used to condition on evidence. This is a bit tangential to the agent formulation we've been going over, but it's such a nice feature of the geometric worldview that it's worth going over. I'll follow Garrabrant's explanation pretty closely here.
Imagine you have a world model that's represented as a probability distribution $P$ over ways that the world could be. Normally when you receive some observation $O$, you could condition your world-model on it via Bayes' rule. To simplify things a bit, we can say that the set of possible world states that have non-zero probability given $O$ is written $X$. Instead of explicitly using Bayes' rule, let's find the probability distribution $Q$ such that

$$Q = \operatorname*{argmax}_{Q'} \prod_{w} Q'(w)^{P(w)}$$
The product in that equation is just the geometric expectation of our new probability as measured by our original probability. If we wanted to reframe this in an agent formulation, we could say that we're just Nash bargaining among "probability agents" where each probability agent wants to maximize its own value, but their negotiating power is given by how much they were predicted prior to the observation.
Let's find out what this probability agent would do by solving for the argmax.
Of course we do have to worry about the fact that our argmax is not over the entire simplex of original worlds. It's limited by the constraint that the new probability equals 0 for worlds that are impossible given our observation. In other words, $Q(X) = 1$ and $Q(\lnot X) = 0$. That means that some $Q(w)$ will be 0. These 0 probabilities for worlds that we don't observe wipe out the product.
Garrabrant deals with these 0s by taking the limit as the probability of the observation approaches 1: as $P(X) \to 1$, the exponents on the impossible worlds go to 0, and those factors stop wiping out the product.
Let's solve this as an optimization problem. I'll do so by taking the logarithm, but don't be confused. We aren't using log-probability as our "value" here. It's just a way of solving for the argmax of the product.
We want to maximize $\sum_{w} P(w) \log Q(w)$. This will be subject to two constraints:

- the new probabilities must sum to 1: $\sum_w Q(w) = 1$
- worlds ruled out by the observation must get probability 0: $Q(w) = 0$ for $w \notin X$
We'll use the method of Lagrange multipliers to solve this. I won't belabor the details here, but the procedure introduces two Lagrange multipliers, one for each constraint, whose values get pinned down in the limit described above.
We also have the set of equations:

$$Q(w) = \frac{P(w)}{P(X)} \;\; \text{for } w \in X, \qquad Q(w) = 0 \;\; \text{for } w \notin X$$
Notice that these updated probabilities match the outcome of applying Bayes rule.
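A small numeric check of that claim (three worlds, one ruled out by the observation; brute-force search instead of Lagrange multipliers, and the factor for the disallowed world is dropped per the limit argument above):

```python
# Check that maximizing the geometric expectation prod_w Q(w)^P(w), with Q
# restricted to the worlds allowed by the observation, recovers P(w | X).
import numpy as np

P = np.array([0.5, 0.3, 0.2])            # prior over three worlds
allowed = np.array([True, True, False])  # the observation rules out world 2

best_Q, best_score = None, -np.inf
for q0 in np.linspace(0.001, 0.999, 999):
    Q = np.array([q0, 1.0 - q0, 0.0])                # supported only on allowed worlds
    score = np.sum(P[allowed] * np.log(Q[allowed]))  # log of the geometric expectation
    if score > best_score:
        best_Q, best_score = Q, score

print(best_Q)                          # -> approximately [0.625, 0.375, 0.0]
print(P[allowed] / P[allowed].sum())   # Bayes posterior: [0.625, 0.375]
```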
Garrabrant goes on to discuss some different things you can do with this idea. In particular, he discusses intervening on your beliefs to explore counterfactuals. This is very interesting to me, and I plan to think about it in the context of Judea Pearl's approach to causation.
One of the main places that OODA agents of the form I showed above are used is in reinforcement learning. Often, EUM agents of some form are used there. This requires some ad-hoc tweaks, because EUM agents never explore. Remember, they always pick the action that maximizes their expected reward, instead of doing something that could get more information at the cost of (possibly) lower reward.
To account for this, epsilon exploration is sometimes used. This is a hack tacked on top of EUM agents where the agent's preferred action is chosen with probability 1-epsilon, and with probability epsilon some other action is chosen. To encourage exploration at the beginning of training but not at the end, epsilon can change over time. This is not very elegant at all.
GR agents are much better at making the explore/exploit tradeoff, because it turns out that they can implement something called Thompson Sampling. The gentlest traditional intro to Thompson Sampling is Allen Downey's, which I highly recommend.
Here's the normal presentation of Thompson Sampling for an n-armed bandit problem:

1. Maintain a probability distribution (a posterior) over each arm's payout.
2. Draw one sample from each arm's distribution.
3. Pull the arm whose sample was highest.
4. Observe the payout, update that arm's distribution, and repeat.
This has several nice properties. For one thing, it performs each action with exactly the probability that it may be the best action to perform (exploring when it makes sense to). For another, it's computationally tractable to implement this.
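Here's a minimal sketch of that loop for the standard Bernoulli-payout case with Beta posteriors (a textbook setup, not anything specific to Garrabrant; the payout probabilities are made up):

```python
# Minimal Thompson sampling for a 3-armed Bernoulli bandit with Beta posteriors.
import random

true_payout_probs = [0.3, 0.5, 0.7]   # unknown to the agent
wins = [1, 1, 1]                      # Beta(1, 1) uniform prior for each arm
losses = [1, 1, 1]

for step in range(10_000):
    # 1-2. Sample a plausible payout probability for each arm from its posterior.
    samples = [random.betavariate(wins[i], losses[i]) for i in range(3)]
    # 3. Pull the arm whose sample is highest.
    arm = max(range(3), key=lambda i: samples[i])
    reward = 1 if random.random() < true_payout_probs[arm] else 0
    # 4. Update that arm's posterior with the observed outcome.
    wins[arm] += reward
    losses[arm] += 1 - reward

print(wins, losses)  # most of the pulls should have gone to the best arm (index 2)
```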
This description of Thompson Sampling doesn't look much like the way a GR agent works. Let's see how this is actually a special case of Geometric Rationality. We'll make a GR agent that is also used in an n-armed bandit problem. I'll follow a slightly different path than Garrabrant does. My presentation will be more geared around regenerating standard Thompson sampling, whereas Garrabrant takes a more philosophical and generic approach.
Model the bandit world as an array of probability distributions. Each element of the array represents an individual bandit arm. The probability distributions in each element of the array are continuous, and represent possible payouts from pulling the specified arm.
The GR agent we will construct will act as though each of these arms has a sub-agent advocating for it. The agent will Nash bargain among the sub-agents for each arm. In order to Nash bargain among these subagents, we need to know each agent's scoring function, BATNA, and voting power.
Scoring Function: Instead of valuing dollars, the agents value influence. They want their specific action to be taken. Formally, sub-agent $n$'s score for a mixed strategy $\pi$ is just $\pi_n$, the probability that its arm gets pulled.
BATNA: Since exactly one action must be taken by the GR agent, the BATNA for the sub-agents is 0
Voting Power: We would like to assign more voting power to arms that are expected to produce higher results. Given our array of probability distributions, we can actually calculate the probability that a specific arm will produce the highest result. It's given by the function

$$v_n = \int_{-\infty}^{\infty} f_n(x) \prod_{m \neq n} F_m(x)\, dx$$

where $f_n$ is the payout density of arm $n$ and $F_m(x)$ is the probability that arm $m$ pays out less than $x$.
In words, we're summing over all possible outcomes that arm n could give. We weight the sum by the probability of that value. We also multiply it by the probability that all other arms give a lower value.
While this might be difficult to compute in practice, it's conceptually straightforward. We will use these values as the voting power for each sub-agent.
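As a sketch of how you might actually compute it, the integral can be approximated by Monte Carlo: draw one payout from every arm's distribution and count how often each arm's draw is the largest. The Gaussian payout models here are an assumption made just for illustration.

```python
# Monte Carlo estimate of each arm's voting power: the probability that the
# arm's payout draw beats every other arm's draw.
import numpy as np

rng = np.random.default_rng(0)
arm_means = np.array([1.0, 1.5, 2.0])   # illustrative Gaussian payout models
arm_stds = np.array([1.0, 1.0, 1.0])

draws = rng.normal(arm_means, arm_stds, size=(100_000, 3))   # one draw per arm per trial
winners = draws.argmax(axis=1)
voting_power = np.bincount(winners, minlength=3) / len(draws)
print(voting_power)   # sums to 1; the arm with the highest mean gets the most power
```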
Now let's see how our GR agent would decide on an action.
$$\pi^* = \operatorname*{argmax}_{\pi \in \Delta(A)} \prod_{n} \pi_n^{\,v_n}$$

where $\pi_n$ is the probability the mixed strategy assigns to pulling arm $n$ (each sub-agent's score, with a BATNA of 0) and $v_n$ is that arm's voting power from above.
Once an action distribution is selected, an action is drawn from that distribution. The specified Bandit is activated, and each agent updates its world model via Bayes' rule. The GR agent itself updates the voting power for each sub-agent based on their outcome predictions. Then the whole process repeats.
Here we're implicitly using Garrabrant's Arithmetic/Geometric boundary. The arithmetic maximization is implicit in the problem setup, where each sub-agent explicitly prefers for the bandit it represents to be chosen. The geometric maximization is given by the Nash bargaining.
Let's find out what the argmax of the above equation actually is. We can take logs and formulate our problem as a constrained maximization problem. We want to maximize $\sum_n v_n \log \pi_n$ subject to $\sum_n \pi_n = 1$. We'll again use Lagrange multipliers to solve this.
Taking the partial derivative with respect to each element $\pi_n$ and setting it to 0 gives us $\frac{v_n}{\pi_n} = \lambda$ for each $n$. Since we know $\sum_n \pi_n = 1$, we can say that $\sum_n \frac{v_n}{\lambda} = 1$, or $\lambda = \sum_n v_n$. Since our voting power is just a probability distribution, this means that $\lambda = 1$. Plugging that in to our partial derivative, we see that $\pi_n = v_n$. We are assigning the same weight to each action as the probability that this action is selected by our collection of agents. This shows that GR agents of this form reproduce Thompson sampling.
Remember how we assigned voting power for the subagents? We set voting power to be equal to the probability that the subagent would produce the highest outcome. That voting power directly becomes the probability of choosing that action. In the limit that one arm is guaranteed to provide a higher value than any other arm, its voting power will go to 1. That means our GR agent would always choose that guaranteed best action (as we would want).
Finally we get to the reason I was interested in geometric rationality in the first place: it recommends Kelly betting from a linear scoring function. A lot of recommendations for Kelly betting claim that it's a good idea if your value is logarithmic in wealth, but to be honest I don't find that compelling at all (should I not Kelly bet if the gains from betting are measured in something other than dollars?).
Garrabrant's justifications for using GR in betting boil down to various forms of internal fairness. Perhaps we think of each predicted outcome as owning its own percentage of the agent's total wealth pool. Given this, Garrabrant says it makes sense to Nash bargain between these predicted-outcomes rather than allowing one to seize all of the agent's internally-communal resources.
It's well known that Kelly betting is equivalent to maximizing log-wealth instead of wealth. With the definition of GR, it would be trivial to show that simply maximizing the geometric expectation of wealth is equivalent to Kelly betting. Instead, what I want to show here is that Kelly betting is equivalent to Thompson sampling when the action-space is continuous and sub-agents can trade to minimize transaction costs.
Normally, Thompson sampling is performed for n-armed bandit problems where only one solution can be picked at a time. If the problem constraints are changed and an agent is allowed to bet on possible outcomes instead of selecting specific actions to take, the outcome will be Kelly betting.
To demonstrate that the above formulation of Thompson sampling is equivalent to Kelly betting, we're going to set up a GR agent that's playing a betting game on coin tosses. Following Garrabrant, we'll assume bets can be made at even odds for either heads or tails. We'll also assume that the coin may not be fair.
We can consider this to be similar to a 2-armed bandit problem. One of the arms is "bet heads" and the other is "bet tails". To match up with the Thompson sampling paradigm above, we'll create a 2 element array of probability distributions. The distribution for the "bet heads" arm puts probability $p_H$ on winning the bet and probability $1 - p_H$ on winning nothing, where $p_H$ is the agent's predicted probability of heads. The two distributions we'll use will be mirrors of each other.
Where we diverge from the standard Thompson sampling formulation is in actions. Instead of having to select one arm and bet everything on it (as in standard Thompson sampling), the GR agent is allowed to bet any amount of money that it has on one or both outcomes of the toss.
Like Thompson sampling, we will start with a set of 2 EUM agents. One EUM has a world model that focuses only on heads, the other only on tails. For the heads agent, its world model predicts probability $p_H$ of winning and probability $1 - p_H$ of gaining nothing. The tails agent is the mirror of this.
The value functions and BATNAs of the subagents match the Thompson sampling value functions.
The voting power is very easy to calculate for this kind of problem. The heads sub-agent has voting power of $p_H$ and tails has voting power of $p_T = 1 - p_H$. We don't even need to do any integrals.
If we follow the Thompson sampling mathematics, we see that the selected action distribution is given by $\pi = (p_H, p_T)$. The probability of choosing the action to bet on heads is equal to the probability that heads wins (as predicted by the agent's world model).
If the GR agent were doing normal Thompson sampling, it would sample from that distribution and use the sample as its action. This problem is different though, because it isn't limited to going all-in on one action. It can instead distribute all its wealth proportionally to the action distribution and do everything "in parallel".
Let's consider a specific example to make this concrete. The GR agent as a whole predicts a 60% chance for heads and a 40% chance for tails, and it starts with a wealth of $100. Assume the bets are both 1:1 odds. Given this, betting $1 on heads and $1 on tails is the same as keeping the money. No matter what happens, the agent ends up at net-zero.
It can waste less time chatting with the bookie by balancing these "wasted bets" internally. For this example, distributing its wealth across bets in proportion to the Thompson sampling suggestion would mean betting $60 on heads and $40 on tails. Instead, it can cancel out all of the overlapping bets and simply bet $20 on heads while keeping the rest of its money in reserve.
Now that we've seen a concrete example, let's make it more abstract. Instead of assuming we can make both bets with 1:1 odds, assume that each bet has distinct odds. These odds are the payout offered by the bet, and are often represented by the variable $b$. With odds of $b$, you would receive $b$ times your bet amount in winnings on top of your returned stake if you won (so at even odds, $b = 1$, if you bet $1 you would get back $2). Odds are normally set according to the probability of an outcome, so each side of a fair coin would have odds of 1, and for a coin that came up heads 60% of the time the odds on heads would be 2/3. We also have the relation $b_H\, b_T = 1$ between the odds on the two sides of a consistently priced coin.
Assume without loss of generality that the coin has a higher chance of landing on tails (just switch labels if it's actually heads). Thompson sampling suggests we bet a fraction $p_H$ of our wealth on heads and $p_T$ on tails, and we know that $p_H$ is less than $p_T$. To internally balance our bets, we need to avoid betting on heads and also remove some amount from our tails bet to make up for the heads bet not going through.
If we had bet a fraction $p_H$ of our wealth on a heads outcome at odds $b_H$, that would have resulted in a gain of $p_H\, b_H$ on a win. We'll let the tails-betting sub-agent pay the heads-betting sub-agent to not bet, removing that fraction from the tails bet to compensate the heads bet for abstaining. In order to make the heads agent not bet, the tails agent has to pay as much as the heads agent would have gotten if it bet and won.
Now let $p = p_T$, $q = p_H = 1 - p$, and $b = b_T$ (so $b_H = 1/b$). Then we want to bet 0 on heads, and on tails we want to bet $p - q\,b_H = p - \frac{q}{b} = \frac{bp - q}{b}$. This is the Kelly betting edge/odds recommendation.
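As a sanity check, here's the even-odds coin game from the concrete example ($p = 0.6$, $b = 1$): the internally-cancelled bet matches the edge/odds formula, and both match the fraction that maximizes expected log wealth, which is the usual characterization of Kelly.

```python
# Sanity check: for a 60/40 coin at even odds, the "cancel the overlapping bets"
# fraction matches Kelly's edge/odds formula and maximizes expected log wealth.
import numpy as np

p, q, b = 0.6, 0.4, 1.0            # P(tails), P(heads), net odds on tails
cancelled_bet = p - q * (1.0 / b)  # tails bet minus the heads agent's foregone win (b_H = 1/b)
kelly_bet = (b * p - q) / b        # edge / odds

fractions = np.linspace(0.0, 0.99, 1000)
log_growth = p * np.log(1 + b * fractions) + q * np.log(1 - fractions)
print(cancelled_bet, kelly_bet, fractions[np.argmax(log_growth)])  # all ~0.2
```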
I like the ideas of Geometric Rationality, and I think they have promise for shedding light on optimal agent design.
Expected Utility Maximization is conceptually simple. Update your world model with Bayes rule, pick the action that gives you the best outcome, go all in on it. There are only a few variables to determine, and it's computationally tractable in many cases. It also recommends actions that I find intuitively strange (such as not betting Kelly) and it doesn't handle ambiguity in its world model well.
Geometric Rationality is more complicated right now, but it shows some promise for ultimately being simpler. In particular, we may not need to assume Bayes rule, as we can get it for free by Nash bargaining over predictive power. We do currently need to do more complex work on scoring functions. With EUM the scoring function is often very obvious from a given problem statement. With GR it sometimes isn't. Consider Bayes rule, Thompson sampling, and Kelly betting. All of these use some kind of "representation" in the outcome as the scoring function, which is kind of weird. The simplex over actions gives more flexibility, but also makes computing answers much more expensive.
I want to consider how a GR perspective might recommend changing voting methods or reorganizing institutions. How can we make things more fair for people without moving farther from the Pareto frontier? This theory still seems likely to be fruitful to me, though I also think there are some major questions to resolve.