*Based on discussions with Stuart Armstrong and Daniel Kokotajlo.*

There are two conflicting ways of thinking about foundational rationality arguments such as the VNM theorem.

- As direct arguments for normative principles. The axioms are supposed to be premises which you'd actually accept. The axioms imply theories of rationality such as probability theory and utility theory. These are supposed to apply in practice: if you accept the axioms, then you should be following them.
- As idealized models. Eliezer compares Bayesian reasoners to a Carnot engine: an idealized, thermodynamically perfect engine which can never be built. To the extent that any real engine works, it approximates a Carnot engine. To the extent that any cognition really works, it approximates Bayes. Bayes sets the bounds for what is possible.

The second way of thinking is very useful. Philosophers, economists, and others have made some real progress thinking in this way. However, I'm going to argue that we should push for the first sort of normative principle. We should not be satisfied with normative principles which remain as unachievable ideals, giving upper bounds on performance without directly helping us get there.

This implies dealing with problems of bounded rationality. But it's not the sort of "bounded rationality" where we set out to explicitly model irrationality. We don't want to talk about partial rationality; we want notions of rationality which bounded agents can *fully* satisfy.

# Approximating Rationality

In order to apply an idealized rationality, such as Bayesian superintelligence, we need to have a concept of *what it means to approximate it*. This is more subtle than it may seem. You can't necessarily try to minimize some notion of distance between your behavior and the ideal behavior. For one thing, you can't compute the ideal behavior to find the distance! But, for another thing, simple imitation of the ideal behavior can go wrong. Adopting one part of an optimal policy without adopting all the other parts might put you in a *much worse* position than the one you started in.

Wei Dai discusses the problem in a post about Hanson's pre-rationality concept:

> [...] This is somewhat similar to the question of how do we move from our current non-rational (according to ordinary rationality) state to a rational one. Expected utility theory says that we should act as if we are maximizing expected utility, but it doesn't say what we should do if we find ourselves lacking a prior and a utility function (i.e., if our actual preferences cannot be represented as maximizing expected utility).
>
> The fact that we don't have good answers for these questions perhaps shouldn't be considered fatal to [...] rationality, but it's troubling that little attention has been paid to them, relative to defining [...] rationality. (Why are rationality researchers more interested in knowing what rationality is, and less interested in knowing how to be rational? Also, BTW, why are there so few rationality researchers? Why aren't there hordes of people interested in these issues?)

Clearly, we have *some* idea of which moves toward rationality are correct vs incorrect. Think about the concept of cargo-culting: pointless and ineffective imitation of a more capable agent. The problem is the absence of a formal theory.

## Examples

One *possible* way of framing the problem: the VNM axioms, the Kolmogorov probability axioms, and/or other rationality frameworks give us a **notion of consistency**. We can check our behaviors and opinions for inconsistency. But what do we do when we *notice* an inconsistency? Which parts are we supposed to change?

Here are some cases where there is at least a *tendency* to update in a particular direction:

- Suppose we value an event $E$ at 4.2 expected utils. We then unpack $E$ into two mutually exclusive sub-events, $E_1$ and $E_2$. We notice that we value $E_1$ at 1.1 utils and $E_2$ at 3.4 utils. This is inconsistent with the evaluation of $E$. We usually trust the evaluation of $E$ less than the unpacked version, and would reset the evaluation of $E$ to the value implied by the evaluations of $E_1$ and $E_2$.
- Suppose we notice that we're doing things in a way that's not optimal for our goals. That is, we notice some new way of doing things which is better for what we believe our goals to be. We will tend to change our behavior rather than change our beliefs about what our goals are. (Obviously this is not always the case, however.)
- Similarly, suppose we notice that we are acting in a way which is inconsistent with our beliefs. There is a tendency to correct the action rather than the belief. (Again, not as surely as my first example, though.)
- If we find that a belief was subject to base-rate neglect, there is a tendency to multiply by base-rates and renormalize, rather than adjust our beliefs about base rates to make them consistent.
- If we notice that X and Y are equivalent, but we had different beliefs about X and Y, then we tend to pool information from X and Y such that, for example, if we had a very sharp distribution about X and a very uninformative distribution about Y, the sharp distribution would win.
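
To make the last bullet concrete, here is a minimal sketch of one way the "sharp distribution wins" tendency could be modeled. It assumes both beliefs are Gaussian and treats them as independent estimates of the same quantity, combined by precision weighting. The `Gaussian` class, the `pool` function, and the numbers are all hypothetical, and this is only an illustration of the tendency, not a theory of how inconsistencies should be repaired.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: float
    var: float


def pool(x: Gaussian, y: Gaussian) -> Gaussian:
    """Precision-weighted pooling of two Gaussian beliefs (an assumed model)."""
    precision_x = 1.0 / x.var
    precision_y = 1.0 / y.var
    var = 1.0 / (precision_x + precision_y)
    mean = var * (precision_x * x.mean + precision_y * y.mean)
    return Gaussian(mean, var)


# Hypothetical beliefs about X and Y, which we have just noticed are equivalent.
sharp = Gaussian(mean=2.0, var=0.01)    # very confident belief about X
vague = Gaussian(mean=10.0, var=100.0)  # nearly uninformative belief about Y

pooled = pool(sharp, vague)
print(pooled)  # mean stays very close to 2.0; variance shrinks slightly
```

Under these assumptions the pooled belief sits almost exactly where the sharp belief was, which matches the tendency described in the bullet.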

If you're like me, you might have read some of those and immediately thought of a Bayesian model of the inference going on. Keep in mind that this is *supposed* to be about noticing *actual inconsistencies*, and what we want is a model which deals directly with that. It might turn out to be a kind of meta-Bayesian model, where we approximate a Bayesian superintelligence by way of a much more bounded Bayesian view which attempts to reason about what a truly consistent view would look like. But don't fool yourself into thinking a standard one-level Bayesian picture is sufficient, just because you can look at some of the bullet points and imagine a Bayesian way to handle it.

It would be quite interesting to have a general "theory of becoming rational" which had something to say about how we make decisions in cases such as I've listed.

## Logical Uncertainty

Obviously, I'm pointing in the general direction of logical uncertainty and bounded notions of rationality (IE notions of rationality which can apply to bounded agents). Particularly in the "noticing inconsistencies" framing, it sounds like this might *entirely* reduce to logical uncertainty. But I want to point at the broader problem, because (1) an example of this might not immediately look like a problem of logical uncertainty; (2) a theory of logical uncertainty, such as logical induction, might not entirely solve this problem; (3) logical uncertainty is an epistemic issue, whereas this problem applies to instrumental rationality as well; (4) even setting all that aside, it's worth pointing at the distinction between ideal notions of rationality and applicable notions of rationality as a point in itself.

# The Ideal Fades into the Background

So far, it sounds like my suggestion is that we should keep our idealized notions of rationality, but also develop "theories of approximation" which tell us what it means to approach the ideals in a good way vs a bad way. However, I want to point out an interesting phenomenon: sometimes, when you get a really good notion of "approximation", the idealized notion of rationality you started with fades into the background.

## Example 1: Logical Induction

Start with the Demski Prior, which was supposed to be an idealized notion of rational belief much like the Solomonoff prior, but built for logic rather than computation. I designed the prior with approximability in mind, because I thought it should be a constraint on a normative theory that we actually be able to approximate the ideal. Scott and Benya modified the Demski prior to make it nicer, and noticed that when you do so, the approximation itself has a desirable property. The line of research called asymptotic logical uncertainty focused on such "good properties of approximations", eventually leading to logical induction.

A logical inductor is a sequence of improving belief assignments. The beliefs do converge to a probability distribution, which will have some resemblance to the modified Demski prior (and to Solomonoff's prior). However, the concept of logical induction gives a much richer theory of rationality, in which this limit plays a minor role. Furthermore, the theory of logical induction comes much closer to applying to realistic agents than "rational agents approximate Bayesian reasoning with [some prior]".

## Example 2: Game-Theoretic Equilibria vs MAL

Game-theoretic equilibrium concepts, such as Nash equilibrium and correlated equilibrium, provide a rationality concept for games: rational agents who know that each other are rational are supposed to be in equilibrium with each other. However, most games have multiple Nash equilibria, and even more correlated equilibria. How is a rational agent supposed to decide which of these to play? Assuming only the rationality of the other players is not enough to choose one equilibrium over another. If rational agents play an equilibrium, how do they get there?
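
As a small illustration of the selection problem, the sketch below enumerates the pure-strategy Nash equilibria of a two-player coordination game. The payoff matrices and the `pure_nash` helper are hypothetical, chosen only so that the game has more than one equilibrium.

```python
from itertools import product

# Hypothetical 2x2 coordination game: both players prefer to match,
# and matching on action 0 pays more than matching on action 1.
ROW_PAYOFF = [[2, 0],
              [0, 1]]
COL_PAYOFF = [[2, 0],
              [0, 1]]


def pure_nash(row_payoff, col_payoff):
    """Return all (row, col) action pairs where neither player gains by deviating."""
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])
    equilibria = []
    for r, c in product(range(n_rows), range(n_cols)):
        row_is_best = all(row_payoff[r][c] >= row_payoff[alt][c] for alt in range(n_rows))
        col_is_best = all(col_payoff[r][c] >= col_payoff[r][alt] for alt in range(n_cols))
        if row_is_best and col_is_best:
            equilibria.append((r, c))
    return equilibria


print(pure_nash(ROW_PAYOFF, COL_PAYOFF))  # [(0, 0), (1, 1)]
```

Both matching outcomes are equilibria (and this game also has a mixed equilibrium, which the sketch does not search for), so knowing only that the other player is rational does not say which one to play.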

One approach to this conundrum has been to introduce refined equilibrium concepts, which admit some Nash equilibria and not others. Trembling Hand equilibrium is one such concept. This introduces a notion of "stable" equilibria, pointing out that it is implausible that agents play "unstable" equilibria. However, while this narrows things down to a single equilibrium solution in some cases, it does not do so for all cases. Other refined equilibrium concepts may leave no equilibria for some games. To get rid of the problem, one would need an equilibrium concept which (a) leaves one and only one equilibrium for every game, and (b) follows from plausible rationality assumptions. Such things have been proposed, most prominently by Harsanyi & Selten in *A General Theory of Equilibrium Selection in Games*, but so far I find them unconvincing.

A very different approach is represented by multi-agent learning (MAL), which asks the question: can agents learn to play equilibrium strategies? In this version, agents must interact over time in order to converge to equilibrium play. (Or at least, agents simulate dumber versions of each other in an effort to figure out how to play.)

It turns out that, in MAL, there are somewhat nicer stories about how agents converge to *correlated* equilibria than there are about converging to Nash equilibria. For example, *Calibrated Learning and Correlated Equilibrium* (Foster & Vohra) shows that agents with a calibrated learning property will converge to correlated equilibrium in repeated play.
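
To make the target of that convergence concrete, here is a minimal sketch that checks the incentive constraints defining a correlated equilibrium in a two-player game. It does not implement calibrated learning itself. The function name, the Chicken payoffs, and the example distributions are assumptions chosen for illustration.

```python
from itertools import product


def is_correlated_equilibrium(joint, u_row, u_col, tol=1e-9):
    """joint[r][c] is the probability that the pair (r, c) is recommended.

    The distribution is a correlated equilibrium if no player can gain, in
    expectation, by switching from a recommended action to another one.
    """
    n_rows, n_cols = len(joint), len(joint[0])
    # Row player: compare following recommendation r with deviating to d.
    for r, d in product(range(n_rows), repeat=2):
        gain = sum(joint[r][c] * (u_row[d][c] - u_row[r][c]) for c in range(n_cols))
        if gain > tol:
            return False
    # Column player: compare following recommendation c with deviating to d.
    for c, d in product(range(n_cols), repeat=2):
        gain = sum(joint[r][c] * (u_col[r][d] - u_col[r][c]) for r in range(n_rows))
        if gain > tol:
            return False
    return True


# Hypothetical game of Chicken: action 0 = dare, action 1 = swerve.
U_ROW = [[0, 7],
         [2, 6]]
U_COL = [[0, 2],
         [7, 6]]

# Equal weight on every outcome except (dare, dare).
mixed_signal = [[0.0, 1 / 3],
                [1 / 3, 1 / 3]]
always_crash = [[1.0, 0.0],
                [0.0, 0.0]]

print(is_correlated_equilibrium(mixed_signal, U_ROW, U_COL))  # True
print(is_correlated_equilibrium(always_crash, U_ROW, U_COL))  # False
```

The first distribution is not a product of independent strategies, so it is a correlated equilibrium that is not a Nash equilibrium.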

These new rationality principles, which come from MAL, are then much more relevant to the design and implementation of game-playing agents than the equilibrium concepts which they support. Equilibrium concepts, such as correlated equilibria, tell you something about what agents converge to in the limit; the learning principles which let them accomplish that, however, tell you about the *dynamics* -- what agents do at finite times, in response to non-equilibrium situations. This is more relevant to agents "on the ground", as it were.

And, to the extent that requirements like calibrated learning are NOT computationally feasible, this *weakens our trust in equilibrium concepts as a rationality notion* -- if there isn't a plausible story about how (bounded-) rational agents can get into equilibrium, why should we think of equilibrium as rational?

So, we see that the bounded, dynamic notions of rationality are more fundamental than the unbounded, fixed-point style equilibrium concepts: if we want to deal with realistic agents, we should be more willing to adjust/abandon our equilibrium concepts in response to how nice the MAL story is, than vice versa.

## Counterexample: Complete Class Theorems

This doesn't always happen. The complete class theorems give a picture of rationality in which we *start* with the ability and willingness to take Pareto-improvements. Given this, we *end up* with an agent being classically rational: having a probability distribution, and choosing actions which maximize expected utility.

Given this argument, we become more confident in the usefulness of probability distributions. But why should this be the conclusion? A different way of looking at the argument could be: we don't need to think about probability distributions. All we need to think about is Pareto improvements.

Somehow, probability still seems very useful to think about. We don't switch to the "dynamic" view of agents who haven't yet constructed probabilistic beliefs, taking Pareto improvements on their way to reflective consistency. This just doesn't seem like a realistic view of bounded agents. **Yes,** bounded agents are still engaged in a search for the best policy, which may involve finding new strategies which are strictly better along every relevant dimension. But bounded agency **also** involves making trade-offs, when no Pareto improvement can be found. This necessitates thinking of probabilities. So it doesn't seem like we want to erase that from our picture of practical agency.
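
A toy sketch of this point, with made-up options and utilities: Pareto reasoning can discard a dominated option, but choosing among the remaining ones requires weighing the states against each other. The `OPTIONS` table and the helper names are hypothetical.

```python
# Hypothetical options, each with a utility in (state 0, state 1).
OPTIONS = {
    "A": (3.0, 1.0),
    "B": (1.0, 3.0),
    "C": (1.0, 1.0),
    "D": (2.0, 2.0),
}


def dominates(u, v):
    """True if u is at least as good as v in every state and better in some."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))


def admissible(options):
    """Keep only the options that no other option Pareto-dominates."""
    return {
        name: u
        for name, u in options.items()
        if not any(dominates(v, u) for v in options.values())
    }


def best_for(options, p_state0):
    """Pick the option with the highest expected utility given P(state 0)."""
    def expected_utility(name):
        u0, u1 = options[name]
        return p_state0 * u0 + (1 - p_state0) * u1
    return max(options, key=expected_utility)


survivors = admissible(OPTIONS)
print(sorted(survivors))         # ['A', 'B', 'D']
print(best_for(survivors, 0.8))  # A
print(best_for(survivors, 0.2))  # B
```

Pareto improvement removes only `C`. Which of the survivors to pick depends on how likely each state is, which is where probabilities come back in.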

Perhaps this is because, in some sense, the complete class theorems are not very good -- they don't really end up explaining a less basic thing in terms of a more basic thing. After all, when can you realistically find a pure Pareto improvement?

# Conclusion

I've suggested that we move toward notions of rationality that are fundamentally bounded (applying to agents who lack the resources to be rational in more classical senses) and dynamic (fundamentally involving learning, rather than assuming the agent already has a good picture of the world; breaking down equilibrium concepts such as those in game theory, and instead looking for the dynamics which can converge to equilibrium).

This gives us a picture of "rationality" which is more like "optimality" in computer science: in computer science, it's more typical to come up with a notion of optimality which *actually applies* to some algorithms. For example, "optimal sorting algorithm" usually refers to big-O optimality, and many sorting algorithms are optimal in that sense. Similarly, in machine learning, regret bounds are mainly interesting when they are achievable by some algorithm. (Although, it could be interesting to know a lower bound on achievable regret guarantees.)
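
As a rough illustration of an optimality notion that real algorithms actually meet, the sketch below counts the comparisons a plain top-down merge sort makes in its worst case over all inputs of a small size, and compares that with the information-theoretic lower bound of ceil(log2(n!)) comparisons. The function names and the choice of n = 6 are assumptions made for the example.

```python
import math
from itertools import permutations


def merge_sort(items, counter):
    """Sort items, adding the number of element comparisons to counter[0]."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], counter)
    right = merge_sort(items[mid:], counter)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def worst_case_comparisons(n):
    """Maximum comparison count over every permutation of n distinct items."""
    worst = 0
    for perm in permutations(range(n)):
        counter = [0]
        assert merge_sort(perm, counter) == list(range(n))
        worst = max(worst, counter[0])
    return worst


n = 6
lower_bound = math.ceil(math.log2(math.factorial(n)))
worst = worst_case_comparisons(n)
print(f"lower bound: {lower_bound}, merge sort worst case: {worst}")
```

No comparison sort can beat the lower bound in the worst case, and merge sort lands within a small margin of it, so the bound describes something an actual algorithm nearly attains rather than an unreachable ideal.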

Why should notions of rationality be so far from notions of optimality? Can we take a more computer-science flavored approach to rationality?

Barring that, it should at least be of critical importance to investigate in what sense idealized notions of rationality are normative principles for bounded agents like us. What constitutes cargo-culting rationality, vs really becoming more rational? What kind of adjustments should an irrational agent make when irrationality is noticed?

The thread here (and I mean this as summary, not as insight) appears to be the following approach.

Consider how actors lacking some previously-assumed perfection can approach that perfection in some limit (asymptotic performance / equilibrium / ...). A big reason to care about such limit properties is to undergird arguments about performance in the real world. For example, the big O performance of an algorithm is used (with caveats) for anticipating performance on large amounts of real-world data.

Sometimes, when we're doing conceptual cleanup to be able to make limit arguments, we end up with formalisms that directly give us interesting properties in the intermediate stage. We may be able to throw away the arguments from limit behavior, and thus stop caring much about the limit or the formalisms we approximate there. This is the sense in which "the ideal fades into the background".

Yep, that's a good way to explain it!

I feel like the logical inductor analogy still has more gas in the tank. Can we further limit the computational power and ask about the finite-time properties of some system that tries to correct its own computationally-tractable systematic errors? I feel like there's some property of "not fooling yourself" that this should help with.

This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.

There are a few black boxes in the theory. The first takes *you* and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, or maybe it's any finite state machine with less than 10^20 states, maybe it's Python programs less than 5000 characters long: some limited set of programs that takes your sensory data and motor output history as input, and returns a motor output. The limitations could be whatever; they don't have to be like this.

Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs are then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.
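
A toy version of this proposal can be sketched as follows. Everything here is hypothetical: the bounded policy space is just every lookup table from one observation to one action, and the prior and utility function are made-up stand-ins for the two black boxes. The "ideal agent" is simulated by exhaustive expected-utility calculation, which is only feasible because the space is tiny.

```python
from itertools import product

OBSERVATIONS = ("cloudy", "clear")
ACTIONS = ("umbrella", "no_umbrella")

# Hypothetical "right prior" over (weather, observation) pairs.
PRIOR = {
    ("rain", "cloudy"): 0.3,
    ("rain", "clear"): 0.1,
    ("dry", "cloudy"): 0.2,
    ("dry", "clear"): 0.4,
}


def utility(weather, action):
    """Hypothetical stand-in for the agent's true utility function."""
    if weather == "rain":
        return 1.0 if action == "umbrella" else -2.0
    return 0.0 if action == "umbrella" else 1.0


def expected_utility(policy):
    """Score a bounded policy (a dict from observation to action)."""
    return sum(p * utility(w, policy[obs]) for (w, obs), p in PRIOR.items())


def rank_policies():
    """Rank every policy in the bounded space, best first."""
    policies = [
        dict(zip(OBSERVATIONS, actions))
        for actions in product(ACTIONS, repeat=len(OBSERVATIONS))
    ]
    return sorted(policies, key=expected_utility, reverse=True)


ranking = rank_policies()
for policy in ranking:
    print(round(expected_utility(policy), 3), policy)
```

Under these assumptions, moving from a policy lower in the ranking to one higher up counts as an improvement in the sense described above.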

The main thing I want to point out is that *this is an idealized notion of non-idealized decision theory* -- in other words, it's still pretty useless to me as a bounded agent, without some advice about how to approximate it. I can't very well turn into this max-expected-value bounded policy.

But there are other barriers, too. Figuring out what utility function I endorse is a hard problem. And we face challenges of embedded decision theory; how do we reason about the counterfactuals of changing our policy to the better one?

Modulo those concerns, I do think your description is roughly right, and carries *some* important information about what it means to self-modify in a justified way rather than cargo-culting.