EfficientZero: How It Works

1a3orn

The goal of this essay is to help you understand EfficientZero, a reinforcement learning agent that obtains better-than-human median performance on a set of 26 Atari games after just two hours of real-time experience playing each game.

Specifically, it gets 116% of human median performance on the data-limited Atari 100k benchmark. The previously-best algorithm only reached about 41% of median human performance, so this is a reasonably large leap.

Chart stolen from paper

Chart of performance

The benchmark is called 100k because agents only interact with the environment for 100,000 steps -- about two hours. Note also that the human benchmarks were also set after the humans in question had about two hours of experience on the game. So EfficientZero seems to -- at least on this set of games -- exceed humans in sample efficiency specifically.

This is particularly impressive when you recall that, going into this, the agents were entirely ignorant of anything whatsoever about the world. The networks of their brains were initialized with random values. Humans manage to do pretty well on Atari after two hours, but we've pretrained in the actual world for many years, which lets us apply analogies from this world, experience from other video games, and so on. This agent managed to do comparably well with none of these advantages.

(Granted, the 100k benchmark focuses on Atari environments which are relatively easy to make progress in, because it was meant to be used for sample-efficiency benchmarks. It excludes extremely-difficult-to-explore environments like Montezuma's Revenge, where the first reward is quite hard to get.)

So. After reading this, you should understand how EfficientZero works, and how the changes in EfficientZero improve upon its predecessors. You should also know what further improvements to it are likely in the coming 3 to 24 months.

My target reader is someone who can program, or is at least broadly technically literate. They should be reasonably familiar with the basic principles of supervised machine learning, but they do not need to be particularly up-to-date with reinforcement learning. Reading this will nevertheless be heavy going if you aren't at least a little familiar with reinforcement learning; expect to have to reread some stuff.

I have aggressively and in some cases silently cut content that is not necessary for understanding the contributions of EfficientZero. For instance, in my explanation of how EfficientZero's predecessor MuZero works, I have ignored major elements of MuZero that EfficientZero does not modify. I have also only explained the version of MuZero ("MuZero Reanalyse") that EfficientZero builds upon. And finally in some places I'm confused about why techniques work as well as they do; I've noted them rather than try to smooth them over.

I'll divide this into five sections.

Reinforcement Learning Primer: A quick refresher on standard terminology and methods in RL, just for context. You can skip this, if you're remotely familiar with RL. What is the kind of problem that RL solves?
Historical background: EfficientZero is a relatively natural advance in model-based RL, given prior advances in model-free RL and elsewhere. It's useful to look at the content of these other advances, in order to understand EfficientZero and the conditions of EfficientZero's existence.
MuZero: MuZero is the 2019 algorithm which extended the Go-playing algorithm AlphaGo to single-player, non-zero sum games like Atari. How does MuZero work?
EfficientZero: EfficientZero is basically three independent modifications to MuZero stacked on top of each other. What do these three independent changes do?
Conclusion. What kind of work is likely to follow EfficientZero? Does EfficientZero provide any evidence about "how hard" it is to improve reinforcement learning right now?

1: Reinforcement Learning Primer

Alright, a quick review of the basics. If you follow RL at all, you've seen all the information below so many times at the start of various papers that it's stuck painfully to your retina. So feel free to skip.

(If you want a longer version, Sutton and Barto's "Reinforcement Learning: An Introduction" is the standard, excellent introductory text.)

The basic abstraction of a reinforcement learning problem is as follows.

There is an agent and an environment. The agent and an environment interact over a series of episodes.

Each episode has some finite number of timesteps, counting from 0 to . These episodes each correspond to a thing like "a game of chess" or "a level of a video game" or "an attempt to put a product in a box in a warehouse" and so on. They're always considered to be causally isolated from each other, except for the fact that an agent can learn from one to the next.

At every timestep $t$ within such an episode, the environment emits some observation and reward to the agent, $o_{t}$ and $r_{t}$ . The observation is what the agent perceives; it's what the agent sees, hears, or feels; we usually represent it with a big matrix or tensor of numbers. The reward, on the other hand, is a single number that's either positive or negative, corresponding roughly to pleasure or pain.

(The observations usually need to be rolled up into a state $s_{t}$ that contains all the relevant information for acting. This is often produced by stacking the last 4 to 10 observations — frames of video in a video game, for instance — on top of each other and saying "This is probably enough." Sometimes, the state is instead produced by a sequence-to-fixed-length-vector neural network.)

In response to the observation / state, the agent emits some action, $a_{t}$ . The environment processes the action, and emits the next observation and reward, $o_{t + 1}$ and $r_{t + 1}$ . And so on until the episode ends.

Image from aforementioned Sutton and Barto; every RL article is contractually obligated to have some version of this image in it

Positive reward is good, obviously, and higher positive reward is better.

The agent doesn't want to maximize the immediate per-frame reward, though. If you chose an action which gets you 100 reward in the next frame, but 0 reward for each of a thousand frames afterwards, that's obviously inferior to an action which would get you 0 reward in the next frame, but 100 reward for each of a thousand frames afterwards. The marshmallow now might be inferior to the two marshmallows later. Maybe.

So instead the agent wants to maximize the sum of future expected reward, which is called the return, which I'll abbreviate $G$ because otherwise "return" and "reward" will get confused and also because everyone else does it. This is the total of $r_{t} + r_{t + 1} + r_{t + 2} + r_{t + 3} + r_{t + 4} . . .$ until the end of the episode.

(The rewards that sum to make the return $G$ are normally discounted according to how far in the future they are.

But in my first draft of this essay, the time-discount variable just cluttered up many subsequent formulas, which I am trying to keep simple; also, it provided no EfficientZero-specific insight. So I'll falsely pretend for the rest of the essay that EfficientZero uses an undiscounted reward, and present the math accordingly -- including in my definitions for state-value and action-value functions, which is honestly a pretty sketchy decision on my part.)

Ahem. The total return mentioned above, $r_{0} + r_{1} + r_{2} + r_{3} + . . . r_{T}$ can be more compactly defined using a sum notation.

$Return = G = T \sum t = 0 r_{t}$

Return, then, is what the RL agent wishes to maximize.

In order to do this, the agent must have a policy. The policy of an agent is just whatever makes decisions for it.

You can think of an agent’s policy as a mapping from a state to an action, or from a history of observations to an action, or from a state to an action distribution. It is usually represented by the symbol $π$ . The internals of the policy could be anything: it could include a learned model of the dynamics of the environment, it might use the state-value function defined below, it could be entirely random, and so on.

For all of what follows, I'll assume that the agent is taking discrete actions; i.e., it will choose one element out of a finite, predefined set of possible actions to perform. For instance, in a side-scrolling video game, the set might include go left, go right, jump, and attack.

Specifically, we'll represent the output of a policy as a distribution over actions. Thus, if an agent could do one of go left, go right, or be still, a policy might output [0.33, 0.33, 0.33] to indicate that it is ambivalent between the actions, or [0.9, 0.05, 0.05] to indicate it has a strong preference for one or another. Then when actually acting, the agent samples randomly from this policy distribution.

Because of this, you can think as a policy as a function taking a state and an action, and returning a probability -- $π (s_{t}, a_{t}) \to R$ -- or as a function taking a state and returning something like an array -- $π (s_{t}) = [0.2, 0.6, 0.2]$ .

Policies, of course, can be better or worse. The optimal policy is one for which from any state no other policy can lead to greater return, in expectation.

I introduce some more terms below. But these terms are somewhat less important. They are things that are useful for determining how to maximize return. But they're ultimately only useful if they help us maximize it. They are calculated on the way to maximizing reward by many current algorithms -- but a perfect intelligence coded by someone with a textbook from the future might not explicitly use them at all, because by then we might have discovered better abstractions.

Indeed, some ways of addressing the RL problem don't use them at all today, such as evolutionary methods or methods derived from supervised learning. For our purposes, however, they're relatively important.

Two of these things are a policy's state-value function and action-value function.

The state value function, $V_{π} (s_{t})$ , returns the expected total future return, given a particular policy, from a particular state.

Intuitively it's something like "given where I am, and given how I act, am I likely to be rewarded or punished?" The state-value function for a game of chess, given that you're an average player playing against an average player, is probably high if you're 8 points worth of pieces ahead; it's probably low if you're 8 points of pieces behind.

On the other hand, the state-value function only exists relative to a particular policy. If Magnus Carlsen is dropped into a game of chess against me where he starts 8 points behind, his state-value function for that game would still correctly return a high value, because he's just that much better at chess than me.

You can define it compactly thus, where E stands for expected value:

$V_{π} (s_{t}) = E [G | s_{t}, π] = E [T \sum i = t r_{i} | s_{t}, π]$ .

To repeat myself: It answers the question, given that you are following such and such a policy from such and such a state, how much reward do you expect?

The action-value function is a very similar concept. It returns the expected future return, given a particular policy, from a particular state and from a particular action.

Or, in short, it's exactly the same as the state-value function except instead of drawing the next action from the distribution of $π$ , the next action is already specified.

$Q_{π} (s_{t}, a_{t}) = E [G | s_{t}, a_{t}, π] = E [T \sum i = t r_{i} | s_{t}, a_{t}, π]$

For example: Suppose I am riding a bike. The action-value function, given the state "riding a bike" and the action "suddenly twisting the bike handlebars all the way to the left", probably returns a lower number than if it is given the state "riding a bike" and the action "riding like a normal human being," because twisting the handlebars is likely to make me crash and crashing the bike will cause me pain.

Note that, importantly, if we have the action-value function for the optimal policy, then we have everything we need to act perfectly.

From each state, we can just check each possible action against the action-value function $q (s_{t}, a_{t})$ , and choose the action that returns the highest value from the action-value function. Greedy search against the action-value function for the optimal policy is thus equivalent to the optimal policy. For this reason, many algorithms try to learn the action-value function for the optimal policy.

Knowing the state value function for the optimal policy does not automatically let you take the best action. Can you see why that is? Think about it for a moment. What else would you need?

In order to choose the best action from the state value function, you would need a model. A model is a function with pretty much the same type signature as the environment -- it takes a state $s_{k}$ and an action $a_{t}$ and returns the estimated future state $s_{k}$ and / or a reward $r_{t}$ . If you have a good model, then you can use the state value function to choose the best action, by running it over the predicted states following from different actions and choosing the action associated with the best state. A model is what lets you map a state to other states though actions, and then pick the action that generates a state with highest expected value.

(Like how in Chess, your gut feel about the goodness of the position of the board corresponds to a state-value function. You consider a few moves you could make, and maybe a few trees of moves several levels down the game tree, and then choose the move that seems to lead to the best state-value function.)

If a reinforcement learning agent uses a model it is model-based; if it doesn’t, it is model-free.

Model-based agents are attractive for several reasons. They let you plan from among several imagined futures, which seems like something an actually intelligent agent would be able to do. But in fact they're pretty hard to put together well.

One reason for this is that if you train a model in the naive way -- just predict some observation $o_{t + 1}$ given a history of observations -- most of what it learns is irrelevant for planning. Predicting every future pixel of a video feed on a self driving car -- including the clouds, the waving tree branches, the little argumentative marital drama going on in the car in front of you -- means that you're spending the vast majority of your model's capacity on things that are irrelevant for actually planning out the route of the self-driving car. You want some way to only predict parts of the world that matter, but it's hard to put that into math.

Ok. Given the above understanding of terms, we're now a little better equipped to understand the historical background for EfficientZero, and EfficientZero itself.

2: Historical Background

Here's a brief illustration of some prior results that influenced EfficientZero. (Note that this is by no means a complete summary.)

Relevant Papers

There are three basic points about what came before EfficientZero that I want to make here.

First, EfficientZero is mostly a continuation of the AlphaGo / AlphaZero / MuZero line of model-based learning research.

You almost certainly remember this line of research, because in 2016 AlphaGo handily defeated many-times-world-champion Lee Sedol at Go, which had until that date evaded computer dominance. (And AlphaGo was of course memorably enshrined as maaaaaybe smoke-under-the-door for AGI.)

AlphaGo was initially trained to imitate human Go games, however. Its initial policy was trained to predict the moves of expert human Go players, given a board history. Shortly after AlphaGo, however, DeepMind created an agent which was trained entirely off games it played against itself, without human data to imitate -- and it was therefore named AlphaZero. This algorithm also obtained superhuman performance at Chess and Shogi.

Both AlphaGo and AlphaZero, however, were limited in one particular and important way: they had to possess hard-coded models of the game.

As model-based reinforcement learning algorithms, AlphaGo and AlphaZero both had a predictive model of the environment. This model told them exactly what the next state of the environment would be conditional on some action. This model was itself hard-coded into them, and not learned. AlphaGo and AlphaZero used neural networks to determine which branches within the game tree constructed by this model were promising to explore. They learned, for instance, state-value functions to evaluate how likely they were to win from various hypothetical positions, by training on who eventually won in their self-play games. But what branches in the tree were actually possible was hard-coded by the game model.

Of course, it's very easy to code a predictive model of an environment for a deterministic, two player game of perfect information. But it does greatly limit the domain to which such an algorithm can be applied.

MuZero extended AlphaZero to domains where it does need to learn its model, and thereby extended the applicability of this kind of research. That is, MuZero starts entirely ignorant about what the dynamics of the world are. It learns to predict the next state of the world, conditional on some action, by encountering that world. (Or at least, it learns to predict aspects of the world; more on that later.) And it uses its model of the world to build the transitions over edges in its tree search.

MuZero was able to do well at Go, Chess, Shogi, and Atari; it did not lose its faculty for Go by gaining the ability to play Atari. (It in fact did better at Go than AlphaZero, which is honestly still kinda confusing for me.)

And EfficientZero is simply MuZero with three further modifications, which I will discuss later.

That's the "primary" research lineage for EfficientZero, if you had to choose one.

But it's not the only line of research that EfficientZero draws from. In the blue in the image above, you can see an important line of model-free research that precedes EfficientZero.

With the exception of Alpha* based research, model-free algorithms are generally the ones which have been used in big prestige projects like Starcraft or Dota2.

In model-free RL agents, the generalizing power of a neural network is used to make the agent do the-kind-of-actions-that-lead-to-more-rewards-in-similar-situations-in-the-past, rather than helping the agent choose actions that lead to the best outcomes in a series of imagined future trajectories.

For instance, one common way that model-free RL agents learn is by starting off by learning the aforementioned action-value function, $Q (s_{t}, a_{t})$ for the random policy. That is, the agent does random things; and the agent tracks locations in the world where particular random actions cause it pleasure or pain. Pressing your hand down on a glowing stove causes pain, for instance.

But by acting greedily with respect to this action-value function, then, and doing the kinds of action that cause pleasure and avoiding those that cause pain, such agents can improve their own policy. You can remember that pressing your hand down on a glowing stove causes pain, and not do that.

This in turn alters the true values for the action-value function, which the agent is learning, and so on and so forth, backwards and forwards, with pleasure and pain moving backward through the agent's anticipations; until eventually, in theory, the agent arrives at the optimal policy. This iteration between learning an action-value function for a bad policy and using it to improve the policy is called policy iteration, and is fundamental to a lot of reinforcement learning.

The specific 2021 model-free paper from which EfficientZero draws is "Data-Efficient Reinforcement Learning with Self-Predictive Representations" (henceforth SPR) which achieved the prior state-of-the-art data efficiency result on Atari 100k, while learning an action-value function.

How did SPR do this?

While learning the action-value function, pretty much all model-free algorithms have to learn a "representation" of the world as an intermediate step. This representation (hopefully) summarizes the relevant features of an agent's observations, to let the agent make decisions easily; when driving a car, your representation of the world should make "Is there a car driving towards me in the wrong lane" a readily-accessible fact. (Although one can hope that you didn't acquire the representation through model-free learning, which would be both hazardous and expensive.)

What SPR did was use feature-learning techniques from supervised learning to make the representation of the world learned by $Q (s_{t}, a_{t})$ also predictive of the representation of the world in subsequent time-steps. The way you think about the world now, should probably provide some hints about the way you'll think about the world tomorrow. As the paper says, "we start with the intuition that encouraging state representations to be predictive of future states given future actions should improve... data efficiency." This turned out to be true; SPR had a 55% better score than any prior agent on the 100k Atari task.

Are you confused? Doesn't that sound something like model-based learning, to anticipate the future? Good job noticing that, if so! SPR borrows somewhat from model-based learning, in that its state representation is predictive of future state representations. Or, slightly more precisely, its state representation is predictive of an (initially) random projection of the future state representation.

This takes us to the second point: the three-word summary of EfficientZero is "MuZero plus SPR".

This is a slight oversimplification. EfficientZero makes three large changes, of which only the largest is from SPR. And this largest change is not absolutely identical between the two. But nevertheless, if we allow lossy compression, saying that EfficientZero is "MuZero + SPR" is a pretty close summary of the thing.

So this is the history of EfficientZero -- it has MuZero as it's most recent ancestor on its model-based side, and SPR as it's most recent ancestor on its model-free side. So let's take a look at how MuZero works.

3: MuZero: How It Works

According to Julian Schrittweiser, one of the chief authors of MuZero, the mu in the title stands for at least three things.

First, it can be pronounced like Japanese for "dream", which makes sense because MuZero learns by imagining / dreaming about past experience. Second, it can be pronounced like the Greek character μ, which makes sense because that character frequently stands for a model. And finally, it can be pronounced like Japanese for "void" or "nothing", which makes sense because it receives no rules about the environment before it begins to learn.

(This last use for mu, of course, is the same as that in the first koan of "The Gateless Gate," a classic work of Zen Buddhism. Here, mu is given as the answer to the question "Does a dog have the Buddha nature?" The applicability to AI is left as exercise to the reader.)

Anyhow. My explanation of MuZero will proceed as follows.

First, I'm going to describe how the neural network within MuZero works, supposing (falsely) that it happened to start out perfectly trained. This will help us understand the intended semantics.

Second, I'm going to explain how MuZero can use an (imperfectly trained) neural network to produce policy distributions and value functions better than those that the neural network produces by itself. The method by which MuZero does this is called Monte-Carlo Tree Search (MCTS), although this name is quite misleading.

And third and finally, I'll explain how MCTS and the neural network interact with the environment to produce training data, which can be used to train the neural network towards the optimal policy.

(Note that in all of the below, although I refer to "MuZero" for brevity, I'm universally referring to the much, much more sample-efficient version "MuZero Reanalyse." This was introduced in the same paper as MuZero and later expanded further by DeepMind.)

3.1: The Neural Network

For all this section, I'll pretend that the neural network MuZero uses is already trained. This will help us understand the semantics for it.

There are three sub-networks that make up MuZero's neural network. Ultimately, these are all just pieces of one big, end-to-end differentiable and end-to-end trained network. But it's easiest to understand it as broken into several bits. (In all that follows I'll subscript the components of the neural network with $θ$ to indicate that all these pieces are defined by the parameters of the network, $θ$ .)

The first is the representation function. It takes the current and prior observations, and emits a state. This state is the input for the other two networks.

If we call the representation function network $h_{θ}$ , then you can picture it thus:

H-Theta Image

So $h_{θ} (o_{t}, o_{t - 1}, o_{t - 2}, o_{t - 2}, . . .) \to s_{k}$ . (Here and henceforth, generally k = t, but k is used to index states within the neural network's imagined view of the environment rather than observations in the environment).

Don't think right now about what $s_{k}$ stands for, or what it compresses, or what it's useful for predicting. It's an empty vessel, which we assume to have been trained via back-propagation so that it gives us all the information necessary for the other two sub-parts of the network to do their task.

I'll note in advance: one of the things that makes MuZero stand out relative to other model-based neural networks is that in many others, the network is trained so that $s_{k}$ will have information useful, for instance, for correctly predicting future observations. This is, again, what many model-based video-game playing agents do: they predict a future action-conditioned frame of video from prior frames of video. There is no loss in MuZero, however, which dictates that this be the case.

What are these other tasks, though?

Well, the second network is the dynamics network. It takes a state $s_{k}$ , a hypothetical action, and then returns another hypothetical state $s_{k + 1}$ and reward estimated for that action $r_{k + 1}$ .

If we call the dynamics network $g_{θ}$ , then you can represent it graphically thus.

G-Theta Image

So $g_{θ} (s_{k}, a_{k}) \to s_{k + 1}, r_{k + 1}$ . This function has to be executed for every action whose effect you wish to predict.

Note that frequently one would call the dynamics network with different actions, to evaluate alternative possible trajectories. In such a case, the single branch shown above would separate into multiple, different $s_{k} s$ for the same k, and look more like a tree than a single trunk.

So now we know at least one thing the representation $s_{k}$ must contain information about; it must contain enough information about the environment to let us predict future rewards, conditional on different actions.

If the network is perfectly trained, then, if you get rewards of 0, 0, and 1 after doing the actions move left, move left, and move left in the actual environment from some particular state, then you should also get imagined rewards of 0, 0, and 1 from the neural network after doing the same actions in the imagined environment from a particular state. Note though, that while doing these actions in the actual environment would also let you see new observations each frame; the neural network is not trained to predict its observations, but only to predict the reward.

Nevertheless, if the dynamics network gave us perfectly accurate results, then we could use it to plan our actions.... sorta. Here's how:

From our current state $s_{k}$ , we could look at the reward immediately following every different action we could do. We would then have many different $s_{k + 1}$ s, together with the $r_{k + 1}$ s associated with them. And we could then expand each $s_{k + 1}$ to get all possible $s_{k + 2}$ and $r_{k + 2}$ , and so on and so forth.

Obviously, this breadth-first search gets pretty unmanageable pretty quickly, but it would at least let us choose actions leading to higher reward within very shallow depth.

But this isn't terribly useful. We often want to plan to receive rewards hundreds or thousands of steps into the future, and the search space becomes completely unmanageable if we try to plan with only the dynamics function.

So we need a final network: the prediction network.

The prediction network takes the aforementioned state $s_{k}$ and outputs the results of the state-value function $V_{π} (s_{k})$ and a distribution over actions for a policy $π (s_{k})$ . Ultimately, we want the policy to be the optimal policy and the state-value function to be the correct value for the optimal policy.

If we call the prediction network $f_{θ}$ , this makes the final piece of our situation look like this:

Totality-Theta

So $f_{θ} (s_{k}) \to V (s_{k}), π (s_{k})$ . You execute this part of the neural net for every state whose value function and action you are interested in.

This can let us plan much more easily. If the network is perfectly trained, we could use either the state-value function or the policy to perform the perfect action in any situation.

The policy distribution straightforwardly gives you the best action.

The value function is scarcely harder to use. We can simply run the dynamics function once for each possible action, and then, for each state the dynamics function spits out, we compute the state value function for that state. Then we just choose the action which has the highest state value associated with it.

Now we know the three outputs of the network -- the reward, the value function, and the distribution over actions. We know what values we want each of these to take. We want the reward to take the true value that it would take in the environment, conditioned on prior states and actions. And we want the value function and distribution to take the value corresponding to the optimal policy, conditioned on the same thing.

But how can we get the right data to train this network, if we drop the assumption that it is not already entirely trained?

3.2: Monte Carlo Tree Search

One thing to note about the above is that, in an imperfectly trained neural network, the values the network described above emits are not necessarily consistent.

What do I mean by this? Well, imagine a scenario where the network has produced a particular state, $s_{k}$ , from the representation function. And imagine furthermore that the policy function gives an even distribution over the two actions available, left and right.

But if you apply the dynamics function to the network, suppose that the dynamics function predicts that you get 1000 reward in the next state if you take action left, and continue to get this reward forever after, and the dynamics function predicts that you get -1000 reward in the next state if you take action right, and continue to get that reward forever after. So the even distribution of the policy at state $s_{k}$ is inconsistent with the reward predicted by the model at the next state. The two parts of the neural network don't form a consistent picture of the world together.

At the very highest level of abstraction, Monte Carlo Tree Search as used by MuZero just uses the neural network to produce policy and state value estimates that are more internally consistent.

Due to how the network is trained, as we'll see in 3.3, the policies produced are also better, but fundamentally that's not what's going on. If you screwed up the training data for the MuZero network very badly, well, MCTS would continue to produce more consistent values for the network over time, even if the values were very wrong.

In detail, Monte-Carlo Tree Search in MuZero produces:

State value estimates $V (s_{k})$ that are more consistent with later-in-time state-value estimates, policies, and rewards. (Rewards and state-value estimates ultimately play a larger role than policies in arriving at the right estimate.)
Policy estimates $π (s_{k})$ that are also more consistent with later-in-time state-value estimates, policies, and rewards. (As in the prior parenthetical.)
(MCTS does not produce a new target for the predicted reward.)

I'm actually going to skip lightly over how MCTS does this, because it remains ~~entirely~~ mostly unchanged through MuZero and EfficientZero.

Note that Monte Carlo Tree Search differs from both the familiar depth-first-search and breadth-first search in programming. Generally speaking, it will expand the best nodes in the game tree available first, using the value function and the policy from the neural network to figure out which are best. This means that it is much, much better at handling large state spaces, and why it was used as a policy improvement operator for a game like Go with a large state space. So by skipping it I don't mean to imply that it isn't important; just that it's relatively unimportant for the changes that EfficientZero makes, and that this essay is already way too long.

(You can read a summary of how MCTS works -- with gifs! -- on DeepMind's page on MuZero.)

One further thing to note.

The family of algorithms currently called "MCTS" originally involved -- as the name "Monte Carlo" suggests -- randomness. In short, originally, the algorithm tried to evaluate whether a game-state in a board game was good or bad by looking at whether you were likely to win or lose if you assume random or semi-random moves from this state to the end of the game.

MuZero's Monte-Carlo Tree Search... does not involve such rollouts to the end of the game. It only uses the neural network to evaluate the goodness of states. There's actually no randomness involved at all. So I think it's a little misnamed.

Anyhow, given that MCTS gives you a more consistent policy and value function, there are two things you can do.

You could use the new policy distribution to choose the best action, while actually choosing what to do in a real environment. MuZero does this.
You could use the more consistent policy distribution / value function as targets for the neural network, to train it to be better. MuZero also does this. So let's now turn to MuZero's training.

3.3: MuZero's Training

Let's suppose we have an untrained neural network, with the architecture described above. And let's suppose it has experienced several episodes in our environment, and saved the trajectories from these episodes. How can we train our neural network?

Well, the trajectories through the environment are just a series of observations, rewards, and actions.

$o_{0}, r_{0}, a_{0}, o_{1}, r_{1}, a_{1}, o_{2}, r_{2}, a_{2} . . . o_{T}, r_{T}$

It turns out to be relatively straightforward to train MuZero given such a history.

We choose a timestamp $t$ in the episode, and, using the representation network, generate a state $s_{k}$ for that timestamp.

We then generate subsequent states $s_{k + n}$ using the actual actions $a_{t}$ that the agent took at corresponding time-steps.

Having done so, then we can train the reward $r_{k + n}$ to simply match the environment's actual reward at time $s_{t + n}$ .

Training the value function is only a little more complicated.

If you remember, the state value function is just the expected return beneath a policy:

$V (s_{t}) = E [R | s_{t}, π = E [r_{t + 1} + r_{t + 2} + . . . . r_{t + N} | s_{t}, π$

This means that if we gathered enough data from the environment, with the policy, we could just move the value for the state-value function towards the sum of future rewards from the environment. Over a long enough time frame, the average return from each state, given that you're using a particular policy, will definitionally converge towards the state-value function value for that state and policy.

Simply training off entire rewards like this, though, tends to be somewhat slow empirically. We can improve our performance by bootstrapping: training the value-function to move towards a target that we create by summing together a short chain of rewards with the current value function at the end of the chain.

$V (s_{t}) \to r_{t + 1} + r_{t + 2} + r_{t + 3} + V (s_{t + 4})$

Here, we move $V (s_{t})$ towards the value on the right, which is constructed out of a series of actual rewards plus the current estimate of the state-value on a future state. Even if the state-value function starts off entirely randomly, this target will incorporate experienced rewards and over time cause $V$ to converge to the right value.

Finally, the target for the policy for the neural network is simply the improved / more consistent policy distribution generated for each state by Monte Carlo Tree Search. Here we lean heavily on MCTS; while the targets for the rewards and the value function are generated with data directly from the experienced episode trajectory, the target for the policy only indirectly uses this experience via the changes to the neural network.

So over time, training looks something like this.

At first, everything the network does is random. But even from random actions, the neural network can learn to predict the rewards $r_{t}$ it encounters in the environment. The rewards in the environment also begin to move $V (s_{k})$ towards an accurate value for a random policy.
But MCTS will now return a policy distribution improved by learned values for $r_{t}$ and $V (s_{k})$ . This MCTS-improved policy is used in action selection, so the action selection is now better than random. And it is also used as a target for training the policy distribution, so the policy distribution moves towards a better-than-random policy.
This means that the targets for $V (s_{k})$ will, in turn, change again, because the state-value function is policy-specific. The better the policy, the higher the correct value for $V (s_{k})$ . The state-value function no longer predicts the state-value for a random policy, but for a better-than random policy.
This loop continues, where improved estimates for the state-value function and rewards propagate through MCTS into the policy. Altogether, these improved estimates also improve the depth to which MCTS can search (although I haven't gone into how this detail works).
Because the targets for $V (s_{k})$ and $π$ are generated with MCTS, even old historical data can be relevant, because the network essentially imagines what the right values would be if it had experienced them with it's current estimates.

Over time, this process converges on (hopefully) the optimal policy.

4: EfficientZero: What It Changes

EfficientZero makes three independent changes on top of MuZero, which when taken together push it into better-than-human performance. If you drop any one of them, median performance on the Atari 100k drops to worse-than-human. They are, nevertheless, independent, and we can discuss them separately.

(EfficientZero also makes a few other changes to MCTS and network architecture; some of the networks within it are smaller than those in MuZero, for instance. The authors of the paper think these changes don't matter too much, so let's hope that they're right.)

4.1: Self Supervised Consistency Loss

Let's start off with a problem that MuZero shares with many model-free learning methods, even though MuZero is itself model-based: namely, it doesn't begin to learn until it receives non-uniform reward values.

All of the values that MuZero's network predicts -- the one-step reward, the state-value, the policy -- relate to the rewards from the environment. On one hand, this is good -- it means that the representations learned will necessarily be suitable for predicting the state-value and the policy, unlike representations learned over the course of predicting future observations. But on the other hand, this necessitates slow learning: it means that until the agent begins to get non-uniform rewards, the agent cannot learn.

This is obviously problematic in situations where non-zero rewards are rare. And it also presumably hurts performance even when rewards are relatively common.

To solve this problem, MuZero adds a new training target for the neural network. You can approximate it by saying that the state $s_{k}$ should be predictive of $s_{k + 1}$ .

The intuition is that the way you think about the world now should include anticipations of the way that you'll think about the world in the future. This is, as I mentioned in the section on historical background, pretty much exactly the same as intuition and implementation behind "Self Predictive Representations".

If you want to be slightly more technical: the state $s_{k}$ , fed through an action conditioned transformation network, should be predictive of an (initially random) projection from $s_{k + 1}$ . The network structure used to define this target is structured like the SimSiam feature-learning network, except it learns an equivalence across temporal steps and image augmentations rather than only an equivalence across image augmentations.

Stolen once more from the EfficientZero paper:

Self-Predictive Representations

So, essentially, the above pulls a transformation of the current state, fed through an action-conditioned dynamics network, close to a transformation from the future state. The way that the network in fact thinks about $s_{k + 1}$ should be something it can anticipate from how it thinks about $s_{k}$ ; the loss for training the neural network is produced from the difference between these two transformations. This loss is fed into the backpropagation through time when the network trains, along with the other three losses native to MuZero.

If we remove this from EfficientZero, overall median performance drops from 1.16 times better than human to a pathetic 0.34 of human performance; this is by far the largest drop.

Honestly, I'm... still puzzling over why this technique works so well.

Despite the just-so stories mentioned above about how intelligence surely involves prediction, I'm dubious about my ability to retrodict these results.

Here's part of the problem, as I see it. The basic architecture here dictates that the state representation should be, when training starts, predictive of an initially random projection of future states. But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There's a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there's a just-so story where adding this extra predictive loss results in better end-to-end behavior because of faster training. I'm not sure why one turned out to be true rather than the other.

This ambiguity of why technical changes work is a persistent problem for deep learning. Batch normalization is a ubiquitous deep learning technique, but the story about why it works in the paper where it was published currently seems false. Its hard to make further progress even in capabilities when your understanding of the situation is fuzzy, let alone in interpretability.

This is the most important part of the paper and although the actual implementation details make perfect sense, I'm still not 100% sure why it works.

4.2: Value Prefix

Imagine that a knife is dropping towards your hand, blade down, from about five feet up. In the dark, so you cannot see it.

You anticipate, under the circumstances, some pain if you don’t act. You aren't sure exactly when the pain is going to occur; you just have a rough timeframe of “in the next second or so”. Which is all that you need for acting, assuming your hand isn't trapped in place.

A lot of RL algorithms, though, try to predict the exact moment that the pain occurs. This is difficult. MuZero tries to predict this, for instance -- the model tries to predict $r_{t}$ at precisely $t$ steps into the future. This is wasting computational power, in some sense -- you don't need to know exactly when some particular reward happens. And according to the authors, the uncertainty about the exact frame where pleasure or pain occurs impacts search with the MCTS, which in MuZero strongly depends on the model's predictions of when pleasure or pain occurs.

On the intuition that anticipating the pain (or pleasure) is important, but anticipating the exact timing of the pain (or pleasure) might not be, EfficientZero changes this somewhat. Rather than predicting the exact reward for a particular state in a particular timestep, $r_{t}$ , it instead tries to predict the cumulative reward for several states over a window of timesteps, $r_{t} + . . . . r_{n}$ . This learned value the paper calls the value prefix, and it is used, rather than the reward, in MCTS.

If you drop the value prefix learning from EfficientZero, median performance drops from 1.16x to 0.55x of the human median.

4.3: Off-Policy Correction

This third improvement does less than the other two. Dropping it from the model only lowers performance from 1.16x to 0.86x human median performance. Which, I should note, is still an enormous gain in sample-efficiency.

The motivation behind this addition is nevertheless the clearest of all three, I think. I'm most confident that the story about why it works is the actual reason it works.... which is a little depressing, but here goes anyhow.

As mentioned in the description of MuZero, when training the state-value part of the neural network, we construct targets for it from a combination of actual experienced rewards and predicted state-values on a later state.

$V (s_{t}) \to r_{t + 1} + r_{t + 2} + r_{t + 3} + r_{t + 4} . . . . + V (s_{t + n})$

Here, the target for $V (s_{k})$ is constructed from the actual rewards received by the agent,when the agent was acting in the environment over the remembered trajectory from t + 1, t + 2, etc. The estimated state value of $s_{t + n}$ is then added to the summed rewards so far, as part of the standard reinforcement learning bootstrapping.

Here's the problem. Suppose that this remembered trajectory comes from early on, when the agent was stupider.

If that's the case, then the trajectory here is likely sub-optimal. The current agent wouldn't take the same actions that the past agent took. That is, the rewards above are taken from an experienced trajectory that looks like this:

$s_{t}, r_{t}, a_{t}, s_{t + 1}, a_{t + 1}, r_{t + 1}, s_{t + 2}, a_{t + 2}, r_{t + 2} . . .$

In this experienced trajectory, then, the states and rewards experienced obviously depend on the actual actions taken. Which means that the sequence of rewards is likely not the same that the agent (in its more trained state) would take, just as the final state $s_{k + n}$ might not be the final state the agent (in its more trained state) would find itself in.

If the agent were in that situation now, it would take different actions; but it's still being updated towards a target as if it would have blindly done the same thing over and over.

The fix here is remarkably simple: the older an episode is, the shorter the imagined sequence of rewards off of which you bootstrap should be.

If the trajectory that you imagine, giving you the sequence of rewards shown above, is relatively recent, then you'd still probably make the same decisions now. So you can sample from a relatively long trajectory. Your update might look like this:

$V (s_{t}) \to r_{t + 1} + r_{t + 2} + r_{t + 3} + r_{t + 4} + V (s_{t + 5})$

On the other hand, if the trajectory that you imagine is relatively old, then you'd likely make different decisions now. So you should sample from a relatively short trajectory. Your update might look like this:

$V (s_{t}) \to r_{t + 1} + V (s_{t + 2})$

Doing so increases the accuracy of the value-target when learning from older saved trajectories. MuZero, when in sample-efficient mode, saves a lot of trajectories to learn from.

5: Conclusions

I'm a fan of how EfficientZero is three independent additions to MuZero piled on top of each other. Publishing this rather than the least publishable unit is quite laudable. At least, it's laudable from the perspective where we are concerned with academic Goodharting, rather than from the perspective where we are concerned with short AGI horizons.

There are a few expectations I take away from this work, generally.

First, I expect this work to be quickly surpassed and quickly built upon.

Other papers already implicitly suggest ways to improve EfficientZero. For instance, two days after EfficientZero came out, DeepMind released a paper on generalization with self-supervised world models. While this paper focuses on generalization rather than sample efficiency, it compares what is almost exactly the self-predictive representations of EfficientZero with several other related techniques.

One of these techniques, explicit contrastive learning, performs notably better than the self-predictive representations. ~~I expect that if you swapped out SPR with constrastive learning in EfficientZero, you'd get an immediate if modest bump in sample efficiency.~~ (Edit: I'm no longer expect this, see Ankesh's comment below.)

That's a very low hanging fruit, but there are others out there, easy to imagine with just a little effort, which are only a small step away in the high-dimensional space of RL design algorithms. In some cases you just need to take another step in the same direction. Not all of these imaginable alterations will work, but some of them will. A world where one paper contributes 3 useful improvements to a top model-based learning method, and where none of those improvements involves several large steps in design-space, is probably not a world where such improvements are rare.

And the compute requirements for EfficientZero were relatively modest; the thing holding people back from attempting to take these further steps is, honestly, probably neither the need for compute, nor the ability to think of potential improvements, but the engineering talent needed for performing the experiments quickly. And potentially, the lack of interest in sample-efficiency; the ease with which we can scale up RL algorithms to use billions of frames from multiple actors on different machines has probably contributed to a lack of research in this direction.

So, I forecast significant improvements to this within the timeframe of six to twenty-four months, in private if not in public. I expect at least 25% gain in sample-efficiency towards the start of this time period, and at least a 100% gain in sample-efficiency towards the end. I also expect attempts to extend these results to the entire Atari testbed of 57 games, with some success. I also expect attempts to get these results on larger, more interesting video games.

I expect that these efforts involving larger video games will work well. A 2020 paper showed that the MCTS family of algorithms in some sense approximated policy-optimization. Policy optimization has been used for large projects such as Dota2, so I expect that further EfficientZero work will also be successful in this area.

Second, it seems extremely likely that over the next one to four years, we'll see a shift away from sample-efficiency on these single-game test-beds, and on to sample efficiency in multi-task domains.

What makes me confident about this? Well, EfficientZero is very roughly 500 times more sample-efficient than the original Atari-playing Deep-Q-Network paper of 2015, which came out six years ago. If in six more years, we increased sample efficiency by another factor of 500, then the algorithms would be able to go from complete ignorance of the world to human-equivalent scores after less than thirty seconds of experience playing the game. This seems quite probably impossible, even for a superintelligence - you need more time with many games to figure out the rules alone.

I don't think this (necessarily) means that we'll have super-intelligent artificial intelligence in six years. I think instead it means that -- after further improvements mentioned above -- we'll hit diminishing returns on sample-efficiency in Atari, just like image classification hit diminishing returns on ImageNet. The only space where there will really be room to make interesting progress will be on bigger tasks, so interesting research will focus on these bigger tasks.

So sample-efficiency will turn to more complex domains, whether they be things like Starcraft or multi-task testbeds. MuZero-based methods might have some difficulty in some of these. In multi-task domains, for instance, MuZero will have problems because it learns state representations explicitly adapted for the rewards that it is receiving. Learning more reward-neutral representations could be challenging -- although I expect the challenge to be superable.

Third, and finally, I think this work is moderate to strong evidence that even without major conceptual breakthroughs, we're nowhere near the top of possible RL performance.

To repeat myself: a world where one paper contributes 3 useful improvements to a top model-based learning method, and where none of those improvements involves several large steps in design-space, looks to me very different than a world where these improvements are rare and hard to find. It seems to me likely that there are more improvements to be found without enormous Copernican shifts, and without the development of new back-propagation-level insights.

Great post! I think you might wanna emphasize just how crucial ReAnalyse is for data-efficiency (the default MuZero is quite sample in-efficient), and how the reanalyse-ratio can be tuned easily for any data budget using a log-linear scaling law. You can also interpret the off-policy correction thing as running ReAnalyse twice, so my TL;DR of EfficientZero would be "MuZero ReAnalyse + SPR".

Regarding contrastive vs SPR, I don't think you would find a performance boost using a contrastive loss compared to SPR on Atari at least. We did an ablation for this in the SPR paper (Table 6, appendix). I suspect the reason contrastive works (slightly) better on Procgen is because of the procedural diversity there making negative examples much more informative.

Definitely agree about moving to multi-task test beds as the next frontier in RL. I also suspect we would see more non tabula-rasa RL methods, ones that start off with general-purpose pre-trained models or representations and then only do a tiny amount of fine-tuning on the actual RL task.

Agreed, I added an extra paragraph emphasizing ReAnalyse. And thanks a ton for pointing that out that ablation, I had totally missed that.

I remain pretty happy with most of this, looking back -- I think this remains clear, accessible, and about as truthful as possible without getting too technical.

I do want to grade my conclusions / predictions, though.

(1). I predicted that this work would quickly be exceeded in sample efficiency. This was wrong -- it's been a bit over a year and EfficientZero is still SOTA on Atari. My 3-to-24-month timeframe hasn't run out, but I said that I expected "at least a 25% gain" towards the start of the time, which hasn't happened.

(2). There has been a shift to multitask domains, or to multi-benchmark papers. This wasn't too hard of a prediction, but I think it was correct. (Although of course good evidence for such a shift would require comprehensive lit review.)

To sample two -- DreamerV3 is a very recently released model-based DeepMind algorithm. It does very well at Atari100k -- it gets a better mean score then everything but EfficientZero -- but it also does well at DMLab + 4 other benchmarks + even crafting a Minecraft diamond. The paper emphasizes the robustness of the algorithm, and is right to do so -- once you get human-level sample efficiency on Atari100k, you really want to make sure you aren't just overfitting to that!

And course the infamous Gato is a multitask agent across host of different domains, although the ultimate impact of it remains unclear at the moment.

(3). And finally -- well, the last conclusion, that there is still a lot of space for big gains in performance in RL even without field-overturning new insights, is inevitably subjective. But I think the evidence still supports it.

Nice summary! I agree, this is an interesting paper :)

But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There's a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there's a just-so story where adding this extra predictive loss results in better end-to-end behavior because of faster training. I'm not sure why one turned out to be true rather than the other.

This mostly depends on the size of your dataset. For very small datasets (100k frames here), the network is overparameterized and can easily overfit, adding the consistency loss provides regularisation that can prevent this.

For larger datasets (eg standard 200 million frame setting in Atari) you'll see less overfitting, and I would expect the impact of consistency loss to be much smaller, possibly negative. The paper doesn't include ablations for this, but I might test it if I have time.

To phrase differently: the less data you have for your real objective the more you can benefit from auxiliary losses and regularisation.

Yes, that was the comment I meant to leave but apparently didn't: it's just another bias-variance tradeoff. In the limit (say, 20b frames...) all of these regularizations and auxiliary tasks (and/or informative priors) are either going to be neutral or hurt converged performance compared to pure end-to-end reward-only learning. And they should, if you do them right, help most early on when data is scarce and the end-to-end reward-only approach hasn't been able to learn much. This isn't post hoc, it's just what any ML person should predict from bias-variance tradeoff. The devil is in the details, though, and you could be doing any of it wrong or not be where you think you are in the tradeoff, and that's where this sort of research finding lives.

Ah, that does make sense, thanks. And yeah, it would be interesting to know what the curve / crossover point would look like for the impact from the consistency loss.

Epistemic Status: I don't actually know anything about machine learning or reinforcement learning and I'm just following your reasoning/explanation.

From each state, we can just check each possible action against the action-value function $q(s_t, a_t), and choose the action that returns the highest value from the action-value function. Greedy search against the action-value function for the optimal policy is thus equivalent to the optimal policy. For this reason, many algorithms try to learn the action-value function for the optimal policy.

This does not actually follow. Policies return probability distributions over actions ("strategies"), and it's not necessarily the case that the output of the optimal policy in the current state is a pure strategy.

Mixed strategies are especially important and may be optimal in multi agent environments (a pure Nash equilibrium may not exist, but a mixed Nash equilibrium is guaranteed to exist).

Though maybe for single player decision making, optimal play is never mixed strategies? For any mixed strategy, there may exist an action in that strategy's support (set of actions that the strategy assigns positive probability to) that has an expected return that is not lower than the strategy itself? I think this may be the case for deterministic environments, but I'm too tired to work out the maths right now.

IIRC randomised choice is mostly useful in multi-agent environments, environments where the environment has free variables in its transition rules that may sensitive to your actions (i.e. the environment itself can be profitably modelled as an agent [where the state transitions are its actions]), or is otherwise non deterministic/stochastic (including stochastic behaviour that arises from uncertainty).

So I think greedy search for the action that attains the highest value for the optimal policy's action value function is only equivalent to the optimal policy if the environment is:

Deterministic
Fully observable/the agent has perfect information
Agent knows all the "laws of physics"/state transition rules of the environment
Fixed low level state transitions that do not depend on agent

(I may be missing some other criteria necessary to completely obviate mixed strategies.)

I think these conditions are actually quite strong!

That's 100% true about the quote above being false for environments for which the optimal strategy is stochastic, and a very good catch. I'd expect naive action-value methods to have a lot of trouble in multi agent scenarios.

The ease with which other optimization methods (i.e., policy optimization, which directly adjusts likelihood of different actions, rather than using an estimate of the action-value function to choose actions) represent stochastic policies is one of their advantages over q-learning, which can't really do so. That's probably one reason why extremely large-scale RL (i.e., Starcraft, Dota) tend to use more policy optimization (or some complicated mixture of both).

Re. the bullet list, that's a little too restrictive, at least in some places -- for instance, even if an agent doesn't know all (or even any) of the laws of physics, for instance, in the limit of infinite play action-value based methods can (I think provably) converge to true values. (After all the basic Q-learning never even tries learning the transition function for the environment.)

I think Sutton & Barto or Bertsekas & Tsitsiklas would cover the complete criteria for q-learning to be guaranteed to converge? Although of course in practice, my understanding is it's quite rare for environments to meet all the criteria and (sometimes!) the methods work anyhow.

Although of course in practice, my understanding is it's quite rare for environments to meet all the criteria and (sometimes!) the methods work anyhow

I'm sleep deprived as I wrote that/am writing this, so I may be making some technical errors.

The list was supposed to be conditions under which there (is guaranteed to) exist(s) an optimal policy that assigns a pure strategy to every state.

This doesn't rule out the existence of environments that don't meet all these criteria and nonetheless have optimal policies that assign pure strategies to some or all states. Such an optimal policy just isn't guaranteed to exist.

(Some games have pure Nash equilibria/but pure Nash equilibria are not guaranteed to exist in general.)

That said, knowing the laws of physics/transition rules was meant to cover the class of non stochastic environments with multiple possible state transitions from a given state and action. (Maybe one could say that such environments are non deterministic, but the state transitions could probably be modelled as fully deterministic if one added appropriate hidden state variables and/or allowed a state's transition to be path dependent.)

It's in this sense that the agent needs to know the transition rules of the environment for pure strategies to be optimal in general.

Curated. EfficientZero seems like an important advance, and I appreciate this post's length explanation, broken into sections that made it easy to skim past parts I already understood.

Thank you for writing this! I usually have to find a few different angles to look at a paper from before I feel like I understand it, and this kind of thing is super helpful.

To solve this problem, MuZero adds a new training target for the neural network

Should probably be edited MuZero -> EfficientZero.

Anyhow, great post. I enjoyed the EfficientZero paper and would recommend it to any other interested dilletantes, and this post did a good job putting things in the context of previous work.

The temporal dynamics used in the model seem to be really simple/cheap. Do you think they're just getting away with simple dynamics because Atari is simple, or do you think that even for hard problems we will find representations such that their dynamics are simple?

I know little about machine learning, so that may be a dumb question: in linguistic there is an argument called the poverty of stimulus. The claim is that children must figure out the rules of language using only a limited number of unlabeled examples. This is taken as evidence that the brain has some kind of hard-wired grammar framework, that serves as a canvas for further learning while growing up.

Is it possible that tools like EfficientZero help find the fundamental limits for how much training data you need to figure out a set of rules? If an artificial neural network ever manages to reconstruct the rules of English by using only the stimulus that the average children is exposed too, that would be a strong counter-argument against poverty of stimulus.

I would be careful using reinforcement learning to check for theoretical maximization of training data, given that plenty of agents generally do not start out with 0 bits of information about the environment. The shape of input data/action space is still useful information.

Even in designing the agent itself, it seems to me that general knowledge of human-related systems could be introduced into the architecture.

Selecting the architecture that gives us highest upper-bound for information utilization in a system is also, in some sense, inserting extra data.

At the start, I thought you were going to in a different direction:

Given that we are giving AIs frameworks, just how far can we go with that? And will they be useful outside of this or that set of games?

Is it possible that tools like EfficientZero help find the fundamental limits for how much training data you need to figure out a set of rules? If an artificial neural network ever manages to reconstruct the rules of English by using only the stimulus that the average children is exposed too, that would be a strong counter-argument against poverty of stimulus.

Unless, that artificial network comes with a grammar framework?

In section one, where you define the action-value function, you use for return where I believe you intended to use $G$ .

Thank you so much for writing this! Strong-upvoted.

ot+1

o(t+1) or o_(t+1)?

Thanks! This was a super informative read, it helped me grok a few things I didn't before.

The MCTS naming is surprising. Seems strange they're sticking with that name.

BTW, does anyone know of any similar write-ups to this on transformers?

Great post!

Do you mind if I ask you what is the amount of free parameters and training compute of EfficientZero?

I tried scanning the paper but didn't find them readily available.

In appendix A.6 they state "To train an Atari agent for 100k steps, it only needs 4 GPUs to train 7 hours." I don't think they provide a summary of the total number of parameters. Scanning the described architecture though, it does not look like a lot -- almost surely < 1B.

A single ALE game is just not that complex, so ALE models are never large. (Uber once did an interesting paper on quantifying how small the NNs could be for ALE games.) The parameter count will be much closer to 1m than 1b. You can also look at the layer types & sizes in Appendix A: even without trying to calculate out anything, with a few convolution layers and a normal-sized LSTM layer and not much else, there's simply no way it's anywhere near 1b.

(As is pretty much always the case. Models in DRL are usually small, especially compared to supervised/self-supervised stuff. I'm not sure there's even a single 'pure DRL' model which cracks the 1b scale. The biggest chonkers might be like, MetaMimic or AlphaZero or AlphaStar, which would be in the low hundreds? So that's probably why DRL papers are not in the habit of reporting parameter counts everywhere or scaling them up/down like you might assume these days. That'll have to change as more self-supervised models are used.)

The link here is dead, can you find a more up to date one? (if you copy-paste a screenshot into LessWrong editor it should successfully create it's own copy)

It's working for me? I disabled the cache in devtools and am still seeing it. It looks like it's hitting a LW-specific CDN also. (https://res.cloudinary.com/lesswrong-2-0/image/upload/v1674179321/mirroredImages/mRwJce3npmzbKfxws/kadwenfpnlvlswgldldd.png)

Why hasn't this algorithm had an impact on our lives yet? Why isn't this algorithm working with robotics yet?

This particular algorithm can't do long-term planning. It's reactive. It only plans about five steps out, which is very short in the videogame space and would be in robotics. And it only works in discrete spaces.

Both of those limitations have been addressed in related work but they haven't been integrated with this algorithm afaik.

But I agree that deep network robot control is going to work decently well, very soon.

Thanks for writing this! I have a few questions, to make sure I'm understanding the architecture correctly:

So, essentially, the above pulls a transformation of the current state, fed through an action-conditioned dynamics network, close to a transformation from the future state.

Is this transformation an image augmentation, like you mention regarding SimSiam? Is the transformation the same for both states? And is the dynamics network also trained in this step?

Is the representation network trained in the same operation that trains the state-value and action-value functions? If not, what is to stop the representation function from being extremely similar regardless of the actual state?

The third technique, the one where older episodes have fewer timesteps before using the state-value function to truncate the rewards, makes me wonder if there has been any research where the final policy "grades" the quality of previous actions. It seems like that could shed some light on why this sort of technique works.

How it could be connected with GPT-like language models?

In the most obvious fashion you could use a GPT style LM as your world model, then train some reasonably fast utility function (trained from some mech turk dataset say) to grade text quality, and then apply MCTS on top of it to get a writing agent with the ability to actually plan ahead what it's writing before outputting each token. Problem is the compute cost would be enormous, so this naive approach naturally isn't feasible yet given current techniques.

It looks like transformers started to eat reinforcement learning:
[Trajectory Transformer] Offline Reinforcement Learning as One Big Sequence Modeling Problem
Michael Janner, Qiyang Li, Sergey Levine
https://arxiv.org/abs/2106.02039

Karpathy had a tweeter thread about it. "The ongoing consolidation in AI is incredible. Thread: When I started ~decade ago vision, speech, natural language, reinforcement learning, etc. were completely separate; You couldn't read papers across areas - the approaches were completely different, often not even ML based."...

How enormous are we talking? A 10x multiplier compared to writing with GPT the usual way, or 1000x, or 100,000x? If it is 100x or less, I'm kinda surprised no one has done it yet. (Or maybe they have and it didn't work?)

So if the cost of your predictive model is X flops per timestep, and you run Y total simulated MCTS timesteps (across all evaluated branches) per physical timestep, then the total cost is XY.

We know that for EZ on Atari XY ~ 10^14 flops/s, and suspect that X is roughly 10^6 to 10^8 (see gwern's comment), so I think Y is on order 10^6 or more?

The market cost of GPT-3 is 8 cents per 1000 tokens, apparently. So using the same massive sim rollout as EZ with Y on order 10^6 equates to ~$80 per token. I'd have to re-read the EZ paper in more detail to get a better idea of how the planning horizon compares, but in general it does seems pretty clear that EZ doesn't scale well with model/environment complexity.

EDIT: Oh, actually it's quite possible that commercial GPT-3 is already doing multiple rollouts, so a better estimate would be to divide total training cost by total training tokens.

cost = 3x10^23 flops / 3x10^14 tokens = 10^9 flops/token, so roughly 10^15 flops/token with EZ style 10^6 MCTS. That seems doable but may not be because of batching. In other words to get that performance on GPUs you may need to run 1000 such models in parallel, so instead of 10 gpus you are talking about 10,000 - but not totally unreasonable .. hmm.

Thanks! Huh, that's weird. How could GPT-3's training cost 10^9 flops/token when it has more than 10^9 parameters? Was it only using 1% of its parameters for any given prediction?

Ah, right, it was 3x10^11 tokens, not 14. So it would actually be 10^18 flops/token with EZ style 10^6 MCTS.

New question: Why hasn't someone done a smaller GPT-3 with "merely" 10^3 MCTS? You could shave off e.g. 4 OOMs of cost this way, for a total of 10^14 flops/token.

Oh whoops my bad, yeah didn't double check that, added a few OOM to #tokens. (And in retrospect should expect training cost to be ~params^2 for this kind of model)

As to the 2nd question, at ~10^12 flops/token, which then becomes 10^15 for a batch of 1000 independent tokens, it already requires a number of GPUs or a small cluster to run in real-time, so even 10^3 MCTS is a significant ask - for what I assume isn't worth the cost (it may only be a sentence or two of lookahead, and I recall they already experimented with some simple non MCTS forms of rollouts, although that may have been the cortex variant).

Naively I'd expect that GPT-3 spending 1000x more compute per token on rollouts or MCTS would produce text that is noticeably more coherent, correct, etc. At least with some fine-tuning, I'd expect that. If it doesn't help much, or at all, that would be good to know!

So to be clear I do think some form of planning is an obvious route foward for NLMs like GPT-3, and that MCTS is the current default choice, but the benefit mostly comes from full training of the policy with MCTS. And 1000x GPT-3 training cost seems a tad steep; I'm pretty confident that better algorithms are available for much less than billions $.

Wanna venture any predictions? So far as far as I know there hasn't been any serious attempt to train a large language model with MCTS or planning; are you predicting there will be one in the next three years and that it will be exciting / set SOTAs?

I would argue that we have seen a lot of dramatic performance gains with large models using planning. We just call the random-shooting method of planning 'sampling' or 'best-of sampling'. When you roll out 20 or 100 answers to a math or translation problem with GPT-3 etc and you score them with likelihood to rank or feed them into the model for a classification to accept/reject, this is not MCTS, but is this not planning?

Good point, I do recall that happening and I do recall there were gainz... If even that simple method produces dramatic performance improvements, wanna venture any predictions about whether more advanced methods will be tried in the next three years and whether they'll lead to further dramatic improvements?

I'm less sure about that. There's the strange and troubling repetition trap, which still afflicts the largest models as far as we know. And as you know from the MCTS & DRL literature, planning has rapidly diminishing returns compared to using a better base model even in problems with extremely clear objectives. The current random-shoot/criticize/self-distill approach may get you most of the gains already.

Thanks. Why is the repetition trap strange and troubling? If I saw a word or phrase repeated thrice, and were asked to predict the next word, I'd just keep repeating that word or phrase. Heck, even if it were only repeated twice, I'd probably seriously consider it. Whose to say these LMs aren't behaving optimally?

It's strange and troubling because it means that any tree search using likelihood ultimately gets sucked into an attractor of a repetition trap, which is pragmatically bad (there are no complete/consistent search procedures - you can't just plan deeper to get monotonically more quality, no matter how much compute you're willing to spend), raises a lot of questions about what the predictions mean or do (if you can't score a sequence or tree by likelihood, no matter how high-quality your starting point, what does the likelihood even mean?!), means even the simplest sampling procedure may send the model haywire if it ever chances to repeat a word, doesn't seem to be improving with scaling, and doesn't afflict non-likelihood non-autoregressive* non-text models (eg if you want to argue 'oh sure, it's totally plausible to just predict sequences consisting of the same token like 'aaaaaa' repeated eternally, that's fine', why don't the, say, diffusion text models or image-GPT models also suffer from it?). I don't think humans would put much probability on such sequences, even conditionally: we'd think that at some point the sequence would stop, because why would there be such gibberish?

* I'm not even sure of this. I've noticed that I've never seen anyone talk about the repetition trap in bidirectional text models' generation of text, but does that mean they are immune for some reason where unidirectional models fall to it, or that just no one uses them for freeform text generation enough to bother noting it?

I don't think humans would put much probability on such sequences, even conditionally: we'd think that at some point the sequence would stop, because why would there be such gibberish?

I think the intuition behind your remark "why would there be such gibberish?" actually goes most of the way to explaining the repetition trap.

The key thing about pathologically repetitive sequences is that they are . . . pathologically repetitive, i.e. out-of-distribution for natural text.

Once you're already in one, I don't think it's really so obvious that the repetition should eventually stop. Yes, that's what a human writer would do -- but a human writer wouldn't have produced the conditioning sequence to begin with.

We start out with a prior that puts high weight on "this belongs to some natural genre of text," and low weight on "this belongs to a weird hyper-repetitive 'genre' of text." But eventually, after enough bad predictions from the former and enough accurate predictions from the latter, we really ought to yield to the evidence and update. Eventually it should become clear that the question "why would there be such gibberish?" has some answer, since we keep observing "such gibberish" and not anything else.

But why does LM sampling enter the trap to begin with? I think there needs to be some "initial misstep," where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.

In other words, repetition is special because it's a way of going off-distribution where there is, nonetheless, a single "obvious" way to continue the text, and continuing it thus will keep you in the same off-distribution region. Whereas most ways of going off-distribution are just confusing, and don't have a legible structure the LM would have learned from in-distribution training.

I would expect scale to lower the probability of the "initial mistake," and thus reduce the fraction of samples that are repetitive (is this borne out in practice?). I don't expect scale to make LMs stop assigning high likelihood to repetitive continuations of unnaturally repetitive prefixes, since I think that's a conditionally correct judgment.

For practical purposes, I've found my custom sampler to be pretty effective solution, though sometimes the LM still "wins the fight," as in this amusing example.

Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don't believe you, I just would have expected otherwise, so I'm curious.

But why does LM sampling enter the trap to begin with? I think there needs to be some "initial misstep," where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.

I think that's a possible interpretation. I'm still not sure why it wouldn't affect all the other possible models, though, and it seems like we should also see the problem get better with model scaling as the 'misstep' estimation disappears. If you are sampling token by token, the probabilities from GPT-3 over the 51k BPEs ought to be much better than GPT-2 (also 51k BPEs, also English text scraped from the Internet) etc: after all, that is the token it has the very highest predictive accuracy on. How accurate does a model have to get on the initial token before the initial misstep stops screwing not just with tree search, but regular sampling too?

I would expect scale to lower the probability of the "initial mistake," and thus reduce the fraction of samples that are repetitive (is this borne out in practice?).

It doesn't really seem like it. I think if you have the impression that it is, it is because we use sampling strategies designed specifically to eliminate it, like top_p. Nucleus sampling definitely does tamp down on it, but I don't think it's perfect, and it's clearly a hack which doesn't fix the problem with tree search and introduces biases of its own just like top-k. Regular sampling still seems to go haywire. (I dunno if anyone has checked GPT-3 the way the nucleus sampling paper and others check GPT-2 and others. Would get expensive.)

Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don't believe you, I just would have expected otherwise, so I'm curious.

I've seen hundreds of iGPT completions and random examples, and not a single one ever just 'starts repeating' ad nauseam; nor has anyone ever pointed such failure cases out. (In contrast, the 'tessellation' that naive CLIP maximization causes is extremely obvious, and you can't sample GPT on naive non-top-p/repetition-penalization settings without running into it well before hundreds/thousands of examples.) Maybe I'm wrong and there is repetition at some level which isn't obvious to the naked eye, like high-frequency Fourier components (although I'm not sure how that would be possible with the superpixel approach), and if someone shows me iGPT (or DALL-E/CogView, or DDIM, or...) samples which are clearly the repetition trap in action, I'll change my mind but it's been years now, so I'm not holding my breath.

If repetitions arise from sampling merely due to high conditional probability given an initial "misstep", they should be avoidable in an MCTS that sought to maximize unconditional probability of the output sequence (or rather conditional upon its input but not upon its own prior output). After entering the "trap" once or a few times, it would simply avoid the unfortunate misstep in subsequent "playouts". From my understanding, that is.

Thanks! Very interesting point about the lack of this in image-GPT etc. I have no comment there, not understanding them on a technical level.

I totally think that humans would put lots of (conditional) probability on such sequences. It's true that we'd predict the sequence would stop eventually, but that's not relevant; what's relevant is: You see it repeating for N times so far. What's the probability that it goes on for at least N+1 times total? That probability goes up and up with N, not down and down, even though you are supremely confident that N is finite and even though your (unconditional) credence in the sequence as a whole goes down and down with N.

I think you may be correct that even humans would increase probability of a repetition continuance with N up to a point. The difference could be that humans are using a much larger compressed historical context, so when reading something like Moby Dick, the prior for any serious repetition is absurdly low, and it never comes up.

Also humans read fundamentally differently through vision, and even when the retina is focusing on just a word or two at a time, you are also getting some bits of signal for surrounding future text, and big repetitions would be fairly obvious.

Agreed, I added an extra paragraph emphasizing ReAnalyse. And thanks a ton for pointing that out that ablation, I had totally missed that.

I remain pretty happy with most of this, looking back -- I think this remains clear, accessible, and about as truthful as possible without getting too technical.

I do want to grade my conclusions / predictions, though.

And course the infamous Gato is a multitask agent across host of different domains, although the ultimate impact of it remains unclear at the moment.

Nice summary! I agree, this is an interesting paper :)

But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There's a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there's a just-so story where adding this extra predictive loss results in better end-to-end behavior because of faster training. I'm not sure why one turned out to be true rather than the other.

To phrase differently: the less data you have for your real objective the more you can benefit from auxiliary losses and regularisation.

Ah, that does make sense, thanks. And yeah, it would be interesting to know what the curve / crossover point would look like for the impact from the consistency loss.

Epistemic Status: I don't actually know anything about machine learning or reinforcement learning and I'm just following your reasoning/explanation.

From each state, we can just check each possible action against the action-value function $q(s_t, a_t), and choose the action that returns the highest value from the action-value function. Greedy search against the action-value function for the optimal policy is thus equivalent to the optimal policy. For this reason, many algorithms try to learn the action-value function for the optimal policy.

Mixed strategies are especially important and may be optimal in multi agent environments (a pure Nash equilibrium may not exist, but a mixed Nash equilibrium is guaranteed to exist).

So I think greedy search for the action that attains the highest value for the optimal policy's action value function is only equivalent to the optimal policy if the environment is:

Deterministic
Fully observable/the agent has perfect information
Agent knows all the "laws of physics"/state transition rules of the environment
Fixed low level state transitions that do not depend on agent

(I may be missing some other criteria necessary to completely obviate mixed strategies.)

I think these conditions are actually quite strong!

Although of course in practice, my understanding is it's quite rare for environments to meet all the criteria and (sometimes!) the methods work anyhow

I'm sleep deprived as I wrote that/am writing this, so I may be making some technical errors.

The list was supposed to be conditions under which there (is guaranteed to) exist(s) an optimal policy that assigns a pure strategy to every state.

(Some games have pure Nash equilibria/but pure Nash equilibria are not guaranteed to exist in general.)

It's in this sense that the agent needs to know the transition rules of the environment for pure strategies to be optimal in general.

Curated. EfficientZero seems like an important advance, and I appreciate this post's length explanation, broken into sections that made it easy to skim past parts I already understood.

Thank you for writing this! I usually have to find a few different angles to look at a paper from before I feel like I understand it, and this kind of thing is super helpful.

To solve this problem, MuZero adds a new training target for the neural network

Should probably be edited MuZero -> EfficientZero.

Anyhow, great post. I enjoyed the EfficientZero paper and would recommend it to any other interested dilletantes, and this post did a good job putting things in the context of previous work.

Even in designing the agent itself, it seems to me that general knowledge of human-related systems could be introduced into the architecture.

Selecting the architecture that gives us highest upper-bound for information utilization in a system is also, in some sense, inserting extra data.

At the start, I thought you were going to in a different direction:

Given that we are giving AIs frameworks, just how far can we go with that? And will they be useful outside of this or that set of games?

Is it possible that tools like EfficientZero help find the fundamental limits for how much training data you need to figure out a set of rules? If an artificial neural network ever manages to reconstruct the rules of English by using only the stimulus that the average children is exposed too, that would be a strong counter-argument against poverty of stimulus.

Unless, that artificial network comes with a grammar framework?

In section one, where you define the action-value function, you use for return where I believe you intended to use $G$ .

Thank you so much for writing this! Strong-upvoted.

ot+1

o(t+1) or o_(t+1)?

Thanks! This was a super informative read, it helped me grok a few things I didn't before.

The MCTS naming is surprising. Seems strange they're sticking with that name.

BTW, does anyone know of any similar write-ups to this on transformers?

Great post!

Do you mind if I ask you what is the amount of free parameters and training compute of EfficientZero?

I tried scanning the paper but didn't find them readily available.

The link here is dead, can you find a more up to date one? (if you copy-paste a screenshot into LessWrong editor it should successfully create it's own copy)

Why hasn't this algorithm had an impact on our lives yet? Why isn't this algorithm working with robotics yet?

Both of those limitations have been addressed in related work but they haven't been integrated with this algorithm afaik.

But I agree that deep network robot control is going to work decently well, very soon.

Thanks for writing this! I have a few questions, to make sure I'm understanding the architecture correctly:

So, essentially, the above pulls a transformation of the current state, fed through an action-conditioned dynamics network, close to a transformation from the future state.

Is this transformation an image augmentation, like you mention regarding SimSiam? Is the transformation the same for both states? And is the dynamics network also trained in this step?

How it could be connected with GPT-like language models?

So if the cost of your predictive model is X flops per timestep, and you run Y total simulated MCTS timesteps (across all evaluated branches) per physical timestep, then the total cost is XY.

We know that for EZ on Atari XY ~ 10^14 flops/s, and suspect that X is roughly 10^6 to 10^8 (see gwern's comment), so I think Y is on order 10^6 or more?

EDIT: Oh, actually it's quite possible that commercial GPT-3 is already doing multiple rollouts, so a better estimate would be to divide total training cost by total training tokens.

Thanks! Huh, that's weird. How could GPT-3's training cost 10^9 flops/token when it has more than 10^9 parameters? Was it only using 1% of its parameters for any given prediction?

Ah, right, it was 3x10^11 tokens, not 14. So it would actually be 10^18 flops/token with EZ style 10^6 MCTS.

New question: Why hasn't someone done a smaller GPT-3 with "merely" 10^3 MCTS? You could shave off e.g. 4 OOMs of cost this way, for a total of 10^14 flops/token.

Oh whoops my bad, yeah didn't double check that, added a few OOM to #tokens. (And in retrospect should expect training cost to be ~params^2 for this kind of model)

I don't think humans would put much probability on such sequences, even conditionally: we'd think that at some point the sequence would stop, because why would there be such gibberish?

I think the intuition behind your remark "why would there be such gibberish?" actually goes most of the way to explaining the repetition trap.

The key thing about pathologically repetitive sequences is that they are . . . pathologically repetitive, i.e. out-of-distribution for natural text.

For practical purposes, I've found my custom sampler to be pretty effective solution, though sometimes the LM still "wins the fight," as in this amusing example.

Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don't believe you, I just would have expected otherwise, so I'm curious.

But why does LM sampling enter the trap to begin with? I think there needs to be some "initial misstep," where a sampled token makes the text just a bit too repetitive. This makes further repetition more likely (because the text is oddly repetitive) and everything else less likely (because the text is odd / OOD), so further repetition occurs, which makes the text more OOD and makes repetition a steadily better bet, and so on.

I would expect scale to lower the probability of the "initial mistake," and thus reduce the fraction of samples that are repetitive (is this borne out in practice?).

Do you have a source for iGPTs not exhibiting the repetition trap? Not that I don't believe you, I just would have expected otherwise, so I'm curious.

Thanks! Very interesting point about the lack of this in image-GPT etc. I have no comment there, not understanding them on a technical level.

307

EfficientZero: How It Works

307

Ω 64

1: Reinforcement Learning Primer

2: Historical Background

3: MuZero: How It Works

3.1: The Neural Network

3.2: Monte Carlo Tree Search

3.3: MuZero's Training

4: EfficientZero: What It Changes

4.1: Self Supervised Consistency Loss

4.2: Value Prefix

4.3: Off-Policy Correction

5: Conclusions

307

Ω 64

307

Ω 64