In this post, we (Alexander Meulemans, David Lindner, Florian Dorner) propose to study scaling laws relating the generalization capability of Reinforcement Learning (RL) agents to problem complexity, inspired by DeepMind’s recent success in training agents that perform well in a range of environments. We motivate studying such scaling laws by highlighting variable problem complexity as a key difference between language modeling and RL, and then propose a few concrete experiments that we hope readers with some knowledge of reinforcement learning might be able to perform.


DeepMind recently trained “generally capable” agents behaving somewhat competently in an environment with procedurally generated maps and tasks called XLand, using open-ended reinforcement learning. This work has gotten a lot of attention and sparked discussions about the result’s relevance for AI timelines, with parallels drawn to GPT. On one hand, the comparison to GPT makes sense, as the XLand results provide a proof of concept for achieving fairly general capabilities in the realm of agentic behaviour by using enough compute, just like GPT did in the realm of language modelling. On the other hand, it seems far from clear whether this advance in RL will have the same immediate economic effect language models seem to be having. As in language modelling, the training distribution for XLand will likely need to be related to the tasks the system is ultimately going to perform[1]. But unlike in language modelling, where we train a system on large amounts of text from the same language(s) that downstream tasks will be in, XLand still appears to be far from any real-world environment.

Therefore, to extrapolate from results in XLand, we don’t just require the same scaling laws relating performance to compute and dataset size we have for language models, but we also need to introduce an extra dimension for problem complexity[2] and perhaps closeness to real world applications[3]. As an analogy, imagine GPT-3 had only been trained and evaluated on abstract language, or sentences beginning with a subject, using the vocabulary of basic English. How confident would you be in extrapolating from such a model’s scaling with compute to the less restricted language models we have at this point?

We believe that a better understanding of the relationship between problem complexity and the compute costs required for good generalization performance is a crux for how we should update timelines based on the XLand results. Explaining neural scaling laws is a great first step in this direction, but it is unclear whether the concept of intrinsic dimension is sufficient for agents, which are usually not trained using supervised learning. In particular, it seems quite plausible that intricacies of Reinforcement Learning and Population Based Training, such as the use of bootstrapping, the need for exploration, or evolutionary contingency, could complicate scaling laws for agent performance.

To make initial progress on this problem, we propose to train agents on procedurally generated environments, using the same training algorithm, while varying problem complexity, model size, compute usage, and the number of training samples. Then, the relationship between these variables could be analyzed to derive empirical scaling laws, for example relating the amount of compute needed for a sufficient level of generalization performance to problem complexity[4].

In the rest of this post, we informally describe two versions of this project in more detail: An idealized version using XLand that actors with access to large amounts of compute could carry out, as well as a scaled-down and less resource-intensive version that is more realistic to pursue.

Experiment proposal in XLand

Suppose we had access to XLand and a large amount of computational resources. How could we study scaling laws about the generalization capability of trained agents? The basic scaling law that we are interested in would describe the amount of compute and data needed to achieve a given level of generalization performance for varying degrees of problem complexity. Hence, we need to properly define the terms sufficient generalization performance, problem complexity, and amount of compute and data. Before doing so, we review some key terminology from the XLand paper:

  • A world is a simulated region of 3-dimensional space. Worlds can have different topologies and contain both players and objects players can interact with.
  • A goal is a predicate of the world state. In XLand, admissible goals can be written as a logical formula of a set of atomic predicates such as “Yellow sphere on green floor”.
  • A game is defined as a tuple of n goals, one for each of n players.
  • A task consists of a world, a game, and a set of n-1 agents with fixed policies, controlling one other player each.

Sufficient generalization performance
Open-ended learning could eventually be used to train agents that can perform robustly in a wide distribution of real-world applications that share an underlying structure, without the need for a lot of finetuning. One concrete example could be a control algorithm for a robotic arm that is able to perform a wide variety of manufacturing tasks, such that it can be directly integrated into existing assembly lines.

To obtain scaling laws for predicting when open-ended learning could produce such agents, we first need a measure of generalization performance. There are two important cases with a spectrum between them: (i) generalization to unseen environments and tasks within the training distribution, and (ii) generalization to out-of-distribution environments and tasks.

  • Within-distribution generalization: A straightforward estimate for within-distribution generalization is to sample a large set of test tasks from the training distribution and measure a normalized score[5] for each task. Then, sufficient generalization is reached when an agent’s normalized score is above a certain threshold for (almost) all test tasks[6]. Focusing on the threshold rather than average performance makes sense, as we want to measure whether the agent is generally capable of solving a wide range of tasks, and a high average performance could be the result of high scores in specific tasks compensating for failures to generalize to other tasks. Lastly, it is important to ensure that there are sufficiently many test tasks with the desired level of complexity, even if simpler tasks are included in the training distribution for curriculum learning.
  • Out-of-distribution generalization: Measuring out-of-distribution (OOD) generalization is much harder, and the focus should probably be on OOD tasks that are somewhat close to the training distribution, or have concrete relevance (as sequence classification has for language models)[7]. The former requires a meaningful measure of distance between tasks, which is difficult and could deserve a project on its own. Regarding the latter, we need arguments for both the relevance of the task, and that it is in fact OOD, but these arguments can be of a more qualitative nature. As a concrete example, the training environments could be limited to a subset of colors and objects, with new colors and objects introduced in the test tasks. In this case, the test tasks are obviously OOD, and are relevant if the downstream application requires the ability to handle novel objects. Similarly, the set of possible games can be limited during training, such that completely novel games can be introduced for testing. Once the test tasks are created, the measure for sufficient generalization performance can be constructed as in the within-distribution case.
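
The threshold-based criterion described above can be sketched in a few lines. This is only an illustration under our own assumptions: the function names, the default threshold of 0.5, the required fraction of 0.95, and the 10% lower tail are hypothetical placeholders, not values from the XLand paper.

```python
import numpy as np

def normalized_scores(raw, random_baseline, optimal_baseline):
    """Rescale per-task returns so that 0 matches a random policy and 1 an
    (approximately) optimal policy on that task."""
    raw = np.asarray(raw, dtype=float)
    lo = np.asarray(random_baseline, dtype=float)
    hi = np.asarray(optimal_baseline, dtype=float)
    return (raw - lo) / (hi - lo)

def sufficient_generalization(scores, threshold=0.5, min_fraction=0.95):
    """True if at least `min_fraction` of test tasks reach `threshold`.
    Both defaults are hypothetical choices."""
    scores = np.asarray(scores, dtype=float)
    return bool(np.mean(scores >= threshold) >= min_fraction)

def lower_tail_mean(scores, percentile=10.0):
    """Continuous alternative: mean normalized score over the lowest
    `percentile` percent of test tasks."""
    scores = np.sort(np.asarray(scores, dtype=float))
    k = max(1, int(np.ceil(len(scores) * percentile / 100.0)))
    return float(scores[:k].mean())
```

The same functions apply to the out-of-distribution case once a set of OOD test tasks has been constructed; only the task set changes, not the measure.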

Note that the specific characteristics of the scaling laws can heavily depend on the chosen performance measure. Hence, we recommend trying out different “natural” definitions of sufficient performance, and analyzing whether this choice has a relevant impact on how compute requirements scale with complexity.

Problem complexity

The aim of open-ended learning is to find a policy in policy space that performs robustly on a wide variety of tasks. Naively, the larger the policy space, the harder it might be in general to find a good policy and consequently the more data and compute resources might be needed. To provide a rough intuition, we can look at the stylized case of tabular RL with a deterministic policy. As the agent receives both the state and game description as inputs, the policy needs to define a specific action for each combination of state and game. This naively results in the following number of possible policies[8]:

$$|\mathcal{A}|^{|\mathcal{S}| \cdot |\mathcal{G}|}$$

with $\mathcal{A}$ being the action space, $\mathcal{S}$ the state space, and $\mathcal{G}$ the game space. Hence, we see that the magnitude of policy space (in this simplified case) scales exponentially with the magnitude of the state space and game space, and polynomially with the magnitude of the action space. More realistically, a parameterized policy that can take advantage of similarities within states and games and is compatible with continuous spaces would be used. This suggests that the exponential scaling above is a severe overestimate, and the training signal might quickly rule out a large majority of policies. On the other hand, DeepMind's agents seem to use memory, which greatly inflates the number of possible policies. Still, it appears useful to focus on three main dimensions of problem complexity: (1) the magnitude of the state space, (2) the magnitude of the game space and (3) the magnitude of the action space. In the multi-agent case, we have an extra fourth component: the number and policies of the other players in the environment. Let us briefly discuss the relevant aspects of each dimension in XLand:

  • State space: The state space is determined by the area of the world the agents are playing in, its topological structure and the objects within the world. Hence, to vary the magnitude of the state space, one can make the area of the world bigger or smaller[9], or vary the allowed topological structures (e.g. the number of possible elevation levels) and the number of objects in the world. In addition, new types of objects could be introduced.
  • Game space: As the goals in XLand are defined as a combination of predicates (e.g., hold an object, see a player, be in a specific area, etc.) the game space size can be varied by excluding or introducing types of predicates or changing how many predicates can be combined to define goals.
  • Action space: The action space is defined by the degrees of freedom in movement of the agents and by the number of available non-movement actions (e.g., shooting or picking up items). Varying the degrees of freedom for movement is probably difficult in XLand: Taking away degrees of freedom could prevent the agent from exploring the environment sufficiently, while introducing new degrees of freedom (e.g., jumping) could render some tasks trivial. The number of non-movement actions can likely be varied more easily. For example, agents could receive or lose the ability to throw objects for a short distance or shoot other players.
  • Other agents: The problem complexity is also influenced by the number of players in the task and by the policies that they are using. As the other agents’ policies are a function of the self-play algorithm during training, it is hard to deliberately vary them, unless self-play is replaced by a predetermined curriculum of agents. As such, varying the number of players appears more promising.
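
To get a feeling for how the stylized tabular count reacts to each dimension, here is a minimal sketch. It assumes the count is |A|^(|S|·|G|), i.e. one free action choice per state-game pair, and works in log10 because the raw number overflows quickly. The function name and the example sizes are hypothetical.

```python
import math

def log10_num_tabular_policies(n_actions, n_states, n_games):
    """log10 of |A|^(|S| * |G|), the number of deterministic tabular policies
    that pick one action for every (state, game) pair."""
    return n_states * n_games * math.log10(n_actions)

# Hypothetical sizes, chosen only to illustrate the scaling behaviour.
base = log10_num_tabular_policies(n_actions=10, n_states=1000, n_games=50)
more_states = log10_num_tabular_policies(n_actions=10, n_states=2000, n_games=50)
more_actions = log10_num_tabular_policies(n_actions=20, n_states=1000, n_games=50)

# Doubling the state space doubles the exponent (squares the count), while
# doubling the action space only adds |S| * |G| * log10(2) to the exponent.
print(base, more_states, more_actions)
```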

Amount of compute and data

As XLand is a simulated environment, there is no special cost associated with acquiring data and it can be incorporated in the general computational cost of the training. Hence, to track the amount of computational resources needed in training, we can estimate the number of floating point operations used during training, or more easily, track the GPU and CPU time for a fixed type of GPU and CPU. In addition, the network size or number of gradient updates and the number of training samples could be varied independently to disentangle the contribution of the simulator and neural network training[10] and better understand how things would change with more efficient simulators.

In summary, the aim of these experiments would be to find scaling laws for the amount of compute needed to reach a sufficient level of generalization performance as a function of problem complexity. As various aspects influence the complexity of the problem, a first approach is to investigate separate empirical scaling laws for each of the state-, action-, and game- dimensions. Next, the dependence of each of these laws’ form and parameters on the remaining aspects could be investigated[11]. Ideally, there would be a simple formula relating the tuple of dimensions to the amount of necessary compute, perhaps even involving a plausible aggregate measure of complexity[12] as an intermediate step.
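
As a rough illustration of what fitting one of these per-dimension laws could look like, the sketch below fits a power law in log-log space. The power-law form is purely an assumption made for the example; the true relationship could look quite different (for instance exponential), and the data here is synthetic.

```python
import numpy as np

def fit_power_law(complexity, compute):
    """Least-squares fit of compute ≈ a * complexity**b in log-log space.
    Returns (a, b). The power-law form is an assumption, not a result."""
    x = np.log(np.asarray(complexity, dtype=float))
    y = np.log(np.asarray(compute, dtype=float))
    b, log_a = np.polyfit(x, y, 1)  # slope first, then intercept
    return float(np.exp(log_a)), float(b)

# Synthetic example: compute needed to reach the performance threshold,
# generated from a known law so the fit can be checked.
complexity = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
compute = 3.0 * complexity ** 2
a, b = fit_power_law(complexity, compute)
print(a, b)
```

In a real experiment, each data point would be the compute at which an agent first satisfies the chosen generalization criterion for a given complexity setting, and competing functional forms should be compared on held-out settings.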

Experiment proposals using fewer resources

So far, we discussed experiments using the algorithm and the environments proposed in the XLand paper. However, this is likely not feasible for most people reading this. Training agents in these environments is computationally very expensive and requires a large amount of resources[13]. In addition, DeepMind did not release the XLand environment publicly. To make this proposal more accessible for a wide range of researchers, we discuss ideas for experiments that are easier to implement and perform, but could nevertheless yield insights about scaling laws for generalization.

The main bottleneck for implementing the approach proposed by DeepMind is the complexity of the environments and the resulting amount of resources required to train agents. Therefore, the most straightforward way to reduce the complexity of the experiments is to consider simpler environments. There is a fundamental trade-off here: we want to consider environments that are simple enough to implement and train agents on, but complex enough to extrapolate insights to XLand and real-world applications.

Another question is how to design the training. It is likely that many components of the algorithm proposed by DeepMind[14] are necessary for allowing the method to work in XLand, but are not crucial aspects of the method, and might look very different in future systems. Therefore, we propose to focus on the key components of V-MPO and (perhaps) population-based training, combined with simple network architectures[15].

One concrete type of environment that might have our desired properties is a 2-dimensional environment with point-mass dynamics, similar to OpenAI’s particle environments. While simpler than XLand, chiefly because of the missing third dimension, these environments allow for multi-agent interactions and the placement of objects, which allows for the definition of similar primitives as in the XLand paper, e.g., “agent one is close to the yellow square”. These primitives can then be combined to get a similar range of tasks as in XLand, presumably leading to interesting behavior due to the interaction between various agents with different goals and environment objects.
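
To make the idea of combinable primitives concrete, here is a minimal sketch of goal predicates over a 2D state. Everything in it is a hypothetical illustration: the state representation (a dictionary of entity positions), the `near` primitive, its radius, and the entity names are our own assumptions and not taken from XLand or the particle environments.

```python
import math
from typing import Callable, Dict, Tuple

State = Dict[str, Tuple[float, float]]  # entity name -> (x, y) position
Predicate = Callable[[State], bool]

def near(a: str, b: str, radius: float = 1.0) -> Predicate:
    """Atomic primitive: entity `a` is within `radius` of entity `b`."""
    def check(state: State) -> bool:
        (ax, ay), (bx, by) = state[a], state[b]
        return math.hypot(ax - bx, ay - by) <= radius
    return check

def all_of(*preds: Predicate) -> Predicate:
    """Conjunction of predicates."""
    return lambda state: all(p(state) for p in preds)

def any_of(*preds: Predicate) -> Predicate:
    """Disjunction of predicates."""
    return lambda state: any(p(state) for p in preds)

# Example goal: (agent_1 near yellow_square AND agent_1 near agent_2)
#               OR agent_2 within 10 units of yellow_square
goal = any_of(
    all_of(near("agent_1", "yellow_square"), near("agent_1", "agent_2")),
    near("agent_2", "yellow_square", radius=10.0),
)
state = {"agent_1": (0.0, 0.0), "yellow_square": (0.5, 0.0), "agent_2": (5.0, 5.0)}
print(goal(state))
```

The size of the game space can then be controlled by which primitives are available and by how many of them may be combined in one goal.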

Now, we could use these environments to investigate scaling laws, by varying the different parameters of task complexity:

  • State space: Change the size of the environment, or the number of objects in the environment.
  • Action space: Give the agents different possible actions in addition to movement, e.g., picking up objects or interacting with each other.
  • Game space: Change the number of primitives that can be used to define goals.
  • Other agents: Change the number of players[16].
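
One possible way to organize such a sweep is sketched below. The configuration fields and their values are hypothetical placeholders for the four dimensions listed above. The sketch contrasts varying one dimension at a time around a base configuration with covering the full grid, which grows multiplicatively with the number of values per dimension.

```python
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class ComplexityConfig:
    # All fields are hypothetical knobs for a 2D point-mass environment.
    arena_size: float = 4.0   # state space
    n_objects: int = 2        # state space
    n_extra_actions: int = 0  # action space (beyond movement)
    n_primitives: int = 3     # game space
    n_players: int = 2        # other agents

def one_at_a_time(base, axes):
    """Vary each dimension separately while keeping the others at `base`."""
    return [replace(base, **{name: v}) for name, values in axes.items() for v in values]

def full_grid(base, axes):
    """All combinations of the listed values."""
    names = list(axes)
    combos = product(*(axes[n] for n in names))
    return [replace(base, **dict(zip(names, combo))) for combo in combos]

axes = {"n_objects": [1, 2, 4, 8], "n_primitives": [2, 4, 8], "n_players": [1, 2, 3]}
base = ComplexityConfig()
print(len(one_at_a_time(base, axes)), len(full_grid(base, axes)))
```

Starting with the one-at-a-time sweep keeps the number of training runs manageable; the full grid is only needed to study how the dimensions interact.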

By measuring the generalization performance of the resulting agents as a function of these complexity parameters, we could gain a first indication whether scaling laws of the form that we discussed exist at all, and what form they might have. If the results of these small-scale experiments indicate that such scaling laws might exist, this could indicate that investigating them in more complex environments, like XLand, should have higher priority. Unfortunately it is unclear to what extent the results should be expected to generalize from these very simple environments to bigger experiments, so it would be crucial to continually investigate these scaling laws in bigger systems.


We proposed two different ways to investigate scaling laws of generalization performance of RL agents: (1) large-scale experiments in XLand, and (2) smaller experiments in point-mass environments. We believe that both would be valuable to better understand the amount of compute that might be needed to train general agents in more complex domains that are closer to economic relevance than XLand. Ultimately, it seems that (1) would be critical to really inform AI timeline estimates. However, (2) could provide important evidence on whether (1) would be useful.

All of these approaches come with a large amount of uncertainty. It is unclear if clean scaling laws exist at all for generalization performance in RL agents. If such laws existed, it would still remain unclear how useful they are given the large number of independent dimensions of problem complexity, compared to the case of large language models. Would we need to treat these dimensions completely separately, or could scaling laws for the different dimensions be unified in a single (simple) formula? Finally, even if such unified scaling laws existed, real-world applications might not be easily modeled as more complex versions of XLand, and scaling might work differently for tasks that are closer to such applications.

Still, empirical scaling laws about RL generalization could be highly valuable in informing AI timeline estimates. Therefore, we think that such scaling laws could be a very impactful research direction, and we hope that some of our readers consider investigating some of the directions we laid out.

We would like to thank Robert Kirk, Tamay Besiroglu and Lukas Finnveden for feedback, and the Zurich AI Alignment reading group for a lively discussion that inspired this post. All mistakes are our own.

  1. As an example, it seems quite unlikely that a system trained on English language only is going to be very useful for working with German text. ↩︎

  2. The only work in this direction we found relates AlphaZero’s performance in the board game Hex to the size of the game’s board. ↩︎

  3. Some complex learning problems might be a lot cheaper to obtain training data for than others, and it seems likely that problems involving interactions with humans are most relevant economically, but also on the expensive side. ↩︎

  4. While we might ultimately want a three dimensional scaling law relating generalization performance to the amount of compute and to the problem complexity, we focus on a fixed performance measure. This is because performance is measured on a somewhat arbitrary scale, which could yield more complicated-looking scaling relationships and distract from the compute-complexity relationship. However, the performance history from training to a fixed given performance could be reused to investigate the three dimensional scaling law. ↩︎

  5. Ideally, we would normalize the scores to the interval spanned by the performance of a random policy and the performance of an optimal policy for each task. In the XLand paper, the authors approximate the optimal policy using the best performing mixture of previously trained agents. ↩︎

  6. If a continuous performance measure is required, the average normalized score within the lowest percentiles could be used. ↩︎

  7. Depending on how exactly “out-of-distribution” is interpreted, sufficient generalization might not be attainable in this case, even for simpler environments. If that was true, attempts to understand scaling laws should probably focus on the within-distribution case until more progress has been made on OOD generalization. ↩︎

  8. With the game modelled as part of the state space on the agent's side. ↩︎

  9. Similar to how the board size of the game Hex has been varied in this paper. ↩︎

  10. In particular, some changes such as increasing the size of state space would likely increase the amount of compute needed for the simulator. ↩︎

  11. This can quickly lead to a lot of needed experiments for covering the whole grid of the state-, action-, and game- dimensions. A possible solution for this would be to find a single aggregate measure of problem complexity, which can be validated on computationally cheap environments and then used for the expensive environments where it is not feasible to cover the whole grid. ↩︎

  12. Such as the double exponential from earlier, or a related formula that additionally accounts for function approximation and multi-agent interaction. ↩︎

  13. If DeepMind had used a large fraction of their available compute resources on XLand, rerunning the code with a large number of variations might not be feasible, even for them. ↩︎

  14. For more details on the training process, see this post. ↩︎

  15. For example, a small CNN could be used for memoryless policies, while a CNN could be combined with an LSTM for policies with memory. ↩︎

  16. If no self-play is used, the range of policies used by the other agents can also be adapted. ↩︎


I was thinking: "what scaling laws are we missing?" We have scaling laws relating compute/n/parameters, data noise and compute/n, RL performance, RL training vs runtime search compute, scaling performance in various modalities like audio comprehension, and so on. But as I keep pointing out, we completely lack any scaling laws for the important (dangerous) downstream capabilities, phase transitions/capability spikes, or relating diversity of data to induced generalization (ie. meta-learning, when generalizing over rich complex data like human-written language etc).

One pre-law phenomenon is that with language models, some (extremely small) datasets are much more equal than others in perhaps not changing perplexity all that much but in making the model 'generalize better than it has any right to' (as one person described InstructGPT as compared to its base GPT-3 model)*. Presumably those capabilities were latent in the model (because the datasets can be so small they can't be teaching the model much 'from scratch') and near the 'surface' but can be bubbled up with the right prodding - which would take a ton of regular untargeted raw Internet data dumps to do because the hard instances are so rare & drowned out by easier gains which reduce perplexity but by doing uninteresting things like learning more real-world verbal catchphrases & facts. I also see this in the blessings of scale: more diverse problems are good, not bad, for stabler & more generalizable learning, so you avoid the "we trained on only 500 levels of Procgen and it overfit and we needed to train on, it turned out, at least 1000 levels" trial-and-error process of discovery.

So a good scaling law would be: "what is the exchange rate between diversity of problems and sample size of problems?"

Presumably there is some sort of diminishing returns where training on a single problem can take an absurd amount of data, whereas using the same number of datapoints/compute/parameters on millions of problems would be able to finetune quickly. (This is more or less our justification for meta-learning/Bayesian RL/pretraining.) This would be similar to the train vs plan tradeoff in Jones's RL scaling work: you can make up for little training by doing a lot of planning, or vice versa, but the exchange rate means that something intermediate is optimal for a good game-playing agent - spend most of your compute at train time to train a great baseline expert, and then do a relatively shallow small search at play time to fix up any errors for that specific game.

This raises the question: "how do you define 'diversity' of problems, anyway?" It's hard to talk about 'diversity' of data when we can't even explain in what sense dumping in a bunch of NLP datasets is 'more diverse' than adding in some more random web pages from Common Crawl.

OP's proposal using environments like XLand which have knobs you can twiddle for domain randomization is a step towards that, but that is expensive, hard to get working, RL (which we always want to avoid if at all possible), and not too objective or quantifiable (some knobs are more important than others, not every knob setting is the same, and so on). The RL setting here probably isn't necessary because we see similar things in the supervised language model setting, so we probably can avoid it and maybe eliminate knobs and make it more objective.

A better tack might be to use smaller synthetic data for this question. Rosenfeld does a lot of investigation of how well NNs solve problems & converge (apparently, not very well compared to what is possible!) by using the old trick of lots of small randomly-initialized NNs as generative models, which you know can be solved by a NN because they were constructed by a NN, and probably reflect real-world problems because NNs fit real-world problems so well. See also Jacob Cannell on interpreting NNs as length-biased Turing machine search, and what could be more objective or rigorous than Turing machines?

So to do a diversity vs sample size scaling law in generalization, perhaps one could do something like this: generate m short Turing machines (either as random NNs or literally Turing machines in Brainfuck etc) which produce output; pretrain sequence-prediction Transformers on 1..n sequences, varying from 1..m how many TMs the sequences are drawn from; and then finetune on downstream tasks like source code or Common Crawl. The downstream performance should vary by number of Turing machines trained on, and one should be able to fit a scaling law to how much larger n has to be to equalize diversity = m vs m-1 pretraining datasets.
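
The dataset-construction step of such an experiment could be sketched as follows. This is only a toy stand-in: tiny random finite-state generators replace the short Turing machines, and all names and sizes are hypothetical. The point is just the two knobs, n (number of sequences) and m (number of distinct generators the sequences are drawn from).

```python
import random

def make_generator(seed, n_states=4, alphabet="01"):
    """A tiny random finite-state generator, used here as a simple stand-in
    for a short Turing machine. Each state maps to (next_state, symbol)."""
    rng = random.Random(seed)
    table = [(rng.randrange(n_states), rng.choice(alphabet)) for _ in range(n_states)]

    def generate(length, start=0):
        state, out = start, []
        for _ in range(length):
            state, symbol = table[state]
            out.append(symbol)
        return "".join(out)

    return generate

def build_pretraining_set(n_sequences, m_generators, length=16, n_states=4, seed=0):
    """Draw n sequences, spread round-robin over m distinct generators.
    Returns a list of (generator_index, sequence) pairs."""
    rng = random.Random(seed)
    generators = [make_generator(rng.randrange(10**9), n_states) for _ in range(m_generators)]
    data = []
    for i in range(n_sequences):
        source = i % m_generators
        start = rng.randrange(n_states)
        data.append((source, generators[source](length, start)))
    return data

low_diversity = build_pretraining_set(n_sequences=12, m_generators=1)
high_diversity = build_pretraining_set(n_sequences=12, m_generators=3)
```

Pretraining on such sets while sweeping n and m, and then measuring downstream performance, would give the data points for the diversity versus sample size exchange rate.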

If that pans out, one could move on to trying to measure diversity of real datasets. Something like the lottery ticket hypothesis might be interesting here: a dataset is more diverse the more different subnetworks fit different subsets of the data, implying it's the composition of a lot of different small Turing machines. If a single lottery ticket fits most of the data, then it's not diverse, while if there are lots of different subsets with minimally overlapping subnetworks... This would be a kind of finding coresets, and you could pull out just a few examples of each subset to do the diversity pretraining.

* Examples:

Temporal Scaling Laws

How do agents learn with horizon length, or time of tasks?

Another missing scaling law is scaling with episode length in steps or with time. This has not been examined at all AFAIK (aside from the somewhat-related Transformer language model analyses looking at how well they make use of the context window---very poorly towards the end). The RL-related scaling work tends to assume a fixed problem with a 'length' either hardwired by the environment (Ms. Pacman, most ALE environments) or driven by the other player's strength (Hex, Go, chess), and asks how performance scales with number of episodes. This doesn't answer the question of how hard it is to learn longer versions of those games, which is either constant in all datapoints or is conflated with other things (a stronger opponent may lead to longer games, and make learning more expensive to maintain the power/log-curve, but is that due to the longer game or other aspects of a stronger opponent?).

The simple answer is that the longer the horizon, the harder credit assignment and the RL problem must be because there are more states and possibilities spread over the same reward, and bounds on agents like a tabular learning agent often include some nasty term like T² or L² (to pick the first one I found in my notes, O'Donoghue 2018's regret bound has an L^(3/2) with only a √T for total sample-size/elapsed-timesteps). So we might naively argue that longer tasks are strictly harder than shorter tasks, and any scaling on a task implicitly bounds the longer versions too and any 'broader' task. This would have a lot of implications for AI capabilities and safety: we can focus on measuring short-term task performance, and we could know that any merely human-level AGI (as measured in the short-term) would 'fall apart' as its long-term performance would be much worse and its errors accumulate and it flailed around, and that whatever level of long-term performance is necessary to have major global impact will be preceded by much better short-term performance as a warning sign.

But 'regret' is not necessarily the right quantity: regret is basically the loss of your strategy compared to that of the ideal strategy, and says little about how well the ideal strategy copes with different timescales. Since the target regret is 0 (you match the ideal strategy), you can only do worse with longer durations since those are more time-steps on which to fall short of the ideal strategy and incur regret; you can never make up your past regrets, only cry about them over wine. As far as capabilities go, if even the ideal strategy scales poorly with increasing temporal durations, losing lots of reward, then obviously learning it won't do any better.

And we can immediately think of reasons why reward might not worsen monotonically with increasing duration. Moravec's paradox immediately comes to mind with its inversions where the broader task can be easier than a subtask: a chess AI can kick our ass at planning out scores of plies deep to the end of the game (and endgame databases are omniscient), but while it can foresee your inevitable checkmate an hour from now, a robotics AI would struggle to move the pawn reliably within 1s without knocking other pieces over. We can also easily imagine an 'AI CEO' which is an overgrown language model receiving and sending emails all day, skillfully imitating the rhetoric and strategy of great CEOs of the past and maneuvering its company through the shoals of years to come, even as it would be unable to fetch coffee for a visitor in the next minute or update a spreadsheet---something it must delegate to its personal assistant or its MBA intern. (Many people already believe CEOs often do little to justify their salary and could be replaced by something rather simpler than language models...) This is not necessarily because 'being a CEO' is intrinsically easier, in some sense, than 'fetching coffee', but may simply reflect that the data is much more abundant, and much more compact and preprocessed in linguistic form, for one task and not the other, and so in practice (but not theory) we got good AIs for one before the other. (There are probably billions of words dissecting CEO decisions, running case studies on past CEOs, CEOs giving speeches or investor calls etc, while any words used to describe fetching coffee omit the important low-level physics because we just don't know what our bodies do or compute when they fetch coffee.) It may also reflect the illusory nature of the 'state' or 'actions'. 
Is 'being a CEO' an extremely high-dimensional problem with billions of angles and variables to consider, all important, ticking along millisecond by millisecond or does it boil down to some more abstract problem much more analogous to Rock-Paper-Scissors played once a year (except if you lose a throw, you lose a billion dollars)?

There's also an observation worth making that sometimes long-term tasks can be easier than short-term versions of the same task because they offer more scope for control, and correction and recovery from errors. Consider games like Go or chess: an (over the table or correspondence) chess grandmaster might lose a game of bullet blitz chess that they would win, or a Go master might be able to anaconda-style squeeze you to death given unlimited moves whereas if there were an arbitrary time limit like 70 moves, they would have to engage in tactically reckless invasions that can backfire on them with huge loss of territory; these are games where the 'long' form would be much easier to maximize reward on than the short form, because there's more time to think and to accumulate small advantages. Or consider inner monologue in a GPT-like model for solving math or programming problems: an inner monologue solution will be much longer than the default model behavior of 'just print out the (wrong) answer', and so if there really were a strict temporal decrease in capability, one would have proven a priori that the inner-monologue solutions must be worse; this is not the case, however: the longer show-your-work/think-aloud completions can be much better than the shortest completion, and this is in part because it gives the model scope for breaking down hard problems to sub-problems, or even going back and correcting itself in some examples. Other monologues can take the form of guess-and-check, and of course you have instances like AlphaCode of brute-forcing solutions by generating lots of candidate solutions to test; all of these use up more time, and involve more temporal steps, in some sense, and get you capability gains.
More broadly, an AI CEO can solve its long-term problems like 'hiring' even if it is not good at solving every individual short-term instance of 'should I hire this candidate or not', because it can observe a failed hire and fire them later on, and hire a new better candidate to fix the errors. Or it could just spend more upfront to make hires more likely to work out, paying for it compared to a human CEO out of all its manifold AI advantages (like working for potentially as little as the cost of its electricity and never needing to sleep). If an expert chess player is plopped down in a random game position which doesn't suit his style, he may not do very well (he is 'off-policy', one might say); if he can play from the start, then he can take his favorite move and bring to bear his full skill while 'on-policy'. Or a race-car driver might not have the mechanical reflexes to recover from driving straight into the wall at 200kph, but he doesn't need to because he just doesn't do that in the first place. There are many forking paths indeed. (AIs not being good enough to control errors and them compounding to send an episode into unobserved states is a major failing of imitation-learning approaches.)

And then there's the exploration problem: how much of the difficulty of long-term tasks for AIs is not that they are 'intrinsically' hard, but that they are hard to explore, and we are pretty bad at exploration right now? An RL toy problem is n-chains, where one must hop along a line to move n steps away before getting a reward. With the standard ε-greedy exploration approach, to observe a single reward, an agent would need to randomly move n times in a row toward the reward, with possibly only ε probability each time, which would take a lot of episodes; but if it had offline data from an expert, it could probably solve the task without a single episode, because it's not that hard to model or plan over.
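
To put rough numbers on the n-chain point, here is a back-of-the-envelope sketch. It assumes (my simplification, not something specified above) that an episode lasts exactly n steps, so every step must go toward the reward, and that each exploratory step does so independently with a fixed probability `p_forward`. The function names are hypothetical.

```python
import random

def expected_episodes_to_first_reward(n, p_forward):
    """Expected number of episodes until one reaches the end of an n-chain.

    Assumes each episode has exactly n steps, so all n must go forward,
    and each step goes forward independently with probability p_forward.
    """
    p_success = p_forward ** n   # all n steps must move toward the reward
    return 1.0 / p_success       # mean of a geometric distribution

def simulate_episodes_to_first_reward(n, p_forward, rng):
    """Monte Carlo draw of the same quantity for a single run."""
    episodes = 0
    while True:
        episodes += 1
        if all(rng.random() < p_forward for _ in range(n)):
            return episodes

if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, expected_episodes_to_first_reward(n, 0.5))
    rng = random.Random(0)
    runs = [simulate_episodes_to_first_reward(5, 0.5, rng) for _ in range(2000)]
    print(sum(runs) / len(runs))  # sample mean; the analytic value is 32
```

Under these assumptions the cost is exponential in n, which is the sense in which the chain is hard to explore but trivial to plan over once a single successful trajectory has been seen.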

Finally, we may just be surprised by how agent competencies evolve regardless of how true any of these considerations are. Consider OpenAI Five (or AlphaStar). Who would have expected that a simple LSTM RNN, whose time horizon for training is something like 1 minute, could learn what looks like remarkable planning and foresight? (Such as learning the Roshan gambit, which requires the entire team to fight the most powerful monster in the game, for a while, taking heavy damage while neglecting the enemy team & wild monsters, to get a single item, a 1UP, which will be used once on one team member maybe 20 or 30 minutes later in the middle of an all-out fight and only then pay back the early investment in fighting Roshan.)

Far from being a tight constraint in which one can place great forecasting faith and a useful safety mechanism, I'm unsure there's even a meaningful correlation between how "long-term" a task is and how badly we can expect AIs to do. So, it's not an easy question to answer even qualitatively. (And obviously, the possibility of very different capabilities at different time-scales is important to characterize for AI safety for precisely the same reasons that showing that short-term capabilities >> long-term capabilities would be important.)

Quantitatively, we're not sure where to begin. First, what should our 'temporal scaling laws' be in? (We have average reward, episode length, agent horizon length, number of episodes, number of steps in episode, compute, parameter-count, agent family of model-free (value vs policy?) vs model-based... To name just the obvious variables.) Do we expect the exponents to be the same and simply offset by a constant (which seems to be how most of the scaling laws wind up looking---different arch? different constant. added noise? additional constant. cross-modality transfer? lower constant, etc.)?

Setting up an empirical testbed is also hard. You want an environment which cleanly varies the length of time dynamics to stress long-term credit assignment without changing anything else.

  • Most ALE games are obviously unsuitable. An arcade game like Ms. Pacman or Breakout! typically has 'resets' which make it equivalent to a lot of short games. (If you rewrite Tic-Tac-Toe as playing 1,000 games in a row and only then receiving a reward in the range 0--1,000, I do not expect this to be much harder than a single game of Tic-Tac-Toe with 0/1 reward.) Feedforward agents with only some frame-stacking for 'history' do well.

  • Most of the MuJoCo environments like walking robots have a similar problem: little or no long-term dynamics, and certainly nothing with a convenient T knob on it. Similarly, feedforward works well, lots of history adds little or is a drawback.

  • DMLab-30 I am less familiar with, but seems similar (stuff like finding an apple in a maze).

  • 2-player games like Go or chess typically do not have any natural time limits: with chess, counting material after n moves is probably stupid (up to t=30 moves, all agents would learn is to huddle in a fortress and achieve a 0-0 tie), although with Go, counting territory after n moves might work and would be easy to get started with given all the existing Go agents. (Go agents like KataGo can be trained with very little compute at this point, particularly with curriculum learning on board size---although that too is an extraneous factor which must be carefully considered.)

    • Hidden-information 2-player games like poker might work, but the adversarial part obscures the use of history. Poker agents live or die based on how well they adapt to the immediate Nash equilibrium with the current opponent, not how well they reason about the deck based on cards revealed thus far.
  • Nethack: some FB researchers have been pushing Nethack as a testbed, and it has some potential here. DRL agents do very badly on it, so there is a lot of headroom.

    Nethack has many levels, which are not independent; levels are only partially observable and have a lot of persistent state (which does not reset), rewarding long-term memory. Nethack players typically need to cache items on levels they have cleared and occasionally revisit to stock up; to 'ascend' and formally beat the game requires a lot of planning to kill the relevant enemies spread throughout the game and obtain the magical items which give one access to the final level. There is also the vital matter of building a strong character with appropriate curses/blessings/powers like being able to see in the dark or having complementary weapons/armor, and different character archetypes have different strength profiles over the game (in typical D&D fashion, wizards are weak and warriors strong early on, and then wizards gradually become OP while warriors stagnate). So there is a sense in which descent through the dungeon = progress, and longer games = better players, and actions early on can gimp one much later.

    But Nethack has drawbacks: it is unnecessarily complex and more of an exploration problem than a temporal reasoning problem (how is a DRL agent supposed to ever learn about tricks like 'writing the word "Elbereth" on the ground'? even humans never discover that one by themselves!), especially when we look at the hand-scripted 'symbolic' agents which do things like plug in Sokoban solvers solely for the mini-game; the 'score' of Nethack appears to be a poor proxy for solving the game and leads agents to greedily maximize grinding the early levels; it is probably more compute-expensive than a streamlined rogue-like designed for GPU/TPU acceleration would be; it has rather complicated tooling/UI (related to never being intended as a research tool); and it does not have any clear T knob. The MiniHack games might be better than full Nethack.

  • Minecraft is a hard testbed for long-term planning, hierarchical actions, and imitation learning, but most of the hierarchies are relatively shallow and don't come with any built-in way to turn a T knob; one could add on a simple generative system to generate random DAGs and associate some voxel art with them to create extended versions of Minecraft?

  • This would be similar to Alchemy. Alchemy, as presented, seems to fix the hardness of the problem in terms of potions/stones, but if one varied that and increased the number of permitted trials to keep each instance soluble, perhaps that would be equivalent? Or would the inferential problem overwhelm any effect from reasoning over the history of experiments (similar to poker)?

None of these jump out.

In supervised learning, the obvious thing to do would be something like set up randomized n-grams as a Markov chain model to generate synthetic data and examine how predictive performance degrades as a function of n. ("Learning Curve Theory", Hutter 2021, does something a bit like this, I think.) In an RL setting, could we set up an n-chain or Block World environment with simple programs explicitly allocating reward over time? Such as a reward function expressed as a big matrix of Time vs States, or possibly a (very) small randomly-initialized NN (the way we use randomized NNs as ground-truth generative models in analyses like Rosenfeld 2021 to measure performance of NN classifiers)?
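
A minimal sketch of the supervised version: sample a random order-n Markov chain over a small alphabet and generate synthetic sequences from it; one would then train predictors on such data and plot loss against n. The uniform-random transition weights, the alphabet, and all function names are illustrative assumptions on my part, and this is not the setup used in Hutter 2021.

```python
import itertools
import random

def random_markov_chain(order, alphabet, rng):
    """Map every length-`order` context to a random next-symbol distribution."""
    table = {}
    for context in itertools.product(alphabet, repeat=order):
        weights = [rng.random() for _ in alphabet]
        total = sum(weights)
        table[context] = [w / total for w in weights]
    return table

def sample_sequence(table, order, alphabet, length, rng):
    """Sample `length` symbols, seeding the chain with a uniform-random context."""
    seq = [rng.choice(alphabet) for _ in range(order)]
    while len(seq) < length:
        # Slice from an explicit index so that order == 0 yields the empty context.
        context = tuple(seq[len(seq) - order:])
        seq.append(rng.choices(alphabet, weights=table[context])[0])
    return seq[:length]

if __name__ == "__main__":
    rng = random.Random(0)
    alphabet = ["a", "b", "c"]
    for n in (1, 2, 3):
        chain = random_markov_chain(n, alphabet, rng)
        print(n, len(chain), "".join(sample_sequence(chain, n, alphabet, 40, rng)))
```

Note that the table has `len(alphabet) ** order` contexts, so in this naive form the knob n can only be turned a few notches before the generator itself becomes the bottleneck.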

For example, perhaps the shortest reward-function program gives 1 reward for moving left (or right) at the current T; then the next length does it only if one moves left twice over 2 time-steps; then the next length might require 2 lefts then 1 right; and so on. (In a Block World environment, the various walls and blocks would obstruct movement, requiring some degree of planning and environment manipulation. Or perhaps these can be plugged into MiniHack?) For temporal scaling, one can then use the intrinsic time length of these reward-function programs as the T knob to tweak, run hyperparameter sweeps to find agents which hit the known ceiling (we can easily figure out the optimal sequence for each reward-function and estimate optimal reward as just doing it in cycles until the end of the episode) as fast as possible, averaged across a sample of randomized rewards & Block Worlds for each T (increasing T by an OOM or so each time since we expect some sort of log/power-law), and derive a relationship of 'Time' → '# of episodes (for optimally-scaled agent)'. To tie it back to the diversity proposal, in this case, our little TMs are sampling TMs over time and are 'deep', whereas before they were TMs over the other dimensions of the environment, space, and are 'wide'. In the time-case, we would leave the environment alone as much as possible and pack all the added difficulty into the time dimension/complexity of action sequences. This is a lot of runs, but it would all fit on-GPU and can be done with very small FC NNs conditioned on the history, so like Jones's Hex work, should be quite feasible.
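
A rough sketch of such a T knob, under several assumptions of mine: the 'reward-function program' is reduced to a fixed pattern of left/right moves of length T, the agent is paid 1 each time it performs the pattern from start to finish, a wrong move simply restarts the attempt (this is not full substring matching), and the Block World obstacles are left out entirely. All names are hypothetical.

```python
import random

LEFT, RIGHT = 0, 1

class PatternRewardEnv:
    """Pays 1 reward whenever the agent completes a hidden action pattern."""

    def __init__(self, pattern, episode_length):
        self.pattern = list(pattern)      # its length plays the role of T
        self.episode_length = episode_length
        self.reset()

    def reset(self):
        self.t = 0
        self.progress = 0                 # number of pattern moves matched so far
        return self.t

    def step(self, action):
        self.t += 1
        reward = 0
        if action == self.pattern[self.progress]:
            self.progress += 1
            if self.progress == len(self.pattern):
                reward, self.progress = 1, 0   # pattern completed; start over
        else:
            # Simplified restart rule: the wrong move may begin a new attempt.
            self.progress = 1 if action == self.pattern[0] else 0
        done = self.t >= self.episode_length
        # The observation is only the time-step; the agent must track history.
        return self.t, reward, done

def random_pattern(T, rng):
    """Sample a random reward pattern of length T."""
    return [rng.choice((LEFT, RIGHT)) for _ in range(T)]

def optimal_return(T, episode_length):
    """Known ceiling: repeat the pattern in cycles until the episode ends."""
    return episode_length // T
```

A sweep would then sample many `random_pattern(T, rng)` instances per T, train agents until their return reaches `optimal_return(T, episode_length)`, and record the number of episodes needed as a function of T.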

You might find this work interesting, which takes some small steps in this direction. It studies the effect of horizon length inasmuch as it makes credit assignment harder, showing that the number of samples required is an affine function of horizon length in a toy context.

Understanding the scaling laws would improve our ability to make estimates, but would also provide information useful for increasing capabilities, by making it clearer how many resources should be invested in scaling vs. how much effort should be invested elsewhere. Obviously, someone is going to try this soon, but given that the capabilities of the safety community seem to be increasing exponentially, small bits of extra time seem to actually matter.

(If OpenAI turns out to be net-positive, I suspect that it'll largely be because it managed to capture the resources/reputation that were always going to accrue to whoever was the first party to explore scaling and use them to hire a safety team/increase the status of AI safety. If this project is worth pursuing, I suspect it will have been largely because it would also allow you to capture a significant amount of resources/reputation).

Most recent large safety projects seem to be focused on language models. So in case the evidence pointed towards problem complexity not mattering that much, I would expect the shift in prioritization towards more RL-safety research to outweigh the effect on capability improvements (especially for the small version of the project, about which larger actors might not care that much). I am also sceptical whether the capabilities of the safety community are in fact increasing exponentially.

I am also confused about the resources/reputation framing. To me this is a lot more about making better predictions of when we will get to transformative AI, and how this AI might work, such that we can use the available resources as efficiently as possible by prioritizing the right kind of work and hedging for different scenarios to an appropriate degree. This is particularly true for the scenario where complexity matters a lot (which I find overwhelmingly likely), in which too much focus on very short timelines might be somewhat costly (obviously none of these experiments can remotely rule out short timelines, but I do expect that they could attenuate how much people update on the XLand results).

Still, I do agree that it might make sense to publish any results on this somewhat cautiously.

I'm extremely reluctant to trade off any acceleration of AI for an increase in forecasting ability because:

a) I'm skeptical of how accurately AI can be forecasted
b) I'm skeptical of how much people's actions will change based on forecasts, especially since people are skeptical of how accurately it can be forecast

The biggest and most reliable updates from technical research will probably come from "We tried X and it was a lot easier than we thought it'd be" (i.e. GPT-3, AlphaGo), which involves accelerating capabilities. On the other hand, "we tried X and didn't make a lot of progress" is less persuasive, as maybe it'd have worked with a minor change.

"I am also confused about the resources/reputation framing. To me this is a lot more about making better predictions" - I know you're focused on improving predictions - I'm just explaining my personal opinion that resources/reputation are more likely to be worth the risk of accelerating timelines.

Your point b) seems like it should also make you somewhat sceptical of any of this accelerating AI capabilities, unless you believe that capabilities-focused actors would change their actions based on forecasts, while safety-focused actors wouldn't. Obviously, this is a matter of degree, and it could be the case that the same amount of action-changing by both actors still leads to worse outcomes.

I think that if OpenAI unveiled GPT4 and it did not perform noticeably better than GPT3 despite a lot more parameters, that would be a somewhat important update. And it seems like a similar kind of update could be produced by well-conducted research on scaling laws for complexity.

"I think that if OpenAI unveiled GPT4 and it did not perform noticeably better than GPT3 despite a lot more parameters, that would be a somewhat important update"

Yeah, that would be an important update. But not as much as if it had worked since there might be a slight tweak that would make it work.

In the rest of this post, we informally describe two versions of this project in more detail: An idealized version using XLand that actors with access to large amounts of compute could carry out, as well as a scaled-down and less resource-intensive version that is more realistic to pursue.

I wouldn't use or try to reproduce XLand for open research. Procgen or Obstacle Tower might make more sense. Griddy, Alchemy, and Meta-World also come to mind; or MiniHack is an interesting new choice as a kind of procedurally-generated Nethack (there's also the new Crafter, but it's unclear to me if it'd let you really mix up item types for curriculum/generation of rules rather than merely new levels).

As an analogy, imagine GPT-3 would have only been trained and evaluated on abstract language, or sentences beginning with a subject, using the vocabulary of basic English. How confident would you be in extrapolating from such a model’s scaling with compute to the less restricted language models we have at this point?

Pretty confident? I mean, you've read the scaling papers. You know the upshot is that scaling curves are remarkably universal cross-modality, cross-architecture, cross-dataset, and cross-noise levels: they all look like straight lines on log graphs. The differences are usually in the constants, not the exponents, so I would expect a subsetted GPT-3 to have a lower constant in prediction loss on its constrained dataset and be that much closer to its intrinsic entropy, but otherwise extrapolate its scaling laws of loss vs parameters/FLOPS/n the same as always.
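
To spell out the "straight lines on log graphs" point: a power law L(N) = c · N^(−α) becomes log L = log c − α · log N, so the constant and exponent can be read off a linear fit in log space. Here is a minimal sketch; the data points are synthetic and purely illustrative (not taken from any scaling paper), the irreducible-loss term is ignored, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**(-alpha) by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope   # (c, alpha)

# Synthetic example: two "datasets" sharing an exponent but not a constant.
sizes = [1e6, 1e7, 1e8, 1e9]
full = [5.0 * n ** -0.08 for n in sizes]
subset = [3.0 * n ** -0.08 for n in sizes]

c_full, alpha_full = fit_power_law(sizes, full)
c_sub, alpha_sub = fit_power_law(sizes, subset)
print(c_full, alpha_full)
print(c_sub, alpha_sub)
```

In this toy case the two curves are parallel lines on a log-log plot: same slope, different intercepts, which is the "different constant, same exponent" pattern described above.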

I think you might be trying to say that we would have more doubts about its performance on downstream tasks, such as benchmarks, and whether abilities like meta-learning would be induced? There, yes, we don't have good ideas as to how much diversity matters or even how to measure either diversity or downstream tasks. (One would expect there to be some sort of bias-variance-esque tradeoff where diversity is good but the data can't be too sparse as to be unlearnable: if pure diversity were the only goal, then you'd expect datacleaning would harm both perplexity & downstream performance, but we know that for language models, cleaning Internet data pays off. Beyond this, hard to say. We know that you probably want to broaden environments/data as you accumulate more and more data in any specific environment or task as they saturate, but...)

As an example, it seems quite unlikely that a system trained on English language only is going to be very useful for working with German text.

En->De: "Scaling Laws for Language Transfer Learning", Christina Kim (as expected from Hernandez et al 2021).

If DeepMind had used a large fraction of their available compute resources on XLand, rerunning the code with a large number of variations might not be feasible, even for them. ?

This seems like a very good research direction. I’m not too familiar with RL, so I probably won’t pursue it myself, though. I do have three suggestions:

  1. For testing out of distribution transfer, one option is to move the agents to a different environmental simulator. I expect this will do a better job of representing the distributional shift incurred by deploying into the real world.
  2. Consider feeding the agents their goals via natural language instruction, and accompany their RL training with BERT-like language modeling. Agents trained to follow natural language instructions seem vastly more useful (and slightly more aligned) than agents limited to receiving instructions via bitcodes, which is what DeepMind did. I expect future versions of XLand-like RL pretraining will do something like this.
  3. Consider using the Perceiver IO architecture. It’s a new transformer variant with linear time complexity in its input length, is explicitly designed to smoothly handle inputs of arbitrary dimensions and mixed modalities, and can also produce outputs of arbitrary dimensions and mixed modalities. I think it will turn out to be far more flexible than current transformers/CNNs.

Thank you!

  1. I agree that switching the simulator could be useful where feasible (you'd need another simulator with compatible state- and action-spaces and somewhat similar dynamics.)

  2. It indeed seems pretty plausible that instructions will be given in natural language in the future. However, I am not sure that would affect scaling very much, so I'd focus scaling experiments on the simpler case without NLP for which learning has already been shown to work.

  3. IIRC, transformers can be quite difficult to get to work in an RL setting. Perhaps this is different for PIO, but I cannot find any statements about this in the paper you link.

For exactly these kinds of needs and ideas, we have recently released an extremely efficient reimplementation of XLand based on MiniGrid, XLand-MiniGrid. We completely rewrote it in JAX, so a simple PPO can get millions of FPS during training. On 8 A100 GPUs, users can expect to get 1 trillion steps in under 40 hours. Have fun experimenting!

Twitter announcement:
Source code: