In this post, we (Alexander Meulemans, David Lindner, Florian Dorner) propose to study scaling laws relating the generalization capability of Reinforcement Learning (RL) agents to problem complexity, inspired by DeepMind’s recent success in training agents that perform well in a range of environments. We motivate studying such scaling laws by highlighting variable problem complexity as a key difference between language modeling and RL, and then propose a few concrete experiments that we hope readers with some knowledge of reinforcement learning might be able to perform.
DeepMind recently trained “generally capable” agents behaving somewhat competently in an environment with procedurally generated maps and tasks called XLand, using open-ended reinforcement learning. This work has gotten a lot of attention and sparked discussions about the result’s relevance for AI timelines, including parallels to GPT being drawn. On one hand, the comparison to GPT makes sense, as the XLand results provide a proof of concept for achieving fairly general capabilities in the realm of agentic behaviour by using enough compute, just like GPT did in the realm of language modelling. On the other hand, it seems far from clear whether this advance in RL will have the same immediate economic effect language models seem to be having. As in language modelling, the training distribution for XLand will likely need to be related to the tasks the system is ultimately going to perform. But unlike in language modelling, where we train a system on large amounts of text from the same language(s) upstream tasks will be in, XLand still appears to be far from any real-world environment.
Therefore, to extrapolate from results in XLand, we don’t just require the same scaling laws relating performance to compute and dataset size we have for language models, but we also need to introduce an extra dimension for problem complexity and perhaps closeness to real world applications. As an analogy, imagine GPT-3 would have only been trained and evaluated on abstract language, or sentences beginning with a subject, using the vocabulary of basic English. How confident would you be in extrapolating from such a model’s scaling with compute to the less restricted language models we have at this point?
We believe that a better understanding of the relationship between problem complexity and the compute costs required for good generalization performance is a crucial crux for how we should update timelines based on the XLand results. Explaining neural scaling laws is a great first step in this direction, but it is unclear whether the concept of intrinsic dimension is sufficient for agents, which are usually not trained using supervised learning. In particular, it seems quite plausible that intricacies of Reinforcement Learning and Population Based Training, such as the use of bootstrapping, need for exploration, or evolutionary contingency could complicate scaling laws for agent performance.
To make initial progress on this problem, we propose to train agents on procedurally generated environments, using the same training algorithm, while varying problem complexity, model size, compute usage, and the amount of training samples. Then, the empirical relationship between these variables could be analyzed to derive empirical scaling laws, for example relating the amount of compute needed for a sufficient level of generalization performance to problem complexity.
In the rest of this post, we informally describe two versions of this project in more detail: An idealized version using XLand that actors with access to large amounts of compute could carry out, as well as a scaled-down and less resource-intensive version that is more realistic to pursue.
Experiment proposal in XLand
Say we had access to XLand and a large amount of computational resources, how could we study scaling laws about the generalization capability of trained agents? The basic scaling law that we are interested in would describe the amount of compute and data needed to achieve some amount of generalization performance for varying degrees of problem complexity. Hence, we need to properly define the terms sufficient generalization performance, problem complexity and amount of compute and data. Before doing so, we review some key terminology from the XLand paper:
- A world is a simulated region of 3-dimensional space. Worlds can have different topologies and contain both players and objects players can interact with.
- A goal is a predicate of the world state. In XLand, admissible goals can be written as a logical formula of a set of atomic predicates such as “Yellow sphere on green floor”.
- A game is defined as a tuple of n goals, one for each of n players.
- A task consists of a world, a game, and a set of n-1 agents with fixed policies, controlling one other player each.
Sufficient generalization performance
Open-ended learning could eventually be used to train agents that can perform robustly in a wide distribution of real-world applications that share an underlying structure, without the need for a lot of finetuning. One concrete example could be a control algorithm for a robotic arm that is able to perform a wide variety of manufacturing tasks, such that it can be directly integrated into existing assembly lines.
To obtain scaling laws for predicting when open-ended learning could produce such agents, we first need a measure of generalization performance. There are two important cases with a spectrum between them: (i) generalization to unseen environments and tasks within the training distribution and (ii), generalization to out-of-distribution environments and tasks.
- Within-distribution generalization: A straight-forward estimate for within-distribution generalization is to sample a large set of test tasks from the training distribution and measure a normalized score for each task. Then, sufficient generalization is reached when an agent’s normalized score is above a certain threshold for (almost) all test tasks. Focusing on the threshold rather than average performance makes sense, as we want to measure whether the agent is generally capable of solving a wide range of tasks, and a high average performance could be the result of high scores in specific tasks compensating for failures to generalize to other tasks. Lastly, it is important to ensure that there are sufficiently many test tasks with the desired level of complexity, even if simpler tasks are included in the training distribution for curriculum learning.
- Out-of-distribution generalization: Measuring out-of-distribution (OOD) generalization is much harder, and the focus should probably be on OOD tasks that are somewhat close to the training distribution, or have concrete relevance (as sequence classification has for language models). The former requires a meaningful measure of distance between tasks, which is difficult and could deserve a project on its own. Regarding the latter, we need arguments for both the relevance of the task, and that it is in fact OOD, but these arguments can be of a more qualitative nature. As a concrete example, the training environments could be limited to a subset of colors and objects, with new colors and objects introduced in the test tasks. In this case, the test tasks are obviously OOD, and are relevant if the downstream application requires the ability to handle novel objects. Similarly, the set of possible games can be limited during training, such that completely novel games can be introduced for testing. Once the test tasks are created, the measure for sufficient generalization performance can be constructed as in the within-distribution case.
Note that the specific characteristics of the scaling laws can heavily depend on the chosen performance measure. Hence, we recommend trying out different “natural” definitions of sufficient performance, and analyze whether this choice has a relevant impact on how compute requirements scale with complexity.
The aim of open-ended learning is to find a policy in policy space that performs robustly on a wide variety of tasks. Naively, the larger the policy space, the harder it might be in general to find a good policy and consequently the more data and compute resources might be needed. To provide a rough intuition, we can look at the stylized case of tabular RL with a deterministic policy. As the agent receives both the state and game description as inputs, the policy needs to define a specific action for each combination of state and game. This naively results in the following number of possible policies :
with being the action space, the state space, and the game space. Hence, we see that the magnitude of policy space (in this simplified case) scales exponentially with the magnitude of the state-space and game space, and polynomially with the magnitude of the action space. More realistically, a parameterized policy that can take advantage of similarities within states and games and is compatible with continuous spaces would be used. While this points in the direction of the exponential scaling depicted above being a severe overestimation, and the training signal might quickly rule out a large majority of policies, DeepMind's agents seem to use memory, which greatly inflates the amount of possible policies. Still, it appears useful to focus on three main dimensions of problem complexity: (1) the magnitude of the state space, (2) the magnitude of the game space and (3) the magnitude of the action space. In the multi-agent case, we have an extra fourth component: the number and policies of the other players in the environment. Let us briefly discuss the relevant aspects of each dimension in XLand:
- State space: The state space is determined by the area of the world the agents are playing in, its topological structure and the objects within the world. Hence, to vary the magnitude of the state space, one can make the area of the world bigger or smaller, or vary the allowed topological structures (e.g. the amount of possible elevation levels) and the amount of objects in the world. In addition, new types of objects could be introduced.
- Game space: As the goals in XLand are defined as a combination of predicates (e.g., hold an object, see a player, be in a specific area, etc.) the game space size can be varied by excluding or introducing types of predicates or changing how many predicates can be combined to define goals.
- Action space: The action space is defined by the degrees of freedom in movement of the agents and by the amount of available non-movement actions (e.g., shooting or picking up items). Varying the degrees of freedom for movement is probably difficult in XLand: Taking away degrees of freedom could prevent the agent from exploring the environment sufficiently, while introducing new degrees of freedom (e.g., jumping) could render some tasks trivial. The number of non-movement actions can likely be varied more easily. For example, agents could receive or lose the ability to throw objects for a short distance or shoot other players.
- Other agents: The problem complexity is also influenced by the number of players in the task and by the policies that they are using. As the other agents’ policies are a function of the self-play algorithm during training, it is hard to deliberately vary them, unless self-play is replaced by a predetermined curriculum of agents. As such, varying the amount of players appears more promising.
Amount of compute and data
As XLand is a simulated environment, there is no special cost associated with acquiring data and it can be incorporated in the general computational cost of the training. Hence, to track the amount of computational resources needed in training, we can estimate the number of floating point operations used during training, or more easily, track the GPU and CPU time for a fixed type of GPU and CPU. On the other hand, the network size or number of gradient updates and the amount of training samples could be varied independently to disentangle the contribution of the simulator and neural network training and better understand how things would change with more efficient simulators.
In summary, the aim of these experiments would be to find scaling laws for the amount of compute needed to reach a sufficient level of generalization performance as a function of problem complexity. As various aspects influence the complexity of the problem, a first approach is to investigate separate empirical scaling laws for each of the state-, action-, and game- dimensions. Next, the dependence of each of these laws’ form and parameters on the remaining aspects could be investigated. Ideally, there would be a simple formula relating the tuple of dimensions to the amount of necessary compute, perhaps even involving a plausible aggregate measure of complexity as an intermediate step.
Experiment proposals using less resources
So far, we discussed experiments using the algorithm and the environments proposed in the XLand paper. However, this is likely not feasible for most people reading this. Training agents in these environments is computationally very expensive and requires a large amount of resources. In addition, DeepMind did not release the XLand environment publicly. To make this proposal more accessible for a wide range of researchers, we discuss ideas for experiments that are easier to implement and perform, but could nevertheless yield insights about scaling laws for generalization.
The main bottleneck for implementing the approach proposed by DeepMind is the complexity of the environments and the resulting amount of resources required to train agents. Therefore, the most straightforward way to reduce the complexity of the experiments is to consider simpler environments. There is a fundamental trade-off here: we want to consider environments that are simple enough to implement and train agents on , but are complex enough to extrapolate insights to XLand and real-world applications.
Another question is how to design the training. It is likely that many components of the algorithm proposed by DeepMind are necessary for allowing the method to work in XLand, but are not crucial aspects of the method, and might look very different in future systems. Therefore, we propose to focus on the key components of V-MPO and (perhaps) population-based training, combined with simple network architectures.
One concrete type of environments that might have our desired properties could be 2-dimensional environments with point-mass dynamics, similar to OpenAI’s particle environments. While simpler than XLand, chiefly because of the missing third dimension, these environments allow for multi-agent interactions and the placement of objects, which allows for the definition of similar primitives as in the XLand paper, e.g., “agent one is close to the yellow square”. These primitives can then be combined to get a similar range of tasks as in XLand, presumably leading to interesting behavior due to the interaction between various agents with different goals and environment objects.
Now, we could use these environments to investigate scaling laws, by varying the different parameters of task complexity:
- State space: Change the size of the environment, or the number of objects in the environment.
- Action space: Give the agents different possible actions in addition to movement, e.g., picking up objects or interacting with each other.
- Game space: Change the number of primitives that can be used to define goals.
- Other agents: Change the number of players.
By measuring the generalization performance of the resulting agents as a function of these complexity parameters, we could gain a first indication whether scaling laws of the form that we discussed exist at all, and what form they might have. If the results of these small-scale experiments indicate that such scaling laws might exist, this could indicate that investigating them in more complex environments, like XLand, should have higher priority. Unfortunately it is unclear to what extent the results should be expected to generalize from these very simple environments to bigger experiments, so it would be crucial to continually investigate these scaling laws in bigger systems.
We proposed two different ways to investigate scaling laws of generalization performance of RL agents: (1) large-scale experiments in XLand, and (2) smaller experiments in point-mass environments. We believe that both would be valuable to better understand the amount of compute that might be needed to train general agents in more complex domains that are closer to economic relevance than XLand. Ultimately, it seems that (1) would be critical to really inform AI timeline estimates. However, (2) could provide important evidence on whether (1) would be useful.
All of these approaches come with a big amount of uncertainty. It is unclear if clean scaling laws exist at all for generalization performance in RL agents. If such laws existed, it would still remain unclear how useful they are given the large amount of independent dimensions of problem complexity, compared to the case of large language models. Would we need to treat these dimensions completely separately, or could scaling laws for the different dimensions be unified in a single (simple) formula? Finally, even if such unified scaling laws existed, real-world applications might not be easily modeled as more complex versions of XLand, and scaling might work differently for tasks that are closer to such applications.
Still, empirical scaling laws about RL generalization could be highly valuable in informing AI timeline estimates. Therefore, we think that such scaling laws could be a very impactful research direction, and we hope that some of our readers consider investigating some of the directions we layed out.
We would like to thank Robert Kirk, Tamay Besiroglu and Lukas Finnveden for feedback, and the Zurich AI Alignment reading group for a lively discussion that inspired this post. All mistakes are our own.
As an example, it seems quite unlikely that a system trained on English language only is going to be very useful for working with German text. ↩︎
Some complex learning problems might be a lot cheaper to obtain training data for than others, and it seems likely that problems involving interactions with humans are most relevant economically, but also on the expensive side. ↩︎
While we might ultimately want a three dimensional scaling law relating generalization performance to the amount of compute and to the problem complexity, we focus on a fixed performance measure. This is as performance is measured on a somewhat arbitrary scale, which could yield more complicated-looking scaling relationships and distract from the compute-complexity relationship. However, the performance history from training to a fixed given performance could be reused to investigate the three dimensional scaling law. ↩︎
Ideally, we would normalize the scores to the interval spanned by the performance of a random policy and the performance of an optimal policy for each task. In the XLand paper, the authors approximate the optimal policy using the best performing mixture of previously trained agents. ↩︎
If a continuous performance measure is required, the average normalized score within the lowest percentiles could be used. ↩︎
Depending on how exactly “out-of-distribution” is interpreted, sufficient generalization might not be attainable in this case, even for simpler environments. If that was true, attempts to understand scaling laws should probably focus on the within-distribution case until more progress has been made on OOD generalization. ↩︎
With the game modelled as part of the state space on the agent's side. ↩︎
In particular, some changes such as increasing the size of state space would likely increase the amount of compute needed for the simulator. ↩︎
This can quickly lead to a lot of needed experiments for covering the whole grid of the state-, action-, and game- dimensions. A possible solution for this would be to find a single aggregate measure of problem complexity, which can be validated on computationally cheap environments and then used for the expensive environments where it is not feasible to cover the whole grid. ↩︎
Such as the double exponential from earlier, or a related formula that additionally accounts for function approximation and multi-agent interaction. ↩︎
If DeepMind had used a large fraction of their available compute resources on XLand, rerunning the code with a large number of variations might not be feasible, even for them. ↩︎
For example, a small CNN could be used for memoryless policies, while a CNN could be combined with an LSTM for policies with memory. ↩︎
If no self-play is used, the range of policies used by the other agents can also be adapted. ↩︎