[AN #140]: Theoretical models that predict scaling laws


Ω 22

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.


Explaining Neural Scaling Laws and A Neural Scaling Law from the Dimension of the Data Manifold (Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma) (summarized by Rohin): We’ve seen lots of empirical work on scaling laws (AN #87), but can we understand theoretically why these arise? This paper suggests two different models for how power-law scaling laws could arise, variance-limited and resolution-limited scaling, and argues that neural nets are typically trained in the resolution-limited setting. In both cases, we have versions that occur when the dataset size D is large and the number of parameters P is low (parameter-limited), and when D is low and P is large (data-limited).

Recall that a scaling law is a power-law equation that predicts the test loss L as a function of P and D. In this paper, we consider cases where only one of the resources is the bottleneck, so that our power laws are of the form L = kP^(-α) or L = kD^(-α), for constants k and α. (For simplicity, we’re assuming that the minimum value of our loss function is zero.)

Resolution-limited scaling happens when either the dataset is too small to “resolve” (capture) the true underlying function, or when the model doesn’t have enough capacity to “resolve” (fit) the training dataset. In this case, we’re going to take the common ML assumption that while our observation space might be high-dimensional, the data itself comes from a low-dimensional manifold with dimension d, called the intrinsic dimension. We’ll model our neural net as transforming the input space into a roughly d-dimensional representation of the manifold, which is then used in further processing by later layers. Thus the output of the network is a simple function over this low-dimensional representation.

Let’s first consider the case where P is sufficiently large, so that we perfectly fit the training data, but D is limited. We can think of the training data as a “net” of points covering the true d-dimensional manifold. Intuitively, to halve the distance between the points (making the net “twice as fine”), we need ~2^d times as many points. Some simple algebraic manipulation tells us that distance between points would then scale as D^(-1/d).

How can we translate this to the test loss? Let’s assume a simple nearest neighbor classifier where, given a test data point, we simply predict the value associated with the nearest training data point. This is equivalent to assuming that our neural net learns a piecewise constant function. In this case, for a test data point drawn from the same distribution as the training set, that data point will be “near” some training data point and our model will predict the same output as for the training data point.

Under the assumption that our test loss is sufficiently “nice”, we can do a Taylor expansion of the test loss around this nearest training data point and take just the first non-zero term. Since we have perfectly fit the training data, at the training data point, the loss is zero; and since the loss is minimized, the gradient is also zero. Thus, the first non-zero term is the second-order term, which is proportional to the square of the distance. So, we expect that our scaling law will look like kD^(-2/d), that is, α = 2/d.

The above case assumes that our model learns a piecewise constant function. However, neural nets with Relu activations learn piecewise linear functions. For this case, we can argue that since the neural network is interpolating linearly between the training points, any deviation of the distance between the true value and the actual value should scale as D^(-2/d) instead of D^(-1/d), since the linear term is being approximated by the neural network. In this case, for loss functions like the L2 loss, which are quadratic in the distance, we get that α = 4/d.

Note that it is possible that scaling could be even faster, e.g. because the underlying manifold is simple or has some nice structure that the neural network can quickly capture. So in general, we might expect α >= 2/d and for L2 loss α >= 4/d.

What about the case when P is the bottleneck? Well, in this case, since the training data is not the bottleneck, it is presumably a sufficiently good approximation to the underlying function; and so we are just seeing whether the learned model can match the dataset. Once again, we make the assumption that the learned model gives a piecewise linear approximation, which by the same argument suggests a scaling law of X^(-α), with α >= 2/d (and α >= 4/d for the case of L2 loss), where X is the number of “parts” in the approximation. In the case of linear models, we should have X = P, but for neural networks I believe the authors suggest that we should instead have X = w, the width of the network. (One motivation is that in the infinite-width limit, neural networks behave like linear models.)

In variance-limited scaling for D, the scaling bottleneck is the randomness inherent in the sampling of the dataset from the underlying distribution. We can view the dataset as a random variable, implying that the gradient is also a random variable since it is a function of the training dataset. We can then consider the “error term” δG = G - G_inf, which is the difference between the finite-dataset gradients and the gradients for infinite data. We’ll make the assumption that you’re equally likely to be wrong in all directions -- if there’s a dataset that makes you a bit more likely to predict A, then there’s also a corresponding equally likely dataset that makes you a bit less likely to predict A. In that case, in expectation δG is zero, since on average the errors all cancel out. Since D is assumed to be large, we can apply the law of large numbers to deduce that the variance of δG will scale as 1/D.

Let us then consider the test loss as a function of the gradients. The test loss we actually get is L(G) = L(G_inf + δG). We can now Taylor expand this to get an expansion which tells us that the quantity we care about, L(G) - L(G_inf), is of the form AδG + B(δG)^2, where A and B are constants that depend on derivatives of the test loss in the infinite dataset case. We had already concluded that E[δG] = 0, and E[(δG)^2] is just the variance and so scales as 1/D, which implies that α = 1.

Here’s a slightly less mathematical and more conceptual argument for the same thing (though note that this feels like a sketchier argument overall):

  1. Variance of the gradient scales as 1/D by the law of large numbers
  2. Thus standard deviation scales as 1/√D
  3. Thus the deviation of the empirical estimate of the gradients scales as 1/√D
  4. Thus the deviation of the neural net parameters scales as 1/√D
  5. Thus the deviation of the output of the final layer scales as 1/√D
  6. Any linear dependence on this deviation would cancel out in expectation, since the deviation could either increase or decrease the test loss. However, quadratic dependences would add together. These would scale as (1/√D)^2, that is, 1/D.

The authors also suggest that a similar argument can be applied to argue that for parameters, the loss scales as 1/w, where w is the width of the network. This is variance-limited scaling for P. This again relies on previous results showing that neural networks behave like linear models in the limit of infinite width.

The authors use this theory to make a bunch of predictions which they can then empirically test. I’ll only go through the most obvious test: independently measuring the scaling exponent α and the intrinsic dimension d, and checking whether α >= 4/d. In most cases, they find that it is quite close to equality. In the case of language modeling with GPT, they find that α is significantly larger than 4/d, which is still in accordance with the equality (though it is still relatively small -- language models just have a high intrinsic dimension). Variance-limited scaling is even easier to identify: we simply measure the scaling exponent α and check whether it is 1.

Rohin's opinion: This seems like a solid attack on a theory of scaling. As we discussed last week, it seems likely that any such theory must condition on some assumption about the “simplicity of reality”; in this paper, that assumption is that the data lies on a low-dimensional manifold within a high-dimensional observation space. This seems like a pretty natural place to start, though I do expect that it isn’t going to capture everything.

Note that many of the authors’ experiments are in teacher-student models. In these models, a large teacher neural network is first initialized to compute some random function; a student network must then learn to imitate the teacher, but has either limited data or limited parameters. The benefit is that they can precisely control factors like the intrinsic dimension d, but the downside is that it isn’t immediately clear that the insights will generalize to real-world tasks and datasets. Their experiments with more realistic tasks are less clean, though I would say that they support the theory.



Bootstrapped Alignment (G Gordon Worley III) (summarized by Rohin): This post distinguishes between three kinds of “alignment”:

1. Not building an AI system at all,

2. Building Friendly AI that will remain perfectly aligned for all time and capability levels,

3. Bootstrapped alignment, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.

The post argues that optimization-based approaches can’t lead to perfect alignment, because there will always eventually be Goodhart effects.


Institutionalizing ethics in AI through broader impact requirements (Carina E. A. Prunkl et al) (summarized by Rohin): This short perspective analyzes the policy implemented by NeurIPS last year in which paper submissions were required to have a section discussing the broader impacts of the research. Potential benefits include anticipating potential impacts of research, acting to improve these impacts, reflecting on what research to do given the potential impacts, and improving coordination across the community. However, the policy may also lead to trivialization of ethics and governance (thinking that all the relevant thinking about impacts can be done in this single statement), negative attitudes towards the burden of writing such statements or responsible research in general, a false sense of security that the ethics are being handled, and a perception of ethics as something to be done as an afterthought.

The main challenges that can cause these sorts of negative effects are:

1. Analyzing broader impacts can be difficult and complex,

2. There are not yet any best practices or guidance,

3. There isn’t a clear explanation of the purpose of the statements, or transparency into how they will be evaluated,

4. It’s tempting to focus on the research that determines whether or not your paper is published, rather than the broader impacts statement which mostly does not affect decisions,

5. Researchers may have incentives to emphasize the beneficial impacts of their work and downplay the negative impacts.

6. Biases like motivated reasoning may affect the quality and comprehensiveness of impact statements.

To mitigate these challenges, the authors recommend improving transparency, setting expectations, providing guidance on how to write statements, improving incentives for creating good impact statements, and learning from experience through community deliberation. To improve incentives in particular, broader impact statements could be made an explicit part of peer review which can affect acceptance decisions. These reviews could be improved by involving experts in ethics and governance. Prizes could also be given for outstanding impact statements, similarly to best paper awards.

Rohin's opinion: I’ve been pretty skeptical of the requirement to write a broader impacts statement. My experience of it was primarily one of frustration, for a few reasons:

1. Forecasting the future is hard. I don’t expect a shallow effort to forecast to be all that correlated with the truth. There were lots of simple things I could say that “sound” right but that I don’t particularly expect to be true, such as “improving cooperation in multiagent RL will help build cooperative, helpful personal assistants”. It’s a lot harder to say things that are actually true; a real attempt to do this would typically be a paper in itself.

2. To the extent that the statement does affect reviews, I expect that reviewers want to hear the simple things that sound right; and if I don’t write them, it would probably be a strike against the paper.

3. Even if I did write a good statement, I don’t expect anyone to read it or care about it.

From a birds-eye view, I was also worried that if such statements do become popular, they’ll tend to ossify and build consensus around fairly shallow views that people come up with after just a bit of thought.

I do think many of the proposals in this paper would help quite a bit, and there probably is a version of these statements that I would like and endorse.



Mastering Atari with Discrete World Models (Danijar Hafner et al) (summarized by Flo): Model-based reinforcement learning can have better sample efficiency, allows for smarter exploration strategies, and facilitates generalization between different tasks. Still, previous attempts at model-based RL on the Atari Benchmark like Dreamer (AN #83) and SimPLe (AN #51) were unable to compete with model-free algorithms in terms of final performance. This paper presents DreamerV2, a model-based algorithm that outperforms DQN and its variants -- including Rainbow -- in terms of both median human- or gamer-normalized performance and on mean world-record normalized performance on Atari after 200M environment steps, achieving roughly 35% on the latter (25% if algorithm performance is clipped to max out at 100% for each game).

DreamerV2 learns a recurrent state-space model that stochastically encodes frames and a hidden state into a latent variable and uses the hidden state to predict the next value of the latent variable. Frames and reward are then reconstructed using both the hidden state and the latent variable. A policy is obtained by actor-critic training on the latent state space, leveraging parallelization to train on 468B imagined samples. As DreamerV2 does not use MCTS, it requires 8x less wall clock time to train than the more complicated but better performing MuZero Reanalyze (AN #75). Unlike earlier approaches, DreamerV2 uses a vector of categorical latent variables rather than gaussians to enable better model predictions for dynamics with multiple distinct modes, as well as KL-balancing (scaling up the importance of the transition loss compared to the entropy regularizer on the latent variable). Ablations confirm that the image reconstruction loss is crucial for DreamerV2's performance and that both the use of discrete latent variables and KL-balancing lead to significant improvements. Interestingly, preventing the gradients for reward prediction from affecting the world model does not affect performance at all.

Read more: Paper: Mastering Atari with Discrete World Models

Flo's opinion: It is worth noting that the authors use the Dopamine (AN #22) framework for evaluating the model-free baselines, meaning that a slightly stunted version of Rainbow is used on an evaluation protocol different from the original publication without retuning hyperparameters. That said, DreamerV2 definitely performs at a level similar to Rainbow, which is significant progress in model-based RL. In particular, the fact that the reward can be inferred from the world model even without gradients flowing back from the reward suggests transferability of the world models to different tasks with the same underlying dynamics.


A Theory of Universal Learning (Olivier Bousquet et al) (summarized by Zach): In machine learning, algorithms are presented with labeled examples of categories from a training dataset and the objective is to output a classifier that distinguishes categories on a validation dataset. The generalization ability of the classifier is usually measured by calculating the error rate of the classifications on the validation set. One popular way to display generalization capability as a function of training set size is to plot a learning curve. A learning curve is a function that outputs the performance of a learning algorithm as a function of the data distribution and training sample size. A faster decay rate for a learning curve indicates a better ability to generalize with fewer data.

In this paper, the authors characterize the conditions for a learning algorithm to have learning curves with a certain decay rate. A learning curve is produced from the decay rate according to the formula 1/rate. The authors show that there are only three universal rates: exponential, linear, and arbitrarily slow decay. Moreover, the authors show there are problem classes that can be learned quickly in each instance but are slow to learn in the worst-case. This stands in contrast to classical results which analyze only the worst-case performance of learning algorithms. This produces pessimistic bounds because the guarantee must hold for all possible data distributions. This is often stronger than what is necessary for practice. Thus, by looking at rates instead of the worst-case learning curve, the authors show that it is possible to learn more efficiently than what is predicted by classical theory.

Zach's opinion: This paper is mathematically sophisticated, but full of examples to illustrate the main points of the theory. More generally, work towards non-uniform bounds has become a popular topic recently as a result of classical generalization theory's inability to explain the success of deep learning and phenomena such as double-descent. These results could allow for progress in explaining the generalization capability of over-parameterized models, such as neural networks. Additionally, the theory presented here could lead to more efficient algorithms that take advantage of potential speedups over empirical risk minimization proved in the paper.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.


Ω 22

New Comment