Same person as nostalgebraist2point0, but now I have my account back.


Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Most complexity measures give roughly similar values for the (relative) complexity of most objects


I'll write mostly about this statement, as I think it's the crux of our disagreement.

The statement may be true as long as we hold the meaning of "objects" constant as we vary the complexity measure.

However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can't simply say the complexity measures for space A on the original A-objects inevitably agree with those for space B on the translated B-objects.  Whether this is true depends on our choice of translation.

(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object.  Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects.  But the example shows that the claim about complexity measures will only hold if our translation is "good enough" in some sense.  If we don't have any idea what "good enough" means, something is missing from the story.)

In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure.  (More precisely, the hand-waviness about what metric structure survives the translation, if any.)  If there's no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object "close to" the specified one.

Basic information theory doesn't require a metric, only a measure.  There's no sense of "getting an output approximately right," only of "getting exactly the right output with high probability."  If you care about being approximately right according to some metric, this leads you to rate-distortion theory.

Both of these domains -- information theory without a metric, and with one -- define notions of incompressibility/complexity, but they're different.  Consider two distributions on R:

  1. The standard normal,
  2. The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R

According to basic information theory, these are equally simple/compressible.  (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)

But in rate-distortion theory, (1) is way more simple/compressible than (2).  If you're coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1].  Whereas if you're coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.

If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities.  When you remove the metric, they're identical.  So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.
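To make the two-distributions example concrete, here is a minimal sketch. It uses a binned stand-in for the standard normal, and my own simplification of "rate-distortion complexity" to the crudest possible case: the mean squared error of a single-codeword (rate-zero) code, which is just the variance. The bin count, the scatter range, and the function names are all assumptions made for illustration.

```python
import math
import random

def shannon_entropy(probs):
    """Entropy in bits. Depends only on the probabilities, not on locations."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def variance(points, probs):
    """Mean squared error of the best single-codeword (rate-zero) code."""
    mean = sum(x * p for x, p in zip(points, probs))
    return sum(p * (x - mean) ** 2 for x, p in zip(points, probs))

# (1) A binned approximation to the standard normal on [-4, 4].
n_bins = 80
width = 8.0 / n_bins
centers = [-4.0 + (i + 0.5) * width for i in range(n_bins)]
weights = [math.exp(-0.5 * c * c) for c in centers]
total = sum(weights)
probs = [w / total for w in weights]

# (2) The same pieces with the same probabilities, but each piece is moved
# to an arbitrary far-away location (hypothetical offsets of 1e6 to 1e8).
rng = random.Random(0)
scattered = [c + rng.choice([-1, 1]) * rng.uniform(1e6, 1e8) for c in centers]

# No metric: the two look identical.
entropy_normal = shannon_entropy(probs)
entropy_scattered = shannon_entropy(probs)  # same probabilities, by construction

# With a metric: the distortion of a coarse code differs enormously.
distortion_normal = variance(centers, probs)
distortion_scattered = variance(scattered, probs)

print(entropy_normal, entropy_scattered)
print(distortion_normal, distortion_scattered)
```

The entropy calculation never looks at `centers` or `scattered` at all, which is the whole point: the information needed to tell (1) from (2) lives entirely in the metric.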

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Like Rohin, I'm not impressed with the information theoretic side of this work.

Specifically, I'm wary of the focus on measuring complexity for functions between finite sets, such as binary functions.

Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n.  The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set.  I don't think this captures the kinds of function complexity we care about for NNs.


  • If $A, B$ are finite sets, then there are a finite number of functions $f: A \to B$.   Let's write $B^A$ for the finite set of such functions.
  • The authors view the counting measure on $B^A$ -- where every function is equally likely -- as "unbiased."
  • This choice makes sense if $A, B$ are truly unstructured collections of objects with no intrinsic meaning.
  • However, if there is some extra structure on them like a metric, it's no longer clear that "all functions are equally likely" is the right reference point.
  • Imposing a constraint that functions should use/respect the extra structure, even in some mild way like continuity, may pick out a tiny subset of $B^A$ relative to the counting measure.
  • Finally, if we pick a measure of simplicity that happens to judge this subset to be unusually simple, then any prior that prefers mildly reasonable functions (eg continuous ones) will look like a simplicity prior.

This is much too coarse a lens for distinguishing NNs from other statistical learning techniques, since all of them are generally going to involve putting a metric on the input space.

Let's see how this goes wrong in the Shannon entropy argument from this paper.

  • The authors consider (a quantity equivalent to) the fraction of inputs in the input set $A$ for which a given function outputs $1$.
  • They consider a function simpler if this fraction is close to 1 or 0, because then it's easier to compress.
  • With the counting measure, "most" functions output $1$ about half of the time.  (Like the binomial distribution -- there are lots of different ways you can flip 5 tails and 5 heads, but only one way to flip 10 heads.)
  • To learn binary functions with an NN, they encode the inputs as binary vectors.  They study what happens when you feed these to either (A) a linear model, or (B) a ReLU stack, with random weights.
  • It turns out that the functions expressed by these models are much more likely than the counting measure to assign a single label ($0$ or $1$) to most inputs.
  • Why?
    • For a random function on an input space of size $2^n$, you need to roll $2^n$ independent random variables.  Each roll affects only one input element.
    • But when you encode the inputs as vectors of length $n$ and feed them into a model, the layers of the model have weights that are also $n$-vectors.  Each of their components affects many input elements at once, in the same direction.  This makes it likely for the judgments to clump towards $0$ or $1$.
    • For example, with the linear model with no threshold, if we roll a weight vector whose elements are all positive, then every input maps to $1$.  This happens a fraction $2^{-n}$ of the time.  But only one boolean function maps every input to $1$, so the counting measure would give this probability $2^{-2^n}$.
    • This doesn't seem like a special property of neural nets.  It just seems like a result of assigning a normed vector space structure to the inputs, and preferring functions that "use" the structure in their labeling rule.  "Using" the structure means any decision you make about how to treat one input element has implications for others (because they're close to it, or point in the same direction, or something).  Thus you have fewer independent decisions to make, and there's a higher probability they all push in the same direction.
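Here is a rough sketch of the clumping effect described in the bullets above. It is not the paper's experiment: the input length (7 bits), the Gaussian weights, the trial count, and all the function names are my own choices for illustration. It compares the fraction of inputs labelled 1 under a random thresholdless linear model against the same fraction under the counting measure.

```python
import itertools
import random

def label_fraction(weights):
    """Fraction of binary inputs x with w . x > 0 for a fixed weight vector."""
    n = len(weights)
    ones = 0
    for x in itertools.product([0, 1], repeat=n):
        if sum(w * xi for w, xi in zip(weights, x)) > 0:
            ones += 1
    return ones / 2 ** n

def random_linear_fraction(n, rng):
    """Label fraction for a linear model with i.i.d. standard normal weights."""
    return label_fraction([rng.gauss(0.0, 1.0) for _ in range(n)])

def random_function_fraction(n, rng):
    """Label fraction for a function drawn from the counting measure:
    one independent coin flip per input element."""
    return sum(rng.getrandbits(1) for _ in range(2 ** n)) / 2 ** n

def spread(fractions):
    """Mean absolute distance of the label fraction from 1/2."""
    return sum(abs(f - 0.5) for f in fractions) / len(fractions)

rng = random.Random(0)
n, trials = 7, 500
linear = [random_linear_fraction(n, rng) for _ in range(trials)]
uniform = [random_function_fraction(n, rng) for _ in range(trials)]

print("linear model spread:    ", spread(linear))
print("counting measure spread:", spread(uniform))

# All-positive weights label every input 1 except the all-zero input,
# which sits exactly on the decision boundary (w . x = 0).
print(label_fraction([1.0] * n))
```

Nothing in `label_fraction` is specific to neural nets. The clumping comes from sharing 7 weights across 128 inputs rather than rolling 128 independent coins.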

Sort of similar remarks apply to the other complexity measure used by the authors, LZ complexity.  Unlike the complexity measure discussed above, this one does implicitly put a structure on the input space (by fixing an enumeration of it, where the inputs are taken to be bit vectors, and the enumeration reads them off in binary).

"Simple" functions in the LZ sense are thus ones that respond to binary vectors in (roughly) a predictable way.  What does it mean for a function to respond to binary vectors in a predictable way?  It means that knowing the values of some of the bits provides information about the output, even if you don't know all of them.  But since our models are encoding the inputs as binary vectors, we are already setting them up to have properties like this.

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

I don't think this step makes sense:

Then we look at the scaling law chart you just provided us, and we look at those L-shaped indifference curves, and we think: OK, so a task which can't be done for less than 10e15 params is a task which requires 10e15 data points also.

In the picture, it looks like there's something special about having a 1:1 ratio of data to params.  But this is a coincidence due to the authors' choice of units.

They define "one data point" as "one token," which is fine.  But it seems equally defensible to define "one data point" as "what the model can process in one forward pass," which is ~1e3 tokens.  If the authors had chosen that definition in their paper, I would be showing you a picture that looked identical except with different numbers on the data axis, and you would conclude from the picture that the brain should have around 1e12 data points to match its 1e15 params!

To state the point generally, the functional form of the scaling law says nothing about the actual ratio D/N where the indifference curves have their cusps.  This depends on your choice of units.  And, even if we were careful to use the same units, this ratio could be vastly different for different systems, and people would still say the systems "have the same scaling law."  Scaling is about relationships between differences, not relationships between absolute magnitudes.
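A small sketch of the units point, under a stand-in assumption: suppose the cusps of the indifference curves lie on a power law D = k * N^p. The prefactor `K_TOKENS` below is purely hypothetical and the exponent is only illustrative. Changing what counts as "one data point" rescales k (and so the ratio D/N at any given N) but leaves the exponent, which is the thing the scaling law actually describes, untouched.

```python
import math

def cusp_data(n_params, k, p):
    """Hypothetical cusp location D = k * N**p on the indifference curves."""
    return k * n_params ** p

def loglog_slope(f, n1, n2):
    """Slope of log f(N) against log N between two parameter counts."""
    return (math.log(f(n2)) - math.log(f(n1))) / (math.log(n2) - math.log(n1))

P = 0.74                 # illustrative exponent
K_TOKENS = 1.0           # hypothetical prefactor, with D measured in tokens
TOKENS_PER_PASS = 1e3    # "one data point" = one forward pass of ~1e3 tokens
K_PASSES = K_TOKENS / TOKENS_PER_PASS

def data_in_tokens(n_params):
    return cusp_data(n_params, K_TOKENS, P)

def data_in_passes(n_params):
    return cusp_data(n_params, K_PASSES, P)

n = 1e15
print("D/N in tokens:", data_in_tokens(n) / n)
print("D/N in passes:", data_in_passes(n) / n)
print("slope in tokens:", loglog_slope(data_in_tokens, 1e9, 1e15))
print("slope in passes:", loglog_slope(data_in_passes, 1e9, 1e15))
```

The two D/N ratios differ by exactly the unit conversion factor, while the two slopes agree.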

On the larger topic, I'm pessimistic about our ability to figure out how many parameters the brain has, and even more pessimistic about our ability to understand what a reasonable scale for "a data point" is.  This is mostly for "Could a Neuroscientist Understand a Microprocessor?"-type reasons.  I would be more interested in an argument that starts with upper/lower bounds that feel absurdly extreme but relatively certain, and then tries to understand if (even) these weak bounds imply anything interesting, rather than an argument that aims for a point estimate or a subjective distribution.

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Actually, I think I spoke too soon about the visualization... I don't think your image of L(D) and L(N) is quite right.

Here is what the actual visualization looks like.  More blue = lower loss, and I made it a contour plot so it's easy to see indifference curves of the loss.

In these coordinates, L(D) and L(N) are not really straight lines, but they are close to straight lines when we are far from the diagonal line:

  • If you look at the upper left region, the indifference curves are parallel to the vertical (N) axis.  That is, in this regime, N doesn't matter and loss is effectively a function of D alone.
    • This is L(D).
    • It looks like the color changes you see if you move horizontally through the upper left region.
  • Likewise, in the lower right region, D doesn't matter and loss depends on N alone.
    • This is L(N).
    • It looks like the color changes you see if you move vertically through the lower right region.

To restate my earlier claims... 

If either N or D is orders of magnitude larger than the other, then you get close to the same loss you would get from N ~ D ~ (whichever OOM is lower).  So, setting eg (N, D) = (1e15, 1e12) would be sort of a waste of N, achieving only slightly lower loss than (N, D) = (1e12, 1e12).

This is what motivates the heuristic that you scale D with N, to stay on the diagonal line.

On the other hand, if your goal is to reach some target loss and you have resource constraints, what matters is whichever resource constraint is more restrictive.  For example, if we were never able to scale D above 1e12, then we would be stuck achieving a loss similar to GPT-3, never reaching the darkest colors on the graph.

When I said that it's intuitive to think about L(D) and L(N), I mean that I care about which target losses we can reach.  And that's going to be set, more or less, by the highest N or the highest D we can reach, whichever is more restrictive.

Asking "what could we do with an N=1e15 model?" (or any other number) is kind of a weird question from the perspective of this plot.  It could mean either of two very different situations: either we are in the top right corner with N and D scaled together, hitting the bluest region ... or we are just near the top somewhere, in which case our loss is entirely determined by D and can be arbitrarily low.

In Ajeya's work, this question means "let's assume we're using an N=1e15 model, and then let's assume we actually need that many parameters, which must mean we want to reach the target losses in the upper right corner, and then let's figure out how big D has to be to get there."

So, the a priori choice of N=1e15 is driving the definition of sufficient performance, defined here as "the performance which you could only reach with N=1e15 params".

What feels weird to me -- which you touched on above -- is the way this lets the scaling relations "backseat drive" the definition of sufficient quality for AGI.  Instead of saying we want to achieve some specific thing, then deducing we would need N=1e15 params to do it... we start with an unspecified goal and the postulate that we need N=1e15 params to reach it, and then derive the goal from there.

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

You can't have more D than you have compute, in some sense, because D isn't the amount of training examples you've collected, it's the amount you actually use to train... right? So... isn't this a heuristic for managing compute? It sure seemed like it was presented that way.

This is a subtle and confusing thing about the Kaplan et al papers.  (It's also the subject of my post that I linked earlier, so I recommend you check that out.)

There are two things in the papers that could be called "optimal compute budgeting" laws:

  • A law that assumes a sufficiently large dataset (ie effectively infinite dataset), and tells you how to manage the tradeoff between steps $S$ and params $N$.
  • The law we discussed above, that assumes a finite dataset, and then tells you how to manage its size $D$ vs params $N$.

I said the $D$ vs $N$ law was "not a heuristic for managing compute" because the $S$ vs $N$ law is more directly about compute, and is what the authors mean when they talk about compute optimal budgeting.

However, the $D$ vs $N$ law does tell you about how to spend compute in an indirect way, for the exact reason you say, that $D$ is related to how long you train.  Comparing the two laws yields the "breakdown" or "kink point."

Do you agree or disagree? ... I take [you] to mean that you think the human brain could have had almost identical performance with much fewer synapses, since it has much more N than is appropriate given its D?

Sorry, why do you expect I disagree?  I think I agree.  But also, I'm not really claiming the scaling laws say or don't say anything about the brain, I'm just trying to clarify what they say about (specific kinds of) neural nets (on specific kinds of problems).  We have to first understand what they predict about neural nets before we can go on to ask whether those predictions generalize to explain some other area.

Perhaps it would help me if I could visualize it in two dimensions

This part is 100% qualitatively accurate, I think.  The one exception is that there are two "optimal compute" lines on the plot with different slopes, for the two laws referred to above.  But yeah, I'm saying we won't be on either of those lines, but on the L(N) or the L(D) line.

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance.


The scaling laws from the Kaplan et al papers do tell you this.

The relevant law is $L(N, D)$, for the early-stopped test loss given parameter count $N$ and data size $D$.  It has the functional form

$$L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D}$$

with $\alpha_N \sim 0.076$, $\alpha_D \sim 0.103$.

The result that you should scale $D \propto N^{0.74}$ comes from trying to keep the two terms in this formula about the same size.

This is not exactly a heuristic for managing compute (since $D$ is not dependent on compute, it's dependent on how much data you can source).  It's more like a heuristic for ensuring that your problem is the right level of difficulty to show off the power of this model size, as compared to smaller models.

You always can train models that are "too large" on datasets that are "too small" according to the heuristic, and they won't diverge or do poorly or anything.  They just won't improve much upon the results of smaller models.
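To put rough numbers on "won't improve much," here is a sketch that evaluates the Kaplan et al form L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D. The exponents are the ones quoted for this law; the constants `N_C` and `D_C` are the fitted values as I recall them from the paper, so treat them (and the specific losses printed) as illustrative assumptions rather than authoritative figures.

```python
ALPHA_N = 0.076
ALPHA_D = 0.103
N_C = 6.4e13   # assumed fitted constant (parameters)
D_C = 1.8e13   # assumed fitted constant (tokens)

def loss(n_params, n_tokens):
    """Early-stopped test loss L(N, D) under the assumed functional form."""
    param_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (param_term + data_term) ** ALPHA_D

def loss_data_limited(n_tokens):
    """L(D): the N -> infinity limit of L(N, D)."""
    return (D_C / n_tokens) ** ALPHA_D

def loss_param_limited(n_params):
    """L(N): the D -> infinity limit of L(N, D)."""
    return (N_C / n_params) ** ALPHA_N

small = loss(1e12, 1e12)       # model and data scaled together
oversized = loss(1e15, 1e12)   # model "too large" for the dataset
balanced = loss(1e15, 1e15)    # both scaled up

print("N=1e12, D=1e12:", small)
print("N=1e15, D=1e12:", oversized)
print("N=1e15, D=1e15:", balanced)
print("L(D) at D=1e12:", loss_data_limited(1e12))
```

Under these assumed constants, the oversized model lands almost exactly on the L(D) curve: adding three orders of magnitude of parameters recovers only a small part of the gap between the small and balanced runs.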

In terms of the above, you are setting $N \sim 10^{15}$ and then asking what $D$ ought to be.  If the heuristic gives you an answer that seems very high, that doesn't mean the model is "not as data efficient as you expected."  Rather, it means that you need a very large dataset if you want a good reason to push the parameter count up to $N \sim 10^{15}$ rather than using a smaller model to get almost identical performance.

I find it more intuitive to think about the following, both discussed in the papers:

  • $L(D)$, the $N \to \infty$ limit of $L(N, D)$
    • meaning: the peak data efficiency possible with this model class
  • $L(N)$, the $D \to \infty$ limit of $L(N, D)$
    • meaning: the scaling of loss with parameters when not data-constrained but still using early stopping

If the Kaplan et al scaling results are relevant for AGI, I expect one of these two limits to provide the relevant constraint, rather than a careful balance between $N$ and $D$ to ensure we are not in either limit.

Ultimately, we expect AGI to require some specific-if-unknown level of performance (ie crossing some loss threshold).  Ajeya's approach essentially assumes that we'll cross this threshold at a particular value of $N$, and then further assumes that this will happen in a regime where data and compute limitations are around the same order of magnitude.

I'm not sure why that ought to be true: it seems more likely that one side of the problem will become practically difficult to scale in proportion to the other, after a certain point, and we will essentially hug tight to either the $L(N)$ or the $L(D)$ curve until it hits that threshold.

See also my post here.

human psycholinguists: a critical appraisal

I wrote this post about a year ago.  It now strikes me as an interesting mixture of

  1. Ideas I still believe are true and important, and which are (still) not talked about enough
  2. Ideas that were plausible at the time, but are much less so now
  3. Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time

In category 1 (true, important, not talked about enough):

  • GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans.
  • Much scholarly ink has been spilled over questions of the form "what would it take, computationally, to do X?" -- where X is something GPT-2 can actually do.  Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports.
  • Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work.

In category 2 (plausible then but not now):

  • "The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried."
    • I now think this is much less likely thanks to the 2 OpenAI scaling papers in 2020.
    • The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data
    • The second paper showed that the things we know about transformers from the text domain generalize very well to image/video/math
    • I now think transformers are just a "good default architecture" for our current compute regime and may not have special linguistic properties
  • I'm finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had similar preconceptions to me but was misreading the current evidence.
    • I now think he's more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and will maintain this idea in the face of arbitrary feats of competence.

In category 3 (misleading):

  • I overstated the similarity between what Marcus wanted in 2001, and what has actually occurred.
    • I think Marcus wanted neural nets to be designed in a very top-down, constrained way, baking in lots of human prior knowledge.
    • ConvNets do bake in (a very simple kind of) prior knowledge.
    • But, though LSTMs and transformers are more "structured" than fully connected nets, the structure is not intended to encode prior knowledge.
    • Nothing in the recently successful architectures looks like the deliberate design, aimed at enforcing known linguistic regularities, that Marcus argued for.
    • I was aware of the vast gap between "more structure than the literal minimum possible" and "the kind of structure Marcus wanted," but conflated the two.  Possibly because I thought the resulting irony was appealing, and/or because it suggested the disagreement was illusory and was thus emotionally appealing.

In sum, I still like the writing and humor in this post, and I think it makes some important observations, but I also think it leaves the reader with some importantly wrong impressions.

Fourth Wave Covid Toy Modeling

Rt can go below one in Zvi's model.  It just takes an even higher rate of new infections.

Here's the same picture, with the horizontal axis extended so this is visible:

Of course, in the real world, Rt dips below one all the time, as you can see in the colored points.

As a dramatic example, Zvi's model is predicting the future forward from 12/23/20.  But a mere week before that date, Rt was below one!

Fourth Wave Covid Toy Modeling

Thanks!  This is exactly the kind of toy model I thought would help move these discussions forward.

The part I'm most suspicious of is the model of the control system.  I have written a Colab notebook exploring the issue in some detail, but briefly:

  • If you run the control system model on the past (2020), it vastly over-predicts R.
    • This is true even in the very recent past, when pandemic fatigue should have "set in."
  • Of course, by your assumptions, it should over-predict past R to some extent.  Because we now have pandemic fatigue, and didn't then.
  • However:
    • It seems better to first propose a model we know can match past data, and then add a tuning term/effect for "pandemic fatigue" for future prediction.
    • Because this model can't predict even the very recent past, it's not clear it models anything we have observed about pandemic fatigue (ie the observations leading us to think pandemic fatigue is happening).
    • Instead, it effectively assumes a discontinuity at 12/23/20, where a huge new pandemic fatigue effect turns on.  This effect only exists in the future; if it were turned on in the past, it would have swamped all other factors.

To get a sense of scale, here is one of the plots from my notebook:

The colored points show historical data on R vs. the 6-period average, with color indicating the date.

  • The first thing that stands out is that these two variables are not even approximately in a one-to-one relationship.
  • The second thing that stands out is that, if you were to fit some one-to-one relationship anyway, it would be very different from the toy model here.
  • Third thing: the toy model's baseline R is anchored to the "top of a hill" on a curve that has been oscillating quickly.  With an exponent of zero, it would stay stuck at the top of the recent hills, i.e. it would still over-predict the recent past.  (With a positive exponent, it shoots above those hills.)

More general commentary on the issue:

  • It seems like you are
    1. ... first, assuming that the control system sets R in response to infections
    2. ... then, observing that we still have R~1 (as always), despite a vast uptick in infections
    3. ... then, concluding that the control system has drastically changed all of a sudden, because that's the only way to preserve the assumption (1)
  • Whereas, it seems more natural to take (3) as evidence that (1) was wrong.

In other words, you are looking at a mostly constant R (with a slight sustained recent upswing), and concluding that this lack of a change is actually the result of two large changes that cancel out:

  1. Control dynamics that should make R go down
  2. A new discontinuity in control dynamics that conspires to exactly cancel #1, preserving a ~constant R

When R has been remarkably constant the whole time, I'm suspicious of introducing a sudden "blast" of large changes in opposing directions that net out to R still staying constant.  What evidence is there for this "blast"?

(The recent trajectory of R is not evidence for it, as discussed above: it's impossible to explain recent R with these forces in play.  They have to have suddenly appeared, like a mean Christmas present.)

My model of the R/cases trends is something like:

  • "R is always ~1 with noise/oscillations"
  • "cases are exponential in R, so when the noise/oscillations conspire upwards for a while, cases blow up"

The missing piece is what sets the noise/oscillations, because if we can control that we can help.  However, any model of the noise/oscillations must calibrate them so it reproduces 2020's tight control around R~1.

This tight control was a surprise and is hard to reproduce in a model, but if our model doesn't reproduce it, we will go on being surprised by the same thing that surprised us before.
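My two-line model of R and cases can be sketched in a few lines. This is a toy, not a fit: I'm assuming one period equals one generation of infection (so cases multiply by R each period), and the R values of 1.1 and 0.9 are hypothetical numbers chosen to stay close to 1.

```python
def simulate_cases(initial_cases, r_values):
    """Multiply cases by R once per period (assumed: one period = one generation)."""
    cases = [initial_cases]
    for r in r_values:
        cases.append(cases[-1] * r)
    return cases

# R oscillating tightly around 1: cases stay roughly flat.
oscillating = simulate_cases(1000.0, [1.1, 0.9] * 5)

# The same small deviation, but pointing upwards for ten periods in a row:
# cases compound and more than double.
run_up = simulate_cases(1000.0, [1.1] * 10)

print("after oscillation:", oscillating[-1])
print("after run-up:     ", run_up[-1])
```

In both runs R never strays more than 0.1 from 1. What differs is only whether the deviations alternate or line up, which is why a model of what sets the noise/oscillations is the piece that matters.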

DALL-E by OpenAI

Very interesting!

The approach to images here is very different from Image GPT.  (Though this is not the first time OpenAI has written about this approach -- see the "Image VQ" results from the multi-modal scaling paper.)

In Image GPT, an image is represented as a 1D sequence of pixel colors.  The pixel colors are quantized to a palette of size 512, but still represent "raw colors" as opposed to anything more abstract.  Each token in the sequence represents 1 pixel.

In DALL-E, an image is represented as a 2D array of tokens from a latent code.  There are 8192 possible tokens.  Each token in the sequence represents "what's going on" in a roughly 8x8 pixel region (because they use 32x32 codes for 256x256 images).

(Caveat: The mappings from pixels-->tokens and tokens-->pixels are contextual, so a token can influence pixels outside "its" 8x8 region.)

This latent code is analogous to the BPE code used to represent tokens (generally words) for text GPT.  Like BPE, the code is defined before doing generative training, and is presumably fixed during generative training.  Like BPE, it chunks the "raw" signal (pixels here, characters in BPE) into larger, more meaningful units.

This is like a vocabulary of 8192 "image words."  DALL-E "writes" a 32x32 array of these image words, and then a separate network "decodes" this discrete array to a 256x256 array of pixel colors.
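The arithmetic behind those numbers, as a quick sketch. The only inputs are the figures quoted above (256x256 images, 32x32 codes, 8192 latent tokens, a 512-color palette for Image GPT). The last quantity is a hypothetical comparison I'm adding: how long the sequence would be if you used one token per pixel at 256x256, which is not a configuration either model is stated to use.

```python
import math

IMAGE_SIDE = 256      # DALL-E output is 256x256 pixels
GRID_SIDE = 32        # DALL-E writes a 32x32 array of latent tokens
LATENT_VOCAB = 8192   # number of possible "image words"
PALETTE_SIZE = 512    # Image GPT's quantized color palette

# Each latent token covers a square region of this many pixels per side.
region_side = IMAGE_SIDE // GRID_SIDE            # 8

# Sequence length for one image in the latent code.
tokens_per_image = GRID_SIDE ** 2                # 1024

# Information capacity per token, in bits.
bits_per_latent_token = math.log2(LATENT_VOCAB)  # 13.0
bits_per_pixel_token = math.log2(PALETTE_SIZE)   # 9.0

# Hypothetical: one token per pixel at the same resolution.
pixel_tokens_same_image = IMAGE_SIDE ** 2        # 65536
shortening_factor = pixel_tokens_same_image // tokens_per_image  # 64

print(region_side, tokens_per_image, shortening_factor)
print(bits_per_latent_token, bits_per_pixel_token)
```

So each "image word" stands in for 64 pixels while carrying only a few more bits than a single palette color, which is the sense in which it has to be a more abstract unit.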

Intuitively, this feels closer than Image GPT to mimicking what text GPT does with text.  Pixels are way lower-level than words; 8x8 regions with contextual information feel closer to the level of words.

As with BPE, you get a head start over modeling the raw signal.  As with BPE, the chunking may ultimately be a limiting factor.  Although the chunking process here is differentiable (a neural auto-encoder), so it ought to be adaptable in a way BPE is not.

(Trivia: I'm amused that one of their visuals allows you to ask for images of triangular light bulbs -- the example Yudkowsky used in LOGI to illustrate the internal complexity of superficially atomic concepts.)
