Previously, I argued that emergent phenomena in machine learning mean that we can't rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across "phase transitions" caused by emergent behavior.

This might seem like a contradiction, but I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to think carefully about which generalization is the right one, and about what might stop any given generalization from holding.

I don't think many people would contest the claim that empirical investigation can uncover deep and generalizable truths. This is one of the big lessons of physics, and while some might attribute physics' success to math instead of empiricism, I think it's clear that you need empirical data to point to the right mathematics.

However, just invoking physics isn't a good argument, because physical laws have fundamental symmetries that we shouldn't expect in machine learning. Moreover, we care specifically about findings that continue to hold up after some sort of emergent behavior (such as few-shot learning in the case of ML). So, to make my case, I'll start by considering examples in deep learning that have held up in this way. Since "modern" deep learning hasn't been around that long, I'll also look at examples from biology, a field that has been around for a relatively long time and where More Is Different is ubiquitous (see Appendix: More Is Different In Other Domains).

Empirical Generalization in Deep Learning

I'll consider three examples in deep learning: adversarial examples, data efficiency, and out-of-distribution generalization.

Adversarial examples. Adversarial examples were first discovered in 2013, a year after the AlexNet paper (which arguably marked the start of "modern" deep learning). Since then, there have been at least two qualitative changes in deep networks---pretraining to provide better inductive bias, and the emergence of few-shot learning---plus some smaller changes in architecture. As far as I know, adversarial examples affect every neural network model that exists. Moreover, the main (partial) remedy, adversarial training, is the same in every architecture and domain.
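
To make the idea concrete, here is a minimal sketch of how an adversarial example can be constructed. It is an illustration under stated assumptions rather than anything specific to the results above: the model is a toy logistic regression with hand-picked weights (`w`, `b`, `x`, and `eps` are all made up for the example), and the attack is the fast gradient sign method, which I am choosing because it is simple to write down. Adversarial training would, roughly, add points like `x_adv` back into the training set with their correct labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Hard 0/1 prediction of the toy logistic model."""
    return int(sigmoid(np.dot(w, x) + b) >= 0.5)

def logistic_loss(w, b, x, y):
    """Cross-entropy loss of the model on a single example (x, y)."""
    p = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    return (sigmoid(np.dot(w, x) + b) - y) * w

def fgsm(w, b, x, y, eps):
    """Move each input coordinate by eps in the direction that increases the loss."""
    return x + eps * np.sign(input_gradient(w, b, x, y))

# Hypothetical model parameters and input, chosen only for illustration.
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])
y = 1.0

x_adv = fgsm(w, b, x, y, eps=0.5)
clean_loss = logistic_loss(w, b, x, y)
adv_loss = logistic_loss(w, b, x_adv, y)

print("clean prediction:", predict(w, b, x), "loss:", clean_loss)
print("adversarial prediction:", predict(w, b, x_adv), "loss:", adv_loss)
```

In this toy setup, a perturbation bounded by `eps` in every coordinate is enough to flip the prediction. Real attacks on deep networks follow the same recipe but compute the input gradient by backpropagation.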

Data efficiency. Starting around 2016, there were papers showing that learned representations from pre-trained models were more data-efficient than randomly initialized models. Moreover, it seemed that pre-training on more and better data increased data efficiency further. Taken to its logical extreme, this meant that with enough pre-training data you should be able to learn from very few examples---which is what's happened, for both fine-tuning and few-shot learning.

The findings above are in computer vision and NLP, but I'd bet that in pretty much any domain more unsupervised data will mean you need less supervised data, and that this trend will hold until you're close to information-theoretic limits (i.e. needing only a handful of examples). I also expect this to continue holding even after ML models gain some new emergent capability such as good long-term planning.

Out-of-distribution generalization. This one is a bit more fuzzy and qualitative, and is a prediction about the future rather than empirical evidence about the past. The question is: how will neural networks behave in "out-of-distribution" situations where the training data hasn't fully pinned down their behavior? On a spectrum from "completely randomly" (0) to "exactly as intended" (10), my current view is around an 8/10. Intuitively, neural networks "want" to generalize, and will make reasonable extrapolations as long as:

  • The in-distribution data is reasonably diverse.
  • The in-distribution accuracy is high (for a binary task, something like 97% or higher).

I don't mean that in these cases they will always get good OOD accuracy. But I think the model will pick from some fairly low-dimensional space of "plausible" generalizations. A model trained only on images from the United States might be confused by French street signs, but it will mostly fail by either ignoring the text or substituting a perceptually similar American sign.

Another way of putting this is that in domains where a neural net is proficient, there is a relatively low-dimensional space of "possible" generalizations that the network might pick. This is intuitively consistent with the point above on data efficiency---since the possibility space is low-dimensional, it doesn't take too much data to identify the "right" generalization.
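
As a rough way to see why a small possibility space means little data is needed, here is a toy calculation. It assumes something much cruder than the claim above: that the space is a finite set of equally likely candidate generalizations, in which case singling out one of them takes log2(N) bits. The function name and the candidate counts are hypothetical, picked only to show the scale.

```python
import math

def bits_to_identify(num_candidates: int) -> float:
    """Bits needed to single out one of `num_candidates` equally likely options."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return math.log2(num_candidates)

# Hypothetical sizes for a "small" and a "large" space of generalizations.
small_space_bits = bits_to_identify(1_024)
large_space_bits = bits_to_identify(2 ** 100)

print("small space:", small_space_bits, "bits")
print("large space:", large_space_bits, "bits")
```

The point of the sketch is only that the required information grows with the logarithm of the number of live candidates, so narrowing the space of plausible generalizations sharply reduces how much signal is needed to pick the right one.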

I expect this to continue to hold as neural nets become more powerful: concretely, as long as a model is proficient at a task, even fairly weak signals about how it "should" behave in a new situation will be enough to push it in the right direction.

How This Relates to Human-Aligned AI

Not only do I expect the trends above to robustly hold, I also think they are each important components for thinking about safe AI.

First, any strategy for making future ML systems safe either needs a solution to adversarial examples or needs to work in spite of them. I would also bet that any such solution will feature adversarial training as a major component. Now, maybe we didn't need the empirical data to conclude this, and it should have just been obvious a priori. But the papers introducing these concepts have thousands of citations each, so if these sorts of things are obvious a priori to you, then you could instantly become one of the most successful ML researchers. Unless you're Ian Goodfellow, I'm a bit skeptical.

Second, given the prevalence of reward functions learned from human feedback, we might be concerned that AI will learn to fool the human supervisors rather than doing what humans actually want. If "fool people" and "do what they want" were merely two blips within some limitless space of ways that a network might interpret its reward signal, then we'd face a pretty intractable problem: almost all points in the space of possible network behaviors would be bad (something other than "do what humans want") and it would be hard to even locate the good solution.

However, I don't think this is the world we live in. Both data efficiency and "wanting" to generalize suggest that "do what humans actually want" is part of a fairly simple space of natural generalizations, and it just won't take that many additional bits of information to pick it out from this space. There's still a challenge, since deceptive alignment and other thought experiments imply that we can't get these bits from direct supervision of the network's outputs. But I think there's a good chance we can get those bits from good interpretability tools---better tools than what we have now, but ones that are within reach.

I'm not arguing that AI safety is somehow trivial or easy---the concerns I discussed above are not exhaustive, and even those will take perhaps tens of thousands of researcher-hours to address. My point is that empirical trends give you problem structure that you can leverage, which often takes a problem from "intractable" to "only 300,000 hours of work".[1]

Empirical Generalization in Biology

I'm claiming that empirical findings generalize "surprisingly far", but there haven't been that many orders of magnitude to generalize across in machine learning so far. So let's look at biology, where there are many cases of findings generalizing all the way from "bacteria" to "humans".

The phage group. A prime example of this was the phage group, a group of researchers who made many of the founding contributions of molecular biology. They chose to study bacteriophages (viruses that attack bacteria) as the simplest possible model system. Most biologists at the time studied more complex organisms, and some were skeptical that phages were complex enough to yield meaningful insights.

The phage group essentially bet that studying viruses and bacteria would yield insights that generalized to more complex organisms. That bet paid off--among other things, they discovered:

Later discoveries based on bacteria also generalized to more complex organisms. For instance, Jacques Monod uncovered the structure of genetic regulatory networks by studying the lac operon in E. coli.

Now, one might object that biology is not a good analogy for machine learning, because all life shares the same genetic ancestry and thus has commonalities that neural networks will not. I have some sympathy for this point, but I think it understates how non-obvious it was that studying bacteriophages would be a good idea. Empirical trends generalize far because there is some mechanism that causes them to do so, but that mechanism is often only obvious in hindsight. We'll probably come up with similarly "obvious" explanations for trends in deep learning, but only after we discover them.

Moreover, shared genetic ancestry isn't actually enough to imply consistent trends. Regulatory networks work slightly differently in bacteria and humans, and some bacteria and viruses have circular rather than linear genomes. Nevertheless, most of the essential findings remain intact, even though bacteria have 1 cell each and humans have 30 trillion.

What About Superintelligence?

An argument I sometimes hear is that empirics won't help when dealing with deceptive AI that is much smarter than humans, because it might intentionally change its feature representations to thwart interpretability techniques, and otherwise intentionally obscure the results of empirical measurements.

I agree that if you had such an AI, you wouldn't be able to rely on empirical measurements. But if you had such an AI, I think you'd just be fundamentally screwed. When trying to solve the problem of deceptive AI, I view the main challenge as not getting to this point in the first place. In the language of deceptive alignment, you aren't trying to "fix" a deceptively aligned AI, you're trying to make sure the training dynamics steer you far away from ever getting one.

Overall, I think that both empirics and conceptual arguments will be necessary to make AI systems safe (both now and in the future). The Engineering and Philosophy mindsets are each perceiving a different piece of the elephant. I hope this series can help bridge these mindsets and move us towards a synthesis that is better prepared to answer the challenges of the future.


  1. A typical research group might put in 15,000 hours of work per year (7.5 full-time researchers × 2,000 hours), so 300,000 hours would be 4 research groups working full-time for 5 years.
