In a previous post I predicted that machine learning is limited by the algorithms it uses (as opposed to hardware or data). This has two big implications:

- Scaling up existing systems will not render human thinking obsolete.
- Replacing the multilayer perceptron with a fundamentally different algorithm could disrupt[1] the entire AI industry.

Today's artificial neural networks (ANNs) require far more data to learn a pattern than human beings do. This is called the **sample efficiency problem**. Some people think the sample efficiency problem can be solved by pouring more data and compute into existing architectures. Other people (myself included) believe the sample efficiency problem must be solved algorithmically. In particular, I think the sample efficiency problem is related to the extrapolation problem.

# Extrapolation and Interpolation

Machine learning models are trained from a dataset. We can think of this dataset as a set of points in vector space. Our training data is **bounded** which means it's confined to a finite region of vector space. This region is called the **interpolation zone**. The space outside the interpolation zone is called the **extrapolation zone**.

Consider an illustration from this paper. Ground truth is black. Training data is green. The output of an ANN trained on that data is in blue.

A human being could look at the green training data and instantly deduce the pattern in black. The ANN performs great in the interpolation zone and completely breaks down in the extrapolation zone. The ANNs we use today can't generalize to the extrapolation zone.
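The breakdown described above is easy to reproduce in miniature. The following is a minimal sketch, not the setup from the paper: it assumes a one-hidden-layer ReLU network whose hidden units are fixed on an even grid and whose output layer is fitted by least squares (a stand-in for gradient training, chosen so the result is deterministic). The target `sin(x)`, the training range `[-pi, pi]`, and the names `relu_features` and `fit_and_score` are all illustrative assumptions.

```python
import numpy as np

def relu_features(x, knots):
    """Hidden layer: one ReLU unit per knot, plus bias and linear terms."""
    x = np.asarray(x, dtype=float)
    hidden = np.maximum(0.0, x[:, None] - knots[None, :])
    return np.column_stack([np.ones_like(x), x, hidden])

def fit_and_score(n_hidden=30, n_train=200):
    """Fit sin(x) on [-pi, pi]; return (MSE inside, MSE outside) that range."""
    lo, hi = -np.pi, np.pi                            # interpolation zone
    knots = np.linspace(lo, hi, n_hidden + 2)[1:-1]   # ReLU kinks strictly inside it
    x_train = np.linspace(lo, hi, n_train)
    # Output-layer weights by least squares (assumption: no gradient descent).
    weights, *_ = np.linalg.lstsq(
        relu_features(x_train, knots), np.sin(x_train), rcond=None
    )

    def mse(x):
        pred = relu_features(x, knots) @ weights
        return float(np.mean((pred - np.sin(x)) ** 2))

    x_inside = np.linspace(lo, hi, 777)        # held-out points in the same region
    x_outside = np.linspace(hi, 3 * hi, 777)   # one full period beyond the data
    return mse(x_inside), mse(x_outside)

if __name__ == "__main__":
    inside, outside = fit_and_score()
    print(f"interpolation MSE: {inside:.2e}")
    print(f"extrapolation MSE: {outside:.2e}")
```

Past the last kink every ReLU unit is active, so this model is a straight line in the extrapolation zone. A straight line cannot track a sine wave over a full period, which is why the error outside the training range is orders of magnitude larger than the error inside it.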

One way around this problem is to paper over it with big data. I agree that big data is a practical solution to many real-world problems. However, I do not believe it is a practical solution to all real-world problems. An algorithm that can't extrapolate is an algorithm that can't extrapolate. Period. No amount of data or compute you feed it will get it to extrapolate. Adding data just widens the interpolation zone. If a problem requires extrapolation then an ANN can't do it, at least if you're using today's methods.

Under this definition of extrapolation, GPT-3 is bad at extrapolation. An example of extrapolation would be if GPT-N could invent a technology using only training data (including prompts) from before that technology was invented. For example, if GPT-N could produce schematics for a transistor using only training data from 1930 or earlier then my prediction would be falsified. This is theoretically possible because the field-effect transistor was proposed in 1926 and the Schrödinger equation was published in 1926. Feeding GPT-N the correct values of universal constants is allowed. GPT-N would have access to much more compute than was available when the transistor was actually invented in 1947. It'd just be a matter of doing the logic and math, which I predict GPT-N will be forever incapable of.

Other architectures exist which are better at extrapolation.

# Neural Ordinary Differential Equations (ODEs)

My predictions on AI timelines were shortened by *Neural Ordinary Differential Equations*. The math is clever and elegant but what's really important is what it lets you do. Neural ODEs can extrapolate.
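To make the intuition concrete, here is a toy sketch of why learning dynamics, rather than learning the curve itself, can extrapolate in time. It is not the method from the Neural ODE paper: a 2x2 matrix `A` stands in for the neural network, it is fitted by least squares on finite-difference derivatives instead of by backpropagating through a solver, and the data is assumed to come from a system that really is linear (a point moving around the unit circle). All names (`fit_linear_dynamics`, `rk4_rollout`, `extrapolation_error`) are hypothetical.

```python
import numpy as np

TRAIN_END = 2 * np.pi   # training data covers t in [0, 2*pi], one period

def fit_linear_dynamics(n_train=400):
    """Least-squares fit of A in dz/dt = A z from one observed period."""
    t = np.linspace(0.0, TRAIN_END, n_train)
    z = np.column_stack([np.cos(t), np.sin(t)])     # observed trajectory
    dz = np.gradient(z, t, axis=0, edge_order=2)    # finite-difference dz/dt
    a_transposed, *_ = np.linalg.lstsq(z, dz, rcond=None)  # solves z @ A.T ~ dz
    return a_transposed.T

def rk4_rollout(A, z0, t_grid):
    """Integrate dz/dt = A z over t_grid with classical fourth-order Runge-Kutta."""
    states = [np.asarray(z0, dtype=float)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h, z = t1 - t0, states[-1]
        k1 = A @ z
        k2 = A @ (z + 0.5 * h * k1)
        k3 = A @ (z + 0.5 * h * k2)
        k4 = A @ (z + h * k3)
        states.append(z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(states)

def extrapolation_error(A, t_end=6 * np.pi, n_steps=1200):
    """Largest absolute error at times beyond the training window."""
    t = np.linspace(0.0, t_end, n_steps)
    pred = rk4_rollout(A, np.array([1.0, 0.0]), t)
    truth = np.column_stack([np.cos(t), np.sin(t)])
    outside = t > TRAIN_END
    return float(np.max(np.abs(pred[outside] - truth[outside])))

if __name__ == "__main__":
    A = fit_linear_dynamics()
    print("learned A:\n", A)
    print("max error on two unseen periods:", extrapolation_error(A))
```

The rollout stays close to the true trajectory for two periods it never saw, because what was learned is the rule that generates the curve. The caveat is that the model class here contains the true dynamics by construction; whether a neural vector field picks up the right rule on real data is a separate question this sketch does not answer.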

The Neural ODE paper was published in 2019. I predict the first person to successfully apply this technique to quantitative finance will become obscenely rich. If you become a billionaire from reading this article, please invite me to some of your parties.

[1] I'm using "disrupt" the way Clayton Christensen does in his 1997 book *The Innovator's Dilemma*.

Your point about neural nets NEVER being able to extrapolate is wrong. NNs are universal function approximators. A sufficiently large NN with the right weights can therefore approximate the “extrapolation” function (or even approximate whatever extrapolative model you’re training in place of an NN). I usually wouldn’t bring up this sort of objection since actually learning the right weights is not guaranteed to be feasible with any algorithm and is pretty much assuming away the entire problem, but you said “An algorithm that can't extrapolate is an algorithm that can't extrapolate. Period.”, so I think it’s warranted.

Also, I’m pretty sure “extrapolation” is essentially Bayesian:

There’s nothing in there that NNs are fundamentally incapable of doing.

Finally, your reference to neural ODEs would be more convincing if you’d shown them reaching state-of-the-art results in benchmarks after 2019. There are plenty of methods that do better than NNs when there’s limited data. The reason NNs remain so dominant is that they keep delivering better results as we throw more and more data at them.

I'm pretty sure this is wrong. The universal approximator theorems I've seen work by interpolating the function they are fitting; this doesn't guarantee that they can extrapolate.

In fact it seems to me that a universal approximator theorem can never show that a network is capable of extrapolating? Because the universal approximation has to hold for all possible functions, while extrapolation inherently involves guessing a specific function.
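For reference, one common form of the universal approximation statement can be sketched as follows. The symbols $K$, $f$, $g_\theta$ and $\varepsilon$ are notation chosen here, not taken from the thread, and the exact conditions on the activation function vary between versions of the theorem. For every compact set $K \subset \mathbb{R}^n$, every continuous $f : K \to \mathbb{R}$ and every $\varepsilon > 0$, there is a feedforward network $g_\theta$ with enough hidden units such that

$$
\sup_{x \in K} \left| f(x) - g_\theta(x) \right| < \varepsilon .
$$

The supremum runs over $K$ only. Nothing in the statement constrains $g_\theta(x)$ for $x \notin K$, so two networks that both satisfy the bound can disagree arbitrarily outside $K$.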

The post isn't talking about fundamental capabilities, it is talking about "today's methods".

I was thinking of the NN approximating the “extrapolate” function itself, that thing which takes in partial data and generates an extrapolation. That function is, by assumption, capable of extrapolation from incomplete data. Therefore, I expect a sufficiently precise approximation to that function is able to extrapolate.

It may be helpful to argue from Turing completeness instead. Transformers are Turing complete. If “extrapolation” is Turing computable, then there’s a transformer that implements extrapolation.

Also, the post is talking about what NNs are *ever* able to do, not just what they can do now. That’s why I thought it was appropriate to bring up theoretical computer science.

I'm implying that the popular ANNs in use today have bad priors. Are you implying that a sufficiently large ANN has good priors or that it can learn good priors?

That they can learn good priors. That’s pretty much what I think happens with pretraining. Learn the prior distribution of data in a domain, then you can adapt that knowledge to many downstream tasks.

Also, I don’t think built-in priors are that helpful. CNNs have a strong locality prior, while transformers don’t. You’d think that would make CNNs much better at image processing. After all, a transformer’s prior is that the pixel at (0,0) is just as likely to relate to the pixel at (512, 512) as it is to relate to its neighboring pixel at (0,1). However, experiments have shown that transformers are competitive with state of the art, highly tuned CNNs. (here, here)

ML prof here.

Universal fn approx assumes bounded domain, which basically means it is about interpolation.

turns out DNNs also can't necessarily interpolate properly, even:

https://arxiv.org/pdf/2107.08221.pdf

What do you think of the point I brought up a few days ago about things like AlphaZero having infinitely large sample efficiency with respect to external data? If amplification improves datasets, it's not crucial for learning itself to improve at all. From this point of view, the examples of extrapolation you gave are just salient hypotheses already in the model, a matter of training data. You are familiar with sines and spirals, so you can fit them. If amplification lets an AI play with math and plots, while following aesthetic sense already in the underlying human (language) model, these hypotheses are eventually going to be noticed.

I've recently started thinking that extrapolation and things like consequentialism/agency should be thought of as unsafe amplifications, those that generate data out of distribution without due scrutiny by existing sensibilities of the system (which isn't possible for things too far out of distribution). So a possible AI safety maxim is to avoid extrapolation and optimization: the only optimization the system is allowed is that of learning models from data, and it should learn only from safe datasets (external datasets might become safer after augmentation by the system that adds in its attitude to what it sees). Instead it should work on expanding the distribution "at the boundary" with amplifications heavy on channeling existing attitudes and light on optimization pressure, such as short sessions of reflection.