Small Data

lsusr

Probabilistic reasoning starts with priors and then updates them based off of evidence. Artificial neural networks take this to the extreme. You start with deliberately weak priors, then update them with a tremendous quantity of data. I call this "big data".

In this article, I use "big data" to mean the opposite of "small data". By this, "big data" refers to situations with so much training data you can get away with weak priors. Autonomous cars are an example of big data. Financial derivatives trading is an example of small data.

The most powerful recent advances in machine learning, such as neural networks, all use big data. Machine learning is good at fields where data is plentiful, such as in identifying photos of cats, or where data can be cheaply manufactured, such as in playing videogames. "Plentiful data" is a relative term. Specifically, it's a measurement of the quantity of training data relative to the size (complexity) of the search space.

Do you see the problem?

Physical reality is an upper bound on data collection. Even if "data" is just a number stored momentarily on a CPU's register, there is a hard physical limit to how much we can process. In particular, our data will never scale faster than $O (x^{4})$, where $x$ is the diameter of our computer in its greatest spacetime dimension. $O (x^{4})$ is polynomial growth.

Machine learning search spaces are often exponential or hyperexponential. If your search space is exponential and you collect data polynomially then your data is sparse. When you have sparse data, you must compensate with strong priors. Big data uses weak priors. Therefore big data approaches to machine learning cannot, in general, handle small data.
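A rough numerical sketch of that gap. It assumes, purely for illustration, a search space of $2^{d}$ hypotheses and a data budget of $d^{4}$ samples. The base 2 and the shared variable `d` are assumptions made for the sketch, not part of the argument above.

```python
def coverage(d: int) -> float:
    """Upper bound on the fraction of a 2**d search space that d**4 samples can touch."""
    return d**4 / 2**d


# Hypothetical sizes; the point is only how fast the ratio collapses.
for d in (16, 32, 64, 128):
    print(f"d={d:4d}  samples={d**4:.3e}  space={2**d:.3e}  coverage={coverage(d):.3e}")
```

Under these assumptions the polynomial budget keeps up only for tiny `d`. After that, every doubling of `d` makes the data many orders of magnitude sparser.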

Statistical Bias

Past performance is no guarantee of future results.

Suppose you want to estimate the mean variance $σ^{2}$ of a Gaussian distribution. You could sample $n$ points and then compute the mean variance of them.

$σ = \sqrt{\frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}{n}}$

If you did you'd be wrong. In particular, you'd underestimate the mean variance by a factor of $\frac{n}{n - 1}$ . The equation for standard deviation $s$ corrects for this and uses $n - 1$ in the denominator.

$s = \sqrt{\frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}}$

An estimate of the variance of a Gaussian distribution based solely on historical data, without adjusting for statistical bias, will underestimate the mean variance of the underlying distribution.

$s^{2} = \frac{n}{n - 1} σ^{2}$

Underestimating mean variance by a factor of $\frac{n}{n - 1}$ can be solved by throwing training data at the problem because a factor of $\frac{n}{n - 1}$ vanishes as $n$ approaches infinity. Other learning environments are not so kind.
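A minimal simulation of this bias, assuming a standard Gaussian with true variance 1. The function name, trial count and seed are arbitrary choices made for the sketch.

```python
import random


def mean_variance_estimates(n, trials=20000, sigma=1.0, seed=0):
    """Average the n-denominator and (n - 1)-denominator variance estimates over many samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        squared_deviations = sum((x - mean) ** 2 for x in sample)
        biased_total += squared_deviations / n          # uses n
        unbiased_total += squared_deviations / (n - 1)  # uses n - 1
    return biased_total / trials, unbiased_total / trials


for n in (2, 5, 50):
    biased, unbiased = mean_variance_estimates(n)
    print(f"n={n:3d}  biased={biased:.3f}  unbiased={unbiased:.3f}  (n-1)/n={(n - 1) / n:.3f}")
```

The biased column should sit near $\frac{n - 1}{n}$ while the unbiased column sits near 1, and the two columns draw together as $n$ grows.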

Divergent Series

Big data uses weak priors. Correcting for bias is a prior. Big data approaches to machine learning therefore have no built-in method of correcting for bias^[1]. Big data thus assumes that historical data is representative of future data.

To state this more precisely, suppose that we are dealing with a variable $x_{t}$ where $t \in Z^{+} = \{1, 2, 3, \dots\}$. In order to predict $x = \lim_{t \to \infty} x_{t}$ from past performance $\frac{\sum_{t = 1}^{n} x_{t}}{n}$, it must be true that such a limit $\lim_{t \to \infty} x_{t}$ exists.

Sometimes no such limit exists. Suppose $x_{t}$ equals 1 for all positive integers whose most significant digit (in decimal representation) is odd and 0 for all positive integers whose most significant digit (in decimal representation) is even.

$x_{t} = \begin{cases} 1 & \text{if } MSD (t) \notin 2 Z^{+} \\ 0 & \text{if } MSD (t) \in 2 Z^{+} \end{cases}$

Suppose we want to predict the probability that an integer's first significant digit is odd.

The average $\frac{\sum_{t = 1}^{n} x_{t}}{n}$ never converges as $n \to \infty$. It oscillates from ½ up to just over ¾ and back. You cannot solve this problem by minimizing your error over historical data. Insofar as big data minimizes an algorithm's error over historical results, domains like this will be forever out-of-bounds to it.
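The oscillation is easy to check numerically. A short sketch, with function names chosen for illustration:

```python
def leading_digit_is_odd(t: int) -> int:
    """x_t: 1 if the most significant decimal digit of t is odd, else 0."""
    return int(str(t)[0]) % 2


def running_average(n: int) -> float:
    """Average of x_t over t = 1..n."""
    return sum(leading_digit_is_odd(t) for t in range(1, n + 1)) / n


for n in (9, 19, 99, 199, 999, 1999, 8999, 9999, 19999, 89999):
    print(f"n={n:6d}  average={running_average(n):.4f}")
```

The printed averages keep swinging between roughly 0.51 and 0.79 no matter how far out the checkpoints go, so no amount of extra history pins down a single value.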

Big data compensates for weak priors by minimizing an algorithm's error over historical results. Insofar as this is true, big data cannot reason about small data.

Small Data

Yet, human beings can predict once-per-century events. Few of us can do it, but it can be done. How?

Transfer learning. Human beings use a problem's context to influence our priors.

So can we just feed all of the Internet into a big data system to create a general-purpose machine learning algorithm? No. Because when you feed in arbitrary data it's not just the data that increases in dimensionality. Your search space of relationships between input data increases even faster. Whenever a human being decides what data to feed into an artificial neural network, we are implicitly passing on our own priors about what constitutes relevant context. This division of labor between human and machine has enabled recent developments in machine learning like self-driving cars.

To remove the human from the equation, we need a system that can accept arbitrary input data without human curation for relevance. The problem is that feeding "everything" into a machine is close to feeding "nothing" into a machine, like how a fully connected graph contains exactly as much information as a fully disconnected graph.

Similar, but not equal. Consider Einstein. He saw beauty in the universe and then created the most beautiful theory that fit a particular set of data.

Beauty

Consider the sequence ${1, 2, 3, \dots}$ . What comes next?

It could be ${1, 2, 3, 1, 2, 3, 1, 2, 3, \dots}$
It could be ${1, 2, 3, 4, 5, 6, 7, 8, 9, \dots}$
It could be ${1, 2, 3, 5, 8, 13, 21, 34, 55, \dots}$
It could be ${1, 2, 3, 4, \text{I}, \text{declare}, \text{a}, \text{thumb}, \text{war}, 5, \dots}$

You could say the answer^[2] depends on one's priors. That wouldn't be wrong per se. But the word "priors" gets fuzzy around the corners when we're talking about transfer learning. It would be more precise to say this depends on your sense of "beauty".

The "right" answer is whichever one has minimal Kolmogorov complexity, i.e. whichever sequence is described by the shortest computer program. But for sparse data, Kolmogorov complexity depends more on your choice of programming language than the actual data. It depends on the sense of beauty of whoever designed your development environment.

The most important thing in a programming language is what libraries you have access to. If the Fibonacci sequence is a standard library function and the identity operator is not then the Fibonacci sequence has lower Kolmogorov complexity than the identity operator $y = x \forall x, y \in Z^{+}$ .

The library doesn't even have to be standard. Any scrap of code lying around will do. In this way, Kolmogorov complexity, as evaluated in your local environment, is a subjective definition of beauty.
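A toy sketch of that subjectivity. True Kolmogorov complexity is uncomputable, so this sketch substitutes a crude proxy: the "program length" of a sequence is the length of the shortest library name that reproduces it. The two environments and all names in them are hypothetical.

```python
from itertools import count, islice


def naturals():
    """1, 2, 3, 4, 5, ..."""
    return count(1)


def fibonacci():
    """1, 2, 3, 5, 8, ... (Fibonacci numbers starting from 1, 2)."""
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b


def simplest_continuation(prefix, library, n_next=3):
    """Pick the shortest-named library generator consistent with prefix and extend it."""
    matches = [
        (len(name), name)
        for name, make in library.items()
        if list(islice(make(), len(prefix))) == list(prefix)
    ]
    if not matches:
        return None
    _, name = min(matches)
    stream = library[name]()
    return name, list(islice(stream, len(prefix), len(prefix) + n_next))


# Same generators, different naming conventions (stand-ins for different environments).
env_a = {"nat": naturals, "fibonacci_sequence": fibonacci}
env_b = {"fib": fibonacci, "natural_numbers": naturals}

print(simplest_continuation([1, 2, 3], env_a))  # ('nat', [4, 5, 6])
print(simplest_continuation([1, 2, 3], env_b))  # ('fib', [5, 8, 13])
```

The three data points are identical in both runs. Only the environment changed, and with it the "most beautiful" continuation.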

This is a flexible definition of "beauty", as opposed to big data where "beauty" is hard-coded as the minimization of an error function over historical data.

Programming languages like Lisp let you program the language itself. System-defined macros are stored in the same hash table as user-defined macros. A small data system needs the same capability.

No algorithm without the freedom to self-alter its own error function can operate unsupervised on small data.

―Lsusr's Second Law of Artificial Intelligence

To transcend big data, a computer program must be able to alter its own definition of beauty.

[1] Cross-validation corrects for overfitting. Cross-validation cannot fully eliminate statistical bias because the train and test datasets both constitute "historical data".
[2] The answer is ${1, 2, 3, 4, - 1, \frac{1}{2}, 2 π, \infty, \text{“a single noncomputable number”}, \dots}$ .

“big data” refers to situations with so much training data you can get away with weak priors. [...] The most powerful recent advances in machine learning, such as neural networks, all use big data.

This is only partially true. Consider some image classification dataset, say MNIST or CIFAR10 or ImageNet. Consider some convolutional relu network architecture, say, conv2d -> relu -> conv2d -> relu -> conv2d -> relu -> conv2d -> relu -> fullyconnected with some chosen kernel sizes and numbers of channels. Consider some configuration of its weights $W_{CNN}$ . Now consider the multilayer perceptron architecture fullyconnected -> relu -> fullyconnected -> relu -> fullyconnected -> relu -> fullyconnected -> relu -> fullyconnected. Clearly, there exist hyperparameters of the multilayer perceptron (numbers of neurons in hidden layers) such that there exists a configuration $W_{M L P}$ of weights of the multilayer perceptron, such that the function implemented by the multilayer perceptron with $W_{M L P}$ is the same function as the function implemented by the convolutional architecture with $W_{CNN}$ . Therefore, the space of functions which can be implemented by the convolutional neural network (with fixed kernel sizes and channel counts) is a subset of the space of functions which can be implemented by the multilayer perceptron (with correctly chosen numbers of neurons). Therefore, training the convolutional relu network is updating on evidence and having a relatively strong prior, while training the multilayer perceptron is updating on evidence and having a relatively weak prior.
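The inclusion can be made concrete in a stripped-down setting. The sketch below assumes a single-channel 1D convolution with stride 1, no padding and no bias (all simplifications chosen for illustration) and builds the fully connected weight matrix that computes the same function.

```python
def conv1d_valid(x, kernel):
    """Single-channel 1D convolution (cross-correlation), stride 1, no padding."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv_as_dense_matrix(kernel, input_len):
    """Fully connected weight matrix computing the same map as conv1d_valid."""
    k = len(kernel)
    out_len = input_len - k + 1
    weights = [[0.0] * input_len for _ in range(out_len)]
    for i in range(out_len):
        for j in range(k):
            weights[i][i + j] = kernel[j]  # shared kernel weights, zeros everywhere else
    return weights


def dense(weights, x):
    """Plain fully connected layer without bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


x = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
print(conv1d_valid(x, kernel))
print(dense(conv_as_dense_matrix(kernel, len(x)), x))
```

Both calls print the same values. The dense layer has `out_len * input_len` free weights, while the convolution forces most of them to zero and ties the rest together. That constraint is the stronger prior.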

Experimentally, if you train the networks described above, the convolutional relu network will learn to classify images well, or at least okay-ish. The multilayer perceptron will not learn to classify images well; its accuracy will be much worse. Therefore, the data is not enough to wash away the multilayer perceptron's prior, hence by your definition it can't be called big data. Here I must note that ImageNet is the biggest publicly available dataset for training image classification, so if anything is big data, it should be.

Big data uses weak priors. Correcting for bias is a prior. Big data approaches to machine learning therefore have no built-in method of correcting for bias.

This looks like a formal argument, a demonstration or dialectics as Bacon would call it, which uses shabby definitions. I disagree with the conclusion, i.e. with the statement "modern machine learning approaches have no built-in method of correcting for bias". I think in modern machine learning people are experimenting with various inductive biases and various ad-hoc fixes or techniques which help correct for all kinds of biases.

In your example with a non-converging sequence, I think you have a typo - there should be $MSD (t)$ rather than $MSD (x_{t})$ .

I think in modern machine learning people are experimenting with various inductive biases and various ad-hoc fixes or techniques which help correct for all kinds of biases.

The conclusion of my post is that these fixes and techniques are ad-hoc because they are written by the programmer, not by the ML system itself. In other words, the creation of ad-hoc fixes and techniques is not automated.

For the longest time, I would have used the convolutional architecture as an example of one of the few human-engineered priors that was still necessary in large scale machine learning tasks.

But in 2021, the Vision Transformer paper included the following excerpt: "When trained on mid-sized datasets such as ImageNet without strong regularization, these models yield modest accuracies of a few percentage points below ResNets of comparable size. This seemingly discouraging outcome may be expected: Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias."

Taking the above as a given is to say, maybe ImageNet really just wasn't big enough, despite it being the biggest publicly available dataset around at the time.

In your example with a non-converging sequence, I think you have a typo - there should be $MSD (t)$ rather than $MSD (x_{t})$ .

Fixed. Thank you for the correction.

I’m getting more and more interested in this sequence. Very applicable to what I’m doing: analyzing market data.

As we know and you mentioned, humans do learn from small data. We start with priors that are hopefully not too strong and go through the known processes of scientific discovery. NN do not have that meta process or any introspection (yet).

"You cannot solve this problem by minimizing your error over historical data. Insofar as big data minimizes an algorithm's error over historical results ... Big data compensates for weak priors by minimizing an algorithm's error over historical results. Insofar as this is true, big data cannot reason about small data."

NN also do not reduce/idealize/simplify, explicitly generalize and then run the results as hypothesis forks. Or use priors to run checks (BS rejection / specificity). We do.

Maybe there will be an evolutionary process where huge NNs are reduced to do inference at "the edge" that turns into human-like learning after feedback from "the edge" is used to select and refine the best nets.

NN also do not reduce/idealize/simplify, explicitly generalize and then run the results as hypothesis forks. Or use priors to run checks (BS rejection / specificity). We do.

This is very important. I plan to follow up with another post about the necessary role of hypothesis amplification in AGI.

Edit: Done.

I came across this:

The New Dawn of AI: Federated Learning

" This [edge] update is then averaged with other user updates to improve the shared model."

I do not know how that is meant but when I hear the word "average" my alarms always sound.

Instead of a shared NN each device should get multiple slightly different NNs/weights and report back which set was worst/unfit and which best/fittest.
Each set/model is a hypothesis and the test in the world is an evolutionary/democratic falsification.
Those mutants who fail to satisfy the most customers are dropped.

NNs are a big data approach, tuned by gradient descent. Because NNs are a big data approach, every update is necessarily small (in the mathematical sense of first-order approximations). When updates are small like this, averaging is fine. Especially considering how most neural networks use sigmoid activation functions.

While this averaging approach can't solve small data problems, it is perfectly suitable to today's NN applications where things tend to be well-contained, without fat tails. This approach works fine within the traditional problem domain of neural networks.

Would it be a reasonable interpretation here to read "beauty" as in some sense the inverse of Shannon entropy?

Kolmogorov complexity depends more on your choice of programming language than the actual data.

This is an informal statement, but my interpretation of it renders it false. The choice of programming language affects the Kolmogorov complexity up to an additive constant. The actual data can have an arbitrarily large effect!
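For reference, the standard result behind the additive-constant claim is the invariance theorem. For any two universal description languages $U$ and $V$ there is a constant $c_{U, V}$, independent of the string $x$, such that

$| K_{U} (x) - K_{V} (x) | \le c_{U, V}$

The constant is roughly the length of an interpreter for one language written in the other. It does not grow with the data, but for very short strings it can still be larger than either complexity.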

oh, this made me realize that sample complexity reduction is fundamentally forward looking, as the lossy compression is a tacit set of predictions about what is worthwhile. i.e. guesses about sensitivity of your model. Interesting.
