How should DeepMind's Chinchilla revise our AI forecasts?

[-]jacob_cannell3y197

The main update is to undermine confidence in the generality and utility of these 'Scaling Laws'. It's clear that the current LLM transformer recipe does not scale to AGI: it is vastly too data-inefficient. Human brains are a proof of concept that it's possible to train systems using orders of magnitude less data while simultaneously reaching higher levels of performance on the key downstream linguistic tasks.

Also there is now mounting evidence that these LLMs trained on internet scale data are memorizing all kinds of test sets for many downstream tasks, a problem which only gets worse as you try to feed them ever more training data.

[-]Noosphere893y10

Also there is now mounting evidence that these LLMs trained on internet scale data are memorizing all kinds of test sets for many downstream tasks, a problem which only gets worse as you try to feed them ever more training data.

Not really? If we assume that they just memorize data without having intelligence, then their memory requirements would scale as N parameters, when instead we see a smaller constant for compression, which essentially requires actual intelligence rather than simply memorizing all that data.

[-]jacob_cannell3y20

I didn't say they were simply memorizing; it's more complex than that. One would need to look at the parameter-scaling compression ratio vs. data similarity/repetition, and compare to simpler SOTA compressors. Regardless of whether it's 'true' memorization or not, exposure to downstream task test sets distorts evaluations (this is already a problem for humans, since many answers are available on the internet; it's just much more of a problem for an AI that actually digests the entire internet).

[-]ESRogs3y43

In a forthcoming report I will estimate how might change as $D$ increases. The report will enumerate different sources of text-based data (e.g. publicly-accessible internet text, private social media messages, human conversations, etc), and for each data-source the report will estimate the cost-per-token and the total availability of the data.

The analysis may be tricky to do, but I'd be particularly interested in seeing model-generated data included in this list. I suspect that in practice the way model-builders will get around the data limit is by generating (and curating) synthetic data.

(This doesn't have to involve the model just getting high on its own supply. If you build in an evaluation step before including generated data in the training set, then I'd bet you can effectively do AlphaZero-like IDA. I'm guessing that a lot of the action is going to be in figuring out how to set up the generation + evaluation algorithms.)

[-]ESRogs3y20

For instance, Facebook, Google, and Apple will be advantaged, whereas OpenAI, DeepMind, and EleutherAI will be disadvantaged.

Nitpick: Google owns DeepMind, so it doesn't seem right to list DM on the disadvantaged side.

[-]Cleo Nardo3y30

Google owns DeepMind, but it seems that there is little flow of information back and forth.

Example 1: Google Brain spent approximately $12M to train PaLM, and $9M was wasted on suboptimal training because DeepMind didn't share the Hoffmann2022 results with them.

Example 2: I'm not a lawyer, but I think it would be illegal for Google to share any of its non-public data with DeepMind.

[-]arabaga3y21

To be precise, Alphabet owns DeepMind. Google and DeepMind are sister companies.

So it's possible for something to benefit Google without benefiting DeepMind, or vice versa.

[-]Maybe_a3y10

In 7th footnote, should be 5e9, not 5e6 (doesn't seem to impact reasoning qualitatively).

[-]Felix Hofstätter3y10

Very interesting, after reading chinchilla's wild implications I was hoping someone would write something like this!

If I understand point 6 correctly, then you are proposing that Hoffmann's scaling laws lead to shorter timelines because data-efficiency can be improved algorithmically. To me it seems that it might just as well make timelines longer to depend on algorithmic innovations as opposed to the improvements in compute that would help increase parameters. It feels like there is more uncertainty about whether people will keep coming up with the novel ideas required to improve data efficiency than about whether the available compute will continue to increase in the near to mid-term future. If the available data really becomes exhausted within the next few years, then improving the quality of models will be more dependent on such novel ideas under Hoffmann's laws than under Kaplan's.

[-]Cleo Nardo3y42

To me it seems that it might just as well make timelines longer to depend on algorithmic innovations as opposed to the improvements in compute that would help increase parameters.

I'll give you an analogy:

Suppose your friend is running a marathon. You hear that at the halfway point she has a time of 1 hour 30 minutes. You think "okay I estimate she'll finish the race in 4 hours". Now you hear she has been running with her shoelaces untied. Should you increase or decrease your estimate?

Well, decrease. The time of 1:30 is more impressive if you learn her shoelaces were untied! It's plausible your friend will notice and tie up her shoelaces.

But note that if you didn't condition on the 1:30 information, then your estimate would increase if you learned her shoelaces were untied for the first half.

Now for Large Language Models:

Believing Kaplan's scaling laws, we figure that the performance of LLMs depends on $N$, the number of parameters. But maybe there's no room for improvement in $N$-efficiency. LLMs aren't much more $N$-inefficient than the human brain, which is our only reference point for general intelligence. So we expect little algorithmic innovation. LLMs will only improve because $N$ and $D$ grow.

On the other hand, believing Hoffmann's scaling laws, we figure that the performance of LLMs depends on $D$, the number of datapoints. But there is likely room for improvement in $D$-efficiency. The brain is far more $D$-efficient than LLMs. So LLMs have been metaphorically running with their shoelaces untied. There is room for improvement. So we're less surprised by algorithmic innovation. LLMs will still improve because $N$ and $D$ grow, but this isn't the only path.

So Hoffmann's scaling laws shorten our timeline estimates.

This is an important observation to grok. If you're already impressed by how an algorithm performs, and you learn that the algorithm has a flaw which would disadvantage it, then you should increase your estimate of future performance.

[-]Paradiddle3y21

This analogy is misleading because it pumps the intuition that we know how to generate the algorithmic innovations that would improve future performance, much as we know how to tie our shoelaces once we notice they are untied. This is not the case. Research programmes can and do stagnate for long periods because crucial insights are hard to come by and hard to implement correctly at scale. Predicting the timescale on which algorithmic innovations occur is a very different proposition from predicting the timescale on which it will be feasible to increase parameter count.

[-]Trinley Goldenberg3y20

This is an important observation to grok. If you're already impressed by how an algorithm performs, and you learn that the algorithm has a flaw which would disadvantage it, then you should increase your estimate of future performance.

It's not clear to me that this is the case. You have found both evidence that there are large increases available AND evidence that there is one less large increase than previously. It seems to depend on your priors which way you should update your expectation of finding similar increases in the future.

^[1]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.

^[2]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.

^[3]

Google Brain trained an LLM called "PaLM" with the following values:

$N = 5.40 e 11$
$D = 5.85 e 11$
$V = 23.5$

According to Eq 2, PaLM would've achieved a loss of 1.94. This is the same loss for a model with the following values:

$N = 4.53 e 10$
$D = 1.75 e 12$
$V = 22.9$

Note that $10^{22.9}$ is 25.1% of $10^{23.5}$ so 74.9% of the budget could've been saved.

Heim2022^[12] estimates that the cost of training PaLM lies between $9M and $23M.
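The numbers in this footnote can be rechecked with a short Python sketch. It uses the Eq 2 constants exactly as listed in this post and assumes, based on the values quoted here, that $V$ is $\log_{10}(N \times D)$. The function names `hoffmann_loss` and `log_budget` are illustrative and do not come from the post.

```python
import math

# Constants copied from Eq 2 as listed in this post.
E, A, ALPHA = 1.69, 403.4, 0.34
B, BETA = 410.7, 0.28


def hoffmann_loss(n_params: float, n_tokens: float) -> float:
    """Eq 2: predicted loss for N parameters and D training tokens."""
    return E + A * n_params ** -ALPHA + B * n_tokens ** -BETA


def log_budget(n_params: float, n_tokens: float) -> float:
    """Assumption: V = log10(N * D), which matches the values in the footnote."""
    return math.log10(n_params * n_tokens)


palm = (5.40e11, 5.85e11)         # (N, D) quoted for PaLM
alternative = (4.53e10, 1.75e12)  # (N, D) quoted for the cheaper model

for name, (n, d) in {"PaLM": palm, "alternative": alternative}.items():
    print(f"{name}: loss={hoffmann_loss(n, d):.2f}, V={log_budget(n, d):.1f}")

# Ratio of the two budgets, i.e. 10^(V_alternative - V_PaLM).
fraction = 10 ** (log_budget(*alternative) - log_budget(*palm))
print(f"budget fraction: {fraction:.1%}")
```

Under these assumptions both configurations round to a loss of 1.94, and the cheaper one uses about a quarter of the budget.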

^[4]

Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B Nanjajjar, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Jean-Stanislas Denain. Parameter, compute and data trends in machine learning, 2021a. URL https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/.

^[5]

Jaime Sevilla, Pablo Villalobos, and Juan Felipe Cerón. Parameter counts in machine learning. https://epochai.org/blog/parameter-counts, 2021b. Accessed: 2022-08-17

^[6]

Ajeya Cotra. Forecasting TAI with biological anchors, Sep 2020. URL https://www.lesswrong.com/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines.

^[7]

What are $D_{human}$ and $D_{evolution}$ ?

Here are some upper-bound approximations:

$D_{human}$ = 100 000 words-per-day $\times$ 2 tokens-per-word $\times$ 365 days-per-year $\times$ 50 years = 3.65e9 tokens

$D_{evolution}$ = 100 000 words-per-day $\times$ 2 tokens-per-word $\times$ 365 days-per-year $\times$ 2 million years = 1.46e14 tokens
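These upper bounds are plain products of the listed factors, so they are easy to recompute. A minimal Python sketch follows; the constant and function names are illustrative and do not come from the post.

```python
# Factors as listed in the footnote.
WORDS_PER_DAY = 100_000
TOKENS_PER_WORD = 2
DAYS_PER_YEAR = 365


def tokens_seen(years: int) -> int:
    """Upper bound on tokens encountered over the given number of years."""
    return WORDS_PER_DAY * TOKENS_PER_WORD * DAYS_PER_YEAR * years


d_human = tokens_seen(50)             # one human lifetime of 50 years
d_evolution = tokens_seen(2_000_000)  # 2 million years

print(f"D_human     = {d_human:.2e} tokens")
print(f"D_evolution = {d_evolution:.2e} tokens")
```

The two bounds differ only in the number of years, so their ratio is simply 2 million / 50 = 40 000.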

^[8]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2020. URL https://arxiv.org/abs/2009.03300.

^[9]

Jan H. Kirchner, Logan Smith, Jacques Thibodeau, Kyle McDonell, and Laria Reynolds. Researching alignment research: Unsupervised analysis, 2022. URL https://arxiv.org/abs/2206.02841.

^[10]

nostalgebraist. chinchilla’s wild implications, Jul 2022. URL https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinchilla-s-wild-implications.

^[11]

I remember reading this point in a LessWrong comment, but I can't manage to find it.

^[12]

Lennart Heim. Estimating palm’s training cost, 2022. URL https://blog.heim.xyz/palm-training-cost/.

Eq 2	$L_{H} (N, D) = 1.69 + 403.4 \times N^{- 0.34} + 410.7 \times D^{- 0.28}$
Eq 3	$log N_{H ⋆} = 0.46 \times V + 0.123$
Eq 4	$log D_{H ⋆} = 0.54 \times V - 0.123$

Eq 5	$L_{K} (N, D) = {(1.54 e 10 \times N^{- 0.738} + 1.8 e 13 \times D^{- 1})}^{0.103}$
Eq 6	$log N_{K ⋆} = 0.73 \times V - 5.06$
Eq 7	$log D_{K ⋆} = 0.27 \times V + 5.06$

Eq 8	$V = 0.834 \times T + 18.4$
Eq 9	$log N = 0.479 \times T + 8.67$
Eq 10	$log D = 0.356 \times T + 9.88$

Eq 11	$log N_{K ⋆} = 0.73 \times V - 5.06$
	$= 0.73 \times (0.834 \times T + 18.40) - 5.06$
	$= 0.609 \times T + 8.37$
Eq 12	$log D_{K ⋆} = 0.27 \times V + 5.06$
	$= 0.27 \times (0.834 \times T + 18.40) + 5.06$
	$= 0.225 \times T + 10.0$

Eq 13	$log N_{H ⋆} = 0.46 \times V + 0.123$
	$= 0.46 \times (0.834 \times T + 18.40) + 0.123$
	$= 0.384 \times T + 8.59$
Eq 14	$log D_{H ⋆} = 0.54 \times V - 0.123$
	$= 0.54 \times (0.834 \times T + 18.40) - 0.123$
	$= 0.450 \times T + 9.813$
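Eqs 11 to 14 all follow one pattern: Eq 8 is substituted into a line of the form $log X = a \times V + b$, which gives a new slope $a \times 0.834$ and a new intercept $a \times 18.40 + b$. A minimal Python sketch of that substitution is below; the helper name `compose` is hypothetical, and the defaults are the Eq 8 coefficients as listed here.

```python
def compose(a: float, b: float, slope_t: float = 0.834, intercept_t: float = 18.40):
    """Substitute V = slope_t * T + intercept_t into log X = a * V + b.

    Returns (slope, intercept) of the resulting line in T.
    """
    return a * slope_t, a * intercept_t + b


# Hoffmann-optimal parameters and data (Eq 3 and Eq 4 coefficients).
n_slope, n_intercept = compose(0.46, 0.123)
d_slope, d_intercept = compose(0.54, -0.123)

print(f"log N = {n_slope:.3f} * T + {n_intercept:.2f}")
print(f"log D = {d_slope:.3f} * T + {d_intercept:.3f}")
```

The same call with the Eq 6 and Eq 7 coefficients, `compose(0.73, -5.06)` and `compose(0.27, 5.06)`, handles the Kaplan case.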

