Review

The amount of compute required to emulate the human brain depends on the level of detail we want to emulate. 

Back in 2008, Sandberg and Bostrom proposed the following values:

Level of emulation detail                   FLOPS required to run the brain emulation in real time
Analog network population model             10^15
Spiking neural network                      10^18
Electrophysiology                           10^22
Metabolome                                  10^25
Proteome                                    10^26
States of protein complexes                 10^27
Distribution of protein complexes           10^30
Stochastic behavior of single molecules     10^43

Today I've encountered an interesting piece of data on GPT-3 (source):

  • GPT-3 required ~10^15 FLOPS for inference.
  • It required ~10^23 FLOPS to train it [Note: the training took some months. It would require ~10^30 FLOPS to train it from zero in one second]

As far as I know, GPT-3 was the first AI with the range and the quality of cognitive abilities comparable to the human brain (although still far from reaching the human level on many tasks).

Coincidentally(?), GPT-3 requires 10^15 - 10^30 FLOPS to operate at the brain's speed, which is roughly the same amount of compute necessary to run a decent emulation of the human brain.

The range of possible compute is almost infinite (e.g. 10^100 FLOPS and beyond). Yet both intelligences are in the same relatively narrow range of 10^15 - 10^30 (assuming the human brain emulation doesn't need to be nano-level detailed). 

Is it a coincidence, or is there something deeper going on here?

This could be important both for understanding the human brain and for predicting how far we are from true AGI.

2 Answers

paulfchristiano

GPT-3 is about 2e11 parameters and uses about 4 flops per parameter per token, so about 1e12 flops per token.

If a human writes at 1 token per second, then you should be comparing 1e12 flops to the cost per second. I think you are implicitly comparing to the cost for a ~1000 token context?

I think 1e14 to 1e15 flops is a plausible estimate for the productive computation done by a human brain in a second, which is about 2-3 orders of magnitude beyond GPT-3.
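The arithmetic in this answer can be checked with a rough sketch. The parameter count, the 4 flops per parameter per token, the 1 token per second writing speed, the ~1000 token context, and the 1e14 to 1e15 brain range are all taken from the answer itself; the variable names are only illustrative.

```python
import math

# Figures quoted in the answer above (order-of-magnitude only).
PARAMS = 2e11                      # GPT-3 parameter count, as rounded in the answer
FLOPS_PER_PARAM_PER_TOKEN = 4      # the answer's rule of thumb
TOKENS_PER_SECOND = 1              # assumed human writing speed from the answer
CONTEXT_TOKENS = 1000              # the "~1000 token context" guess from the answer
BRAIN_LOW, BRAIN_HIGH = 1e14, 1e15 # "productive" brain computation per second

flops_per_token = PARAMS * FLOPS_PER_PARAM_PER_TOKEN
gpt3_flops_per_second = flops_per_token * TOKENS_PER_SECOND
context_cost = flops_per_token * CONTEXT_TOKENS

def oom_gap(a, b):
    """Orders of magnitude by which a exceeds b."""
    return math.log10(a / b)

gap_low = oom_gap(BRAIN_LOW, gpt3_flops_per_second)
gap_high = oom_gap(BRAIN_HIGH, gpt3_flops_per_second)

print(f"flops per token:          {flops_per_token:.0e}")
print(f"flops per 1000 tokens:    {context_cost:.0e}")
print(f"brain vs GPT-3 gap (OOM): {gap_low:.1f} to {gap_high:.1f}")
```

Under these inputs the per-token cost comes out just under 1e12, the 1000-token cost lands near the ~10^15 figure from the question, and the brain sits roughly 2 to 3 orders of magnitude above GPT-3 at one token per second.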

I think this is not really a coincidence. GPT-3 is notable because it's starting to exhibit human-like abilities. It's not super surprising that this should happen around human levels of compute, and I would personally expect the trend to continue as we scale up towards human-level compute and continue improving deep learning efficiency. (I gave this about 50% probability in 2017 before seeing GPT-2, but I've updated significantly in favor over the last 6 years.)

More generally, I think the numbers in your post are wrong and the discussion is somewhat confused. 1e15 to 1e30 is not a narrow interval, I don't think you should compare training costs to inference costs, 1e30 is not the training cost of GPT-3, you should probably compare to brain compute estimates like this one rather than brain emulation estimates...

But I think it's reasonable to step back and say that compared to what you might have expected, biological anchors have been a pretty good guide to ML progress. They are losing usefulness now since at best they have like 10 years of resolution and eyeballing is getting easier and easier as we approach transformative AI. But I still find them helpful as an additional independent check to go along with eyeballing, economic extrapolations, etc. (And until recently I think they were probably the most common way people arrived at in-retrospect-reasonable-looking timeline estimates.)

Hi. Can you provide a citable reference for the "4 flops per parameter per token"? It's for a research paper in the foundations of quantum physics. Thanks. (Howard Wiseman.)

LawrenceC

Both the human brain cost estimates and the GPT3 cost estimates are incredibly noisy/mistaken and I wouldn’t take them too seriously.

To start, 15 orders of magnitude is not a narrow range at all!

For reference, the speed of light is within 8 orders of magnitude of the speed of a car, and an elephant weighs within 6 OOM of a chicken, so this uncertainty is really big.
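To make the comparison concrete, here is a small sketch that expresses each gap in orders of magnitude. The car speed and the animal masses are my own ballpark assumptions, not figures from this thread; only the speed of light and the 10^15 to 10^30 range are fixed.

```python
import math

def oom(a, b):
    """Absolute gap between a and b in orders of magnitude (powers of ten)."""
    return abs(math.log10(a / b))

SPEED_OF_LIGHT_M_S = 299_792_458  # exact by definition of the metre
CAR_SPEED_M_S = 30                # assumption: roughly highway speed
ELEPHANT_KG = 5000                # assumption: ballpark adult elephant
CHICKEN_KG = 2                    # assumption: ballpark chicken

speed_gap = oom(SPEED_OF_LIGHT_M_S, CAR_SPEED_M_S)
mass_gap = oom(ELEPHANT_KG, CHICKEN_KG)
post_range = oom(1e30, 1e15)      # the range called "narrow" in the question

print(f"light vs car:        {speed_gap:.1f} OOM")
print(f"elephant vs chicken: {mass_gap:.1f} OOM")
print(f"10^15 to 10^30:      {post_range:.1f} OOM")
```

With these assumed inputs both everyday gaps stay well under the stated bounds of 8 and 6 OOM, while the range in the question spans 15.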

To be fair, I’ll note that the 10^30 estimate for GPT-3 is clearly an overestimate: the 3 x 10^23 floating point operations is the total compute used to train GPT-3, not its per-second usage (the unit is floating point operations, not floating point operations per second; yes, the notation is confusing).

I also think that the higher values of 10^27 and 10^30 seem pretty infeasible for brain simulation. But the lower numbers still seem feasible.

Another issue with your estimate is that it’s very bizarre to compare the total cost of training GPT vs the instantaneous operating cost of a human. Surely we want to compare like to like, and compare either instantaneous compute usage, or cumulative lifetime usage (which would multiply the human number by around 9 orders of magnitude).
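The "around 9 orders of magnitude" figure is roughly the number of seconds in a human lifetime so far. A quick sketch, assuming a 30-year span (my choice for illustration; the answer does not specify one):

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600
ASSUMED_YEARS = 30   # hypothetical lifetime-so-far, not stated in the answer

lifetime_seconds = ASSUMED_YEARS * SECONDS_PER_YEAR
lifetime_oom = math.log10(lifetime_seconds)

print(f"seconds in {ASSUMED_YEARS} years: {lifetime_seconds:.2e}")
print(f"multiplier in orders of magnitude: {lifetime_oom:.1f}")
```

Any span from about 10 to 100 years gives a multiplier between roughly 8.5 and 9.5 OOM, so the conclusion is not sensitive to the exact choice.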

A final confusion here is how to convert a forward pass to human thinking time. There are some arguments that a forward pass is way more (you can’t read thousands of characters at once, for example) and some that it’s less (you can generate way more than literally one token at a time, and also do more impressive cognition).

So I think a better estimate for GPT3 cognition in real human equivalent time is something like 10^12 - 10^17 flops, while humans are 10^15 - 10^26 or whatever. This looks way less coincidental!
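One way to see how the two ranges relate is to compare them as intervals of exponents. This is only a sketch of the comparison, using the endpoints quoted above; the helper name is made up for illustration.

```python
# Ranges from the paragraph above, written as (low, high) powers of ten.
GPT3_RANGE = (12, 17)    # 10^12 - 10^17 flops
HUMAN_RANGE = (15, 26)   # 10^15 - 10^26 flops

def overlap(a, b):
    """Return the overlapping (low, high) exponent interval, or None if disjoint."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None

shared = overlap(GPT3_RANGE, HUMAN_RANGE)
gpt3_width = GPT3_RANGE[1] - GPT3_RANGE[0]
human_width = HUMAN_RANGE[1] - HUMAN_RANGE[0]

print(f"GPT-3 range width: {gpt3_width} OOM")
print(f"human range width: {human_width} OOM")
print(f"overlap: 10^{shared[0]} to 10^{shared[1]}")
```

The ranges share only the 10^15 to 10^17 slice, and most of the human range lies above anything plausible for GPT-3.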

Most importantly, I think the obvious explanation applies here: to do the impressive kind of human-seeming cognition GPT-3 and its LLM brethren can do, using relatively straightforward methods like neural networks, takes a non-negligible fraction of the human brain’s compute.

7 comments

I worry that this is conflating two possible meanings of FLOPS:

  1. Floating Point Operations (FLOPs)
  2. Floating Point Operations per Second (Maybe FLOPs/s is clear?)

The AI and Memory Wall data is using (1) while the Sandberg / Bostrom paper is using (2) (see the definition in Appendix F).

(I noticed a type error when thinking about comparing real-time brain emulation vs training).
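The type error can be made explicit by keeping a total (FLOPs) separate from a rate (FLOPs/s) and converting only by dividing by a duration. In this sketch the 3e23 total is the training figure quoted elsewhere in this thread, and the 90-day duration is a hypothetical stand-in for "some months".

```python
SECONDS_PER_DAY = 24 * 3600

TOTAL_TRAINING_FLOP = 3e23    # a total count of operations, not a rate
ASSUMED_TRAINING_DAYS = 90    # hypothetical: "some months" read as ~3 months

def sustained_rate(total_flop, seconds):
    """Average FLOP/s needed to finish total_flop operations in the given time."""
    return total_flop / seconds

rate_over_months = sustained_rate(
    TOTAL_TRAINING_FLOP, ASSUMED_TRAINING_DAYS * SECONDS_PER_DAY
)
rate_in_one_second = sustained_rate(TOTAL_TRAINING_FLOP, 1)

print(f"average rate over {ASSUMED_TRAINING_DAYS} days: {rate_over_months:.1e} FLOP/s")
print(f"rate to finish in one second: {rate_in_one_second:.1e} FLOP/s")
```

Only the second pair of numbers (rates) is comparable to the real-time emulation figures in the Sandberg / Bostrom table; the raw 3e23 total is not.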

There was a recent post estimating that GPT-3 is equivalent to about 175 bees. There is also a comment there asserting that a human is about 140k bees.

I would be very interested if someone could explain where this huge discrepancy comes from. (One estimate is equating synapses with parameters, while this one is based on FLOPS. But there shouldn't be such a huge difference.)

The range of possible compute is almost infinite (e.g. 10^100 FLOPS and beyond). Yet both intelligences are in the same relatively narrow range of 10^15 - 10^30

10^15 - 10^30 is not at all a narrow range! So depending on what the 'real' answer is, there could be as little as zero discrepancy between the ratios implied by these two posts, or a huge amount. If we decide that GPT-3 uses 10^15 FLOPS (the inference amount) and meanwhile the first "decent" simulation of the human brain is the "Spiking neural network" (10^18 FLOPS according to the table), then the human-to-GPT ratio is 10^18 / 10^15 which is almost exactly 140k / 175. Whereas if you actually need the single molecules version of the brain (10^43 FLOPS), there's suddenly an extra factor of ten septillion lying around.
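The ratios in this comment are easy to verify. A minimal sketch using only the numbers quoted above (the bee counts and the table entries):

```python
import math

# Bee-equivalent estimates quoted in the parent comment.
HUMAN_BEES = 140_000
GPT3_BEES = 175

# Compute estimates quoted from the post and its table.
GPT3_INFERENCE = 1e15        # the post's inference figure
SPIKING_NETWORK = 1e18       # "Spiking neural network" row
SINGLE_MOLECULES = 1e43      # "Stochastic behavior of single molecules" row

bee_ratio = HUMAN_BEES / GPT3_BEES
table_ratio = SPIKING_NETWORK / GPT3_INFERENCE
extra_factor = SINGLE_MOLECULES / SPIKING_NETWORK

print(f"human / GPT-3 in bees:     {bee_ratio:.0f}")
print(f"spiking network / GPT-3:   {table_ratio:.0f}")
print(f"extra factor at molecules: 10^{math.log10(extra_factor):.0f}")
```

The bee ratio is 800 and the table ratio is 1000, so the two agree to within a factor of about 1.25, while moving to the single-molecule row adds a further factor of 10^25.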

Author of the post here — I don’t think there’s a huge discrepancy here, 140k/175 is clearly within the range of uncertainty of the estimates here!

That being said, the Bee post really shouldn’t be taken too seriously. One synapse is not exactly one float16 or int8 parameter, etc.

I didn't read that post. should I? is it more than a joke?

edit: I read it. it was a lot shorter than I expected, sometimes I'm a dumbass about reading posts and forget to check length. it's a really simple point, made in the first words, and I figured there would be more to it than that for some reason. there isn't.

I wouldn't call it a huge discrepancy. If both values are correct, it means the human brain requires only 10^2 - 10^3 times more compute than GPT-3.

The difference could have been dozens or even hundreds of OOMs, but it's only 2 - 3, which is quite interesting. Why is the difference in compute so small if the nature of the two systems is so different?