Epistemic status: grain of salt (but you can play around with the model yourself). There's lots of uncertainty in how many FLOP/s the brain can perform.

In informal debate, I've regularly heard people say something like, "oh but brains are so much more efficient than computers" (followed by a variant of "so we shouldn't worry about AGI yet"). Putting aside the weakly argued AGI skepticism, brains actually aren't all that much more efficient than computers (at least not in any way that matters).

The first problem is that these people are usually comparing the energy requirements of training large AI models to the power requirements of running the normal waking brain. These two things don't even have the same units.

The only fair comparison is between the trained model and the waking brain or between training the model and training the brain. Training the brain is called evolution, and evolution isn't particularly known for its efficiency.

Let's start with the easier comparison: a trained model vs. a trained brain. Joseph Carlsmith estimates that the brain delivers roughly 1 petaFLOP/s (= $10^{15}$ floating-point operations per second). If you eat a normal diet, you're expending roughly $10^{-13}$ J/FLOP.

Meanwhile, the supercomputer Fugaku delivers 450 petaFLOP/s at 30 MW, which comes out to about $7 \times 10^{-11}$ J/FLOP... So I was wrong? Computers require several hundred times more energy per FLOP than humans?
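The arithmetic behind this comparison can be sketched in a few lines of Python. This is a rough check rather than the post's actual model: it takes single point values (about 9,000 kJ/day of food, $10^{15}$ FLOP/s for the brain, and 30 MW with 450 petaFLOP/s for Fugaku, the upper end of the power range in the appendix) where the appendix model uses ranges, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def joules_per_flop(power_watts: float, flop_per_second: float) -> float:
    """Energy per floating-point operation: W / (FLOP/s) = J/FLOP."""
    return power_watts / flop_per_second


# Human: whole-body dietary intake, assumed midpoint of 8,000-10,000 kJ/day.
human_power_w = 9_000e3 / SECONDS_PER_DAY   # roughly 104 W
brain_flop_per_s = 1e15                     # median brain estimate in the appendix
human = joules_per_flop(human_power_w, brain_flop_per_s)

# Fugaku: about 30 MW for about 450 petaFLOP/s.
fugaku = joules_per_flop(30e6, 450e15)

ratio = fugaku / human
print(f"human:  {human:.1e} J/FLOP")
print(f"Fugaku: {fugaku:.1e} J/FLOP")
print(f"ratio:  {ratio:.0f}x")
```

With these point values the supercomputer spends roughly 640 times more energy per FLOP than a human body does, before any correction for where the energy comes from.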

 

What this misses is an important practical point: supercomputers can tap pretty much directly into sunshine; human food calories are heavily-processed hand-me-downs. We outsource most of our digestion to nature and industry.

Even the most whole-foods-grow-your-own-garden vegan is 2-3 orders of magnitude less efficient at capturing calories from sunlight than your average device.[1] That's before animal products, industrial processing, or any of the other Joules it takes to run a modern human.
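As a rough illustration of where that gap comes from, the sketch below multiplies out the point values given in footnote 1. It follows the appendix model in multiplying photosynthesis by one trophic step on the food side; the variable names are illustrative, and the appendix uses ranges rather than these single numbers.

```python
import math

# Fraction of incident sunlight that ends up as usable energy.
photosynthesis = 0.01        # sunlight -> plant calories (about 1%)
trophic_step = 0.10          # one order of magnitude lost per trophic level
solar_panel = 0.20           # sunlight -> electricity (about 20%)
transmission = 1 - 0.10      # about 10% lost in the grid

food_efficiency = photosynthesis * trophic_step
hardware_efficiency = solar_panel * transmission

gap_oom = math.log10(hardware_efficiency / food_efficiency)
print(f"food:     {food_efficiency:.4f}")
print(f"hardware: {hardware_efficiency:.2f}")
print(f"gap:      {gap_oom:.1f} orders of magnitude")
```

These point values give a gap of about 2.3 orders of magnitude (a factor of roughly 180) in favor of the hardware.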

After this correction, humans and computers are about head-to-head in energy/FLOP, and it's only getting worse for us humans. The fact that the brain runs on so little actual juice suggests there's plenty of room left for us to explore specialized architectures, but it isn't the damning case many think it is. (We're already seeing early neuromorphic chips out-perform neurons' efficiency by four orders of magnitude.)

But what about training neural networks? Now that we know the energy costs per FLOP are about equal, all we have to do is compare FLOP required to evolve brains to the FLOP required to train AI models. Easy, right?

Here's how we'll estimate this:

  1. For a given, state-of-the-art NN (e.g., GPT-3, PaLM), determine how many FLOP/s it performs when running normally.
  2. Find a real-world brain which performs a similar number of FLOP/s.
  3. Determine how long that real-world brain took to evolve.
  4. Compare the number of FLOP performed during that period to the number of FLOP required to train the given AI.

Fortunately, we can piggyback off the great work done by Ajeya Cotra on forecasting "Transformative" AI. She calculates that GPT-3 performs about  FLOP/s,[2] or about as much as a bee.

Going off Wikipedia, social insects evolved only about 150 million years ago. That translates to between $10^{38}$ and $10^{45}$ FLOP. GPT-3, meanwhile, took about $10^{23}$ FLOP. That means evolution is 15 to 22 orders of magnitude less efficient.
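The four estimation steps can be sketched in log space, in the same spirit as the evolution block in the appendix. This is a simplification under stated assumptions: it plugs in only the endpoints of the appendix ranges (ancestor population $10^{19}$ to $10^{23}$, $10^{2}$ to $10^{6}$ FLOP/s per brain, 850 million years) instead of sampling distributions, and it assumes GPT-3's commonly cited training compute of about $3.14 \times 10^{23}$ FLOP. The helper name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def evolution_log10_flop(log10_population: float,
                         log10_flop_per_s: float,
                         years: float) -> float:
    """log10 of the total FLOP performed by an ancestor population over `years`."""
    return (log10_population
            + log10_flop_per_s
            + math.log10(SECONDS_PER_YEAR)
            + math.log10(years))


# Endpoints of the appendix ranges (all in log10 units except the duration).
low = evolution_log10_flop(19, 2, 850e6)
high = evolution_log10_flop(23, 6, 850e6)

# Assumption: commonly cited GPT-3 training compute.
gpt3 = math.log10(3.14e23)

print(f"evolution: 10^{low:.1f} to 10^{high:.1f} FLOP")
print(f"GPT-3:     10^{gpt3:.1f} FLOP")
print(f"gap:       {low - gpt3:.1f} to {high - gpt3:.1f} orders of magnitude")
```

Combining range endpoints brackets the answer more widely than a sampled interval would, so this gives a gap of roughly 14 to 22 orders of magnitude, in the same region as the figure quoted above.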

 

Now, you may have some objections. You may consider bees to be significantly more impressive than GPT-3. You may want to select a reference animal that evolved earlier in time. You may want to compare unadjusted energy needs. You may even point out the fact that the Chinchilla results suggest GPT-3 was "significantly undertrained".

Object all you want, and you still won't be able to explain away the >15 OOM gap between evolution and gradient descent. This is no competition.

What about other metrics besides energy and power? Consider that computers are about 10 million times faster than human brains. Or that if the human brain can store a petabyte of data, S3 can do so for about $20,000 (2022). Even FLOP for FLOP, supercomputers already underprice humans.[3] There's less and less for us to brag about.
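The price comparison in footnote 3 can be checked with a short sketch. It assumes the footnote's figures (a $7.5 million value of a statistical life, a $1 billion supercomputer) together with the $10^{15}$ FLOP/s brain and 450 petaFLOP/s machine used in the appendix. Note that this is a price per unit of compute capacity (dollars per FLOP/s), and the function name is illustrative.

```python
def dollars_per_flop_per_second(price_usd: float, flop_per_second: float) -> float:
    """Purchase price divided by sustained compute capacity."""
    return price_usd / flop_per_second


# Human: FEMA value of a statistical life, brain at about 1 petaFLOP/s.
human_price = dollars_per_flop_per_second(7.5e6, 1e15)

# Supercomputer: about $1 billion for about 450 petaFLOP/s.
supercomputer_price = dollars_per_flop_per_second(1e9, 450e15)

price_ratio = supercomputer_price / human_price
print(f"human:         ${human_price:.1e} per FLOP/s")
print(f"supercomputer: ${supercomputer_price:.1e} per FLOP/s")
print(f"ratio:         {price_ratio:.2f}")
```

On these numbers the supercomputer costs about 0.3 times as much per unit of compute, which is where the footnote's "a fourth the cost" comes from.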

 

Brains are not magic. They're messy wetware, and hardware will catch up (or rather, has caught up).

Postscript: brains actually might be magic. Carlsmith assigns less than 10% (but non-zero) possibility that the brain computes more than $10^{21}$ FLOP/s. In this case, brains would currently still be vastly more efficient, and we'd have to update in favor of additional theoretical breakthroughs before AGI.

If we include this uncertainty in brain FLOP/s, the graph looks more like this:

[Figure: the same comparison, without false overconfidence.]

Mean: ~
Median: 830

Appendix

The graphs in this post were generated using Squiggle and the obsidian-squiggle plugin. Click here if you want to play around with the model yourself.

brainEnergyPerFlop = {
	humanBrainFlops = 15; // log10 FLOP/s. Full range: 10 to 23; median 15; P(>21) < 10%
	humanBrainFracEnergy = 0.2; // Brain's share of the body's energy budget (unused below)
	humanEnergyPerDay = 8000 to 10000; // Daily kJ consumption (whole body)
	humanBrainPower = humanEnergyPerDay / (60 * 60 * 24); // kW, based on whole-body intake
	humanBrainPower * 1000 / (10 ^ humanBrainFlops) // J / FLOP
}

supercomputerEnergyPerFlop = {
    // https://www.top500.org/system/179807/ 
	power = 25e6 to 30e6; // W (= J/s)
	flops = 450e15 to 550e15;
	power / flops
}

supercomputerEnergyPerFlop / brainEnergyPerFlop

humanFoodEfficiency = {
	photosynthesisEfficiency = 0.001 to 0.03
	trophicEfficiency = 0.1 to 0.15
	photosynthesisEfficiency * trophicEfficiency 
}

computerEfficiency = {
    solarEfficiency = 0.15 to 0.20
    transmissionEfficiency = 1 - (0.08 to .15)
    solarEfficiency * transmissionEfficiency
}

computerEfficiency / humanFoodEfficiency

evolution = {
    // Based on Ajeya Cotra's "Forecasting TAI with biological anchors"
    // All calculations are in log space.
	
	secInYear = log10(365 * 24 * 60 * 60);
	
	// We assume that the average ancestor pop. FLOP per year is ~constant.
	// cf. Humans at 10 to 20 log10 FLOP/s & 7 to 10 log10 population
	ancestorsAveragePop = uniform(19, 23); // Tomasik estimates ~1e21 nematodes
    ancestorsAverageBrainFlops = 2 to 6; // ~ C. elegans
	ancestorsFlopPerYear = ancestorsAveragePop + ancestorsAverageBrainFlops + secInYear;

	years = log10(850e6) // 1 billion years ago to 150 million years ago
	ancestorsFlopPerYear + years
}

humanLife$ = 1e6 to 10e6
humanBrainFlops = 1e15
humanBrain$PerFlops = humanLife$ / humanBrainFlops 

supercomputer$ = 1e9
supercomputerFlops = 450e15
supercomputer$PerFlop = supercomputer$ / supercomputerFlops


supercomputer$PerFlop / humanBrain$PerFlops

[1] Photosynthesis has an efficiency around 1%, and jumping up a trophic level means another order of magnitude drop. The most efficient solar panels have above 20% efficiency, and electricity transmission loss is around 10%.

[2] Technically, it's FLOP per "subjective second", i.e., a second of equivalent natural thought. This can be faster or slower than "truth thought."

[3] Compare FEMA's value of a statistical life at $7.5 million to the $1 billion price tag of the Fugaku supercomputer, and we come out to the supercomputer being a fourth the cost per FLOP.

16 comments

It is quite blatant about what it does, but this is pretty much statistics hacking.

I would compare a model being trained to computations that a single brain does over its lifetime to configure itself (or only restrict to childhood).

For brain evolution analog I would include the brain metabolism of the computer scientists developing the next version of the model too.

I would not hold against the CPU the inefficiency of the solar panels. Likewise I don't see how it is fair to blame the brain for the inefficiency of the gut. In case we can blame the gut, then we should compare how much the model causes its electricity supply to increase, which for many is equal to 0.

I would compare a model being trained to computations that a single brain does over its lifetime to configure itself (or only restrict to childhood).

If we are going to compare TCO, I would point out that for many scenarios, 'a single brain' is a wild underestimate of the necessary computation because you cannot train just a single brain to achieve the task.

For some tasks like ImageNet classification, sure, pretty much every human can do it and one only needs enough compute for a single brain lifetime; but for many tasks of considerable interest, you need to train many human brains, rejecting and filtering most of them along the way, to (temporarily) get a single human brain achieving the desired performance level. There is no known way around that need for brute force.

For example, how many people can become as good artists as Parti already is? One in a hundred, maybe? (Let's not even require them to have the same universality, we'll let them specialize in any style or medium...) How many people are as flexible and versatile as GPT-3 at writing? (Let's just consider poetry. Even while unable to rhyme, GPT-3 is better at writing poetry than almost everyone at your local highschool or college, so that's out of hundreds to thousands of humans. Go read student anthologies if you don't believe me.)

Or more pointedly, you need exactly 1 training run of MuZero to get a world champion-level Go/chess/shogi agent (and then some) the first time, like every time, neither further training nor R&D required; how many humans do you need to churn through to get a single Go world champion? Well, you need approximately 1.4 billion Chinese/South-Korean/Japanese people to produce a Ke Jie or Lee Sedol as an upper bound (total population), around a hundred million as an intermediate point (approximate magnitude of number of people who know how to play), and tens of thousands of people as a lower bound (being as conservative as possible, by only counting a few years of highly-motivated insei children enrollment in full-time Go schools with the explicit goal of becoming Go professionals and even world champ, almost all of whom will fail to do so because they just can't hack it). And then the world champ loses after 5-10 years, max, because they get old and rusty, and now you throw out all that compute, and churn again for a replacement; meanwhile, the MuZero agent remains as pristinely superior as the day it was trained, and costs you 0 FLOPS to 'train a replacement'. This all holds true for chess and shogi as well, just with smaller numbers. So, whether that's a >10,000× or >1,400,000,000× multiplication of the cost, it's non-trivial, and unflattering to human brain efficiency estimates.

So, if you really think we should only be comparing training budgets to single brain-lifetimes, then I would be curious to hear how you think you can train a human Go world champ who can be useful for as long as a DL world champ, using only a single brain-lifetime.

Considering what a country can do in a decade does make sense. But it is still relatively close compared to multiple millennia evolutionary timescales.

On that "country level" we should also consider for the model hyperparameter tuning and such. It is not super decisive but it is not like we precommit to use that single run if it is not maximally competent on what we think the method can provide.

Humans produce go professionals as a side product or one mode of answering the question of life. Even quite strict go professionals do stuff like prepare meals, file taxes and watch television. It might be unethical to set a scenario to test for single task performance of human brains. "Do go really well and a passable job at stereoscopic 3d vision" is a different task than just "Do go really well". Humans being able to do ImageNet classifications without knowing to prepare for that specific task is quite a lot more than just having the capability. In contrast most models get an environment or data that is very pointedly shaped/helpful for their target task.

Human filtering is also pretty much calibrated on human ability levels, i.e. a good painter is a good human painter. Thus the "miss rate" based on trying to gather the cream of the cream doesn't really tell that it would be a generally unreliable method.

Considering what a country can do in a decade does make sense. But it is still relatively close compared to multiple millennia evolutionary timescales.

I'm not sure what you mean here. If you want to incorporate all of the evolution before that into that multiplier of '1.4 billion', so it's thousands of times that, that doesn't make human brains look any more efficient.

Humans produce go professionals as a side product or one mode of answering the question of life. Even quite strict go professionals do stuff like prepare meals, file taxes and watch television.

All of those are costs and disadvantages to the debit of human Go FLOPS budgets; not credits or advantages.

On that "country level" we should also consider for the model hyperparameter tuning and such.

Sure, but that is a fixed cost which is now in the past, and need never be done again. The MuZero code is written, and the hyperparameters are done. They are amortized over every year that the MuZero trained model exists, so as humans turn over at the same cost every era, the DL R&D cost approaches zero and becomes irrelevant. (Not that it was ever all that large, since the total compute budget for such research tends to be more like 10-100x the final training run, and can be <1x in scaling research where one pilots tiny models before the final training run: T5 or GPT-3 did that. So, irrelevant compared to the factors we are talking about like >>10,000x.)

"Do go really well and a passable job at stereoscopic 3d vision" is a different task than just "Do go really well".

But not one that anyone has set, or paid for, or cares even the slightest about whether Lee Sedol can see stereoscopic 3D images.

Humans being able to do ImageNet classifications without knowing to prepare for that specific task is quite a lot more than just having the capability.

I think you are greatly overrating human knowledge of the 117 dog breeds in ImageNet, and in any case, zero-shot ImageNet is pretty good these days.

In contrast most models get an environment or data that is very pointedly shaped/helpful for their target task.

Again, a machine advantage and a human disadvantage.

Human filtering is also pretty much calibrated on human ability levels, i.e. a good painter is a good human painter. Thus the "miss rate" based on trying to gather the cream of the cream doesn't really tell that it would be a generally unreliable method.

I don't know what you mean by this. The machines either do or do not pass the thresholds that varying numbers of humans fail to pass; of course you can have floor effects where the tasks are so easy that every human and machine can do it, and so there is no human penalty multiplier, but there are many tasks of considerable interest where that is obviously not the case and the human inefficiency is truly exorbitant and left out of your analysis. Chess, Go, Shogi, poetry, painting, these are all tasks that exist, and there are more, and will be more.

For example, how many people can become as good artists as Parti already is? One in a hundred, maybe?

Are you actually talking about DALLE-2?

No. DALL-E 2 is not SOTA, so no point in citing some old system from almost half a year ago as the example.

It is quite blatant about what it does, but this is pretty much statistics hacking.

Like I said, there's plenty of uncertainty in FLOP/s. Maybe it's helpful if I rephrase this as an invitation for everyone to make their own modifications to the model.

I would compare a model being trained to computations that a single brain does over its lifetime to configure itself (or only restrict to childhood).

Cotra's lifetime anchor is  FLOPs (so 4-5 OOMs above gradient descent). That's still quite a chasm.

For brain evolution analog I would include the brain metabolism of the computer scientists developing the next version of the model too.

Do you mean including the CS brain activity towards the computed costs of training the model?

I would not hold against the CPU the inefficiency of the solar panels. Likewise I don't see how it is fair to blame the brain for the inefficiency of the gut. In case we can blame the gut, then we should compare how much the model causes its electricity supply to increase, which for many is equal to 0.

If you're asking yourself whether or not you want to automate a certain role, then a practical subquestion is how much you have to spend on maintenance/fuel (i.e., electricity or food). Then, I do think that acknowledging the source of the fuel becomes important.

Yes, I think that GPT-1 turning into GPT-2 and GPT-3 is the thing that is analogous to building brains out of new combinations of DNA. Having an instance of GPT-3 hone its weights and a single brain cutting and forming its connections are comparable. When doing Fermi estimates, getting the ballpark wrong is pretty fatal, as it is at the core of the activity. With that much conceptual confusion going on I don't care about the numbers. To claim that others are making mistakes and then not survive a cursory look does not bode well for convincingness. I don't care to get lured by pretty graphs into thinking my ignorance is more informed than it is.

If I know that the researchers look at the data until they find a correlation with p<0.05, that they found something is not really significant news. Similarly, if you keep changing your viewpoint until you find an angle where the orderings seem to reverse, it's less convincing that this one viewpoint is the one that matters.

Economically I would be interested in the ability to change electricity to sugar and sugar to electricity. But because the end product is not the same, the processes are not nearly economically interchangeable. Go a long way in this direction and you measure everything in dollars. But typically when we care to specify that we care about energy efficiency and not, for example, time efficiency, we are going for more dimensions and more considerations rather than fewer.

To set terminology so that if gas prices go up then the energy efficiency of everything that uses gas goes down does not seem handy to me.

"oh but brains are so much more efficient than computers" (followed by a variant of "so we shouldn't worry about AGI yet"). Putting aside the weakly argued AGI skepticism, brains actually aren't all that much more efficient than computers (at least not in any way that matters).

I think a better counterargument is that if a computer running a human-brain-like algorithm consumes a whopping 10,000× more power than does a human brain, who cares? The electricity costs would still be below my local minimum wage!

Training the brain is called evolution

I argue here that a much better analogy is between training an ML model versus within-lifetime learning, i.e. multiply Joe Carlsmith’s FLOP/s estimates by roughly 1 billion seconds (≈31 years, or pick a different length of time as you wish) to get training FLOP. See the “Genome = ML code” analogy table in that post.
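A minimal sketch of that multiplication, assuming a brain at about $10^{15}$ FLOP/s (the median used in the post's appendix) and the 1 billion seconds suggested above; the variable names are illustrative.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

brain_flop_per_s = 1e15      # assumed median brain estimate
learning_seconds = 1e9       # "roughly 1 billion seconds"

learning_years = learning_seconds / SECONDS_PER_YEAR
lifetime_flop = brain_flop_per_s * learning_seconds

print(f"learning period: {learning_years:.1f} years")
print(f"lifetime 'training' compute: 10^{math.log10(lifetime_flop):.0f} FLOP")
```

Under these assumptions a single brain's within-lifetime learning comes to about $10^{24}$ FLOP over roughly 31.7 years; picking a different brain estimate or learning period shifts the result proportionally.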

the supercomputer Fugaku delivers 450 petaFLOP/s at 30 MW

I didn’t check just now, but I vaguely recall that there’s several (maybe 3??)-orders-of-magnitude difference between FLOP/J of a supercomputer versus FLOP/J of a GPU.

Watch out for FLOP/s (floating point operations per second) vs. FLOPs (floating point operations). I'm sorry for the source of confusion, but FLOPs usually reads better than FLOP.

I think that’s a bad tradeoff. FLOP reads just fine. Clear communication is more important!! :)

I think a better counterargument is that if a computer running a human-brain-like algorithm consumes a whopping 10,000× more power than does a human brain, who cares? The electricity costs would still be below my local minimum wage!

I agree (as counterargument to skepticism)! Right now though, "brains being much more efficient than computers" would update me towards "AGI is further away / more theoretical breakthroughs are needed". Would love to hear counterarguments to this model.

I argue here that a much better analogy is between training an ML model versus within-lifetime learning, i.e. multiply Joe Carlsmith’s FLOP/s estimates by roughly 1 billion seconds (≈31 years, or pick a different length of time as you wish) to get training FLOP. See the “Genome = ML code” analogy table in that post.

Great point. Copying from my response to Slider: "Cotra's lifetime anchor is  FLOPs (so 4-5 OOMs above gradient descent). That's still quite a chasm."

I didn’t check just now, but I vaguely recall that there’s several (maybe 3??)-orders-of-magnitude difference between FLOP/J of a supercomputer versus FLOP/J of a GPU.

This paper suggests 100 GFLOPs/W in 2020 (within an OOM of Fugaku). I don't know how much progress there's been in the last two years.

I think that’s a bad tradeoff. FLOP reads just fine. Clear communication is more important!! :)

Good point! I've updated the text.
 

Right now though, "brains being much more efficient than computers" would update me towards "AGI is further away / more theoretical breakthroughs are needed". Would love to hear counterarguments to this model.

I don’t understand why you would update that way, so I’m not sure how to counterargue.

For example, suppose that tomorrow Intel announced that they’ve had a breakthrough in carbon nanotube transistors or whatever, and therefore future generations of chips will be 10× more energy efficient per FLOP. If I understand correctly, on your model, you would see that announcement and say “Oh wow, carbon nanotube transistors, I guess now I should update to AGI is closer / fewer theoretical breakthroughs are needed.” Whereas on my model, that announcement is interesting but has a very indirect impact on what I should think and expect about AGI. Can you say more about where you’re coming from there?

This paper suggests 100 GFLOPs/W in 2020 (within an OOM of Fugaku). I don't know how much progress there's been in the last two years.

A100 datasheet says 624 TFLOP/s at 250 W, i.e. about $4 \times 10^{-13}$ J/FLOP. So ≈1 OOM lower than the supercomputer Fugaku. Good to know!
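For anyone converting between these units, here is a small sketch. It assumes only the figures quoted in this thread (100 GFLOP/s per watt, and 624 TFLOP/s at 250 W for the A100); the helper names are made up for illustration.

```python
def joules_per_flop_from_efficiency(flop_per_second_per_watt: float) -> float:
    """FLOP/s per watt is FLOP per joule, so J/FLOP is its reciprocal."""
    return 1.0 / flop_per_second_per_watt


def joules_per_flop(power_watts: float, flop_per_second: float) -> float:
    """Energy per operation: W / (FLOP/s) = J/FLOP."""
    return power_watts / flop_per_second


paper_2020 = joules_per_flop_from_efficiency(100e9)   # 100 GFLOP/s per W
a100 = joules_per_flop(250, 624e12)                   # 624 TFLOP/s at 250 W

print(f"100 GFLOP/s/W: {paper_2020:.1e} J/FLOP")
print(f"A100:          {a100:.1e} J/FLOP")
```

The first figure works out to $10^{-11}$ J/FLOP and the second to about $4 \times 10^{-13}$ J/FLOP.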

Let me take a slightly different example: echolocation.

Bats can detect differences in period as short as 10 nanoseconds. Neuronal spiking maxes out around 100 Hz. So the solution can't just be as simple as "throw more energy and compute at it". It's a question of "design clever circuitry that's as close as possible to theoretical limits on optimality". 

Similarly, the brain being very efficient increases the probability I assign to "it is doing something non-(practically-)isomorphic to feed-forward ANNs". Maybe it's hijacking recurrency in a way that scales far more effectively with parameter size than we can ever hope to create with transformers. 

But I notice I am confused and will continue to think on it. 

I think you pulled epistemic malfeasance in your arguments about energy.

 

The brain runs at around 20 watts for 1 PetaFLOP/s.

Fugaku runs at around 30 MW for 450 PetaFLOP/s.

 

Your arguments about food processing seem completely irrelevant and basically cheating.

There are a lot of different ways you can talk about "efficiency" here. The main thing I am thinking about with regard to the key question "how much FLOP would we expect transformative AI to require?" is whether, when using a neural net anchor (not evolution) to add a 1-3 OOM penalty to FLOP needs due to 2022-AI systems being less sample efficient than humans (requiring more data to produce the same capabilities) and with this penalty decreasing over time given expected algorithmic progress. The next question would be how much more efficient potential AI (e.g., 2100-AI not 2022-AI) could be given fundamentals of silicon vs. neurons, so we might know how much algorithmic progress could affect this.

I think it is pretty clear right now that 2022-AI is less sample efficient than humans. I think other forms of efficiency (e.g., power efficiency, efficiency of SGD vs. evolution) are less relevant to this.

I think it is pretty clear right now that 2022-AI is less sample efficient than humans. I think other forms of efficiency (e.g., power efficiency, efficiency of SGD vs. evolution) are less relevant to this.

To me this isn't clear. Yes, we're better one-shot learners, but I'd say the most likely explanation is that the human training set is larger and that much of that training set is hidden away in our evolutionary past.

It's one thing to estimate evolution FLOP (and as Nuño points out, even that is questionable). It strikes me as much more difficult (and even more dubious) to estimate the "number of samples" or "total training signal (bytes)" over one's lifetime / evolution.

Neat. I have some uncertainty about the evolutionary estimates you are relying on, per here. But neat.