This is a response to a comment made by Rohin Shah on Daniel Kokotajlo's post Fun with +12 OOMs of Compute. I started trying to answer some questions and assumptions he had, then realized there was more of an inferential gap that needed filling in. Also, as I attempted to estimate the OOMs of compute above GPT-3/PaLM needed for each method, I realized I was just going off of vague guesses rather than grounded estimates based on recent benchmarks. So, since other people might also be lacking the same info and be curious about my answer, I decided to put a bit more work into answering and turn it into a full post.

Introducing the cast

First, I'd like to note that I don't endorse trying to get to AGI through any of these methods. I think they are potentially worse for interpretability in addition to being less compute efficient. My goal here is to point out that I think it could be done if the world were suddenly given lots more compute. In other words, I shall make the argument that given lots of compute, issues of limited data and potential scaling plateaus of artificial neural nets can be bypassed via other less compute efficient methods. Many roads lead to AGI, and specific predictions about the failure of one specific path (e.g. Transformers) don't necessarily mean all the paths are affected by that predicted failure mode.

The main contenders

  1. Numenta - moderately detailed, most realistic-task proven. Designers made carefully chosen abstractions which may or may not be right.
  2. Spiking Neural Network simulators (spikingNNs) - [I mentioned Nengo previously because it's the one I'm most familiar with, but researching this showed me that there are other, better-performing options at similar abstraction levels, such as BindsNet and brian2genn] moderately detailed, moderately compute efficient, moderately task proven. Fewer abstractions than Numenta, more than Blue Brain. Less chance that an important detail was omitted, but still some chance.
  3. Blue Brain - highly detailed, very compute inefficient, not task proven. Very few abstractions, so relatively high chance that it contains all necessary details for a functioning neocortex.

Supporting roles

The field of computational neuroscience has generated lots and lots of very narrowly focused models of particular subsets of lots of different brains. None of these is alone likely to turn into a full blown AGI if you throw compute at them, but they have useful additional details that could potentially get the main contenders unstuck from unexpected scaling plateaus.

Emulation

By brain emulation, I mean trying to make a model that captures some of the observed functions of brain circuits. These models vary widely in how much fidelity to fine details they strive for, versus a more abstracted approximation. More detail brings the risk that you got one of those details wrong, and also means potentially requiring exponentially more compute to scale. Less detail means more reliance on having made the correct abstractions.

Neuroscientists have a failure mode around trying to make models that are too accurate and detailed. After all, if you've spent years of your life painstakingly measuring the tiny details, it can be really hard to swallow the idea that you might have to discard any of those details as irrelevant. I think Jan sums it up well in this comment:

Yes, I agree, a model can really push intuition to the next level! There is a failure mode where people just throw everything into a model and hope that the result will make sense. In my experience that just produces a mess, and you need some intuition for how to properly set up the model.

Each of the three contenders I mentioned has a very different level of detail and has chosen different abstractions.

What do these three main contenders have in common? A focus on the mammalian neocortex, the part of the brain that does the General Intelligence stuff, the part that humans have extra of. Neuroscience has lots of evidence showing that this is the critical part of the brain to emulate if you want a model that is able to reason abstractly about things. I won't go into depth here, but I will give you this quote from Numenta (see Jeff Hawkins' latest book for more depth, or this paper for a quick intro):

Old brain vs. new brain
A simple way to think about the brain is that it has two parts: the “old brain” and the “new brain.” The old brain, which comes from our reptilian ancestors and pre-dates dinosaurs, contains several different structures, such as the spinal cord and brainstem. It regulates your body (such as breathing), creates reflex behaviors (such as pulling your hand away from fire) and creates emotions (such as desire and anger). The new brain, or neocortex, is a single large organ. It sits on top of the old brain and is the brain’s analytical engine. It’s the part that can identify objects, learn a new language, or understand math.

It's worth noting that each of these projects focuses on the neocortex. The Blue Brain project, which talks about rodent brains, is only a few well-understood parameter changes away from being a very accurate emulation of the human neocortex. They are careful not to do this because of the ethical implications of accurately simulating human neocortex tissue. From things that some of the project participants have said, I'm pretty confident that they'd love to try simulating a whole human brain if given the compute and lack of oversight.

For example (emphasis mine) a quote from Rebasing I/O for Scientific Computing: Leveraging Storage Class Memory in an IBM BlueGene/Q Supercomputer by Schürmann et al 2014:

Combined with the large numbers of those entities, e.g. an estimated 200 million neurons and 10^12 synapses for an entire rat brain [10], the resulting memory footprint is large and at the same time the algorithmic intensity low. With the human brain being an additional three orders of magnitude more complex, cellular models of the human brain will occupy a daunting estimated 100PB of memory that will need to be revisited by the solver at every time step.

Human cortical neuron properties are pretty well known in a lot of respects and can already be simulated on the Blue Brain system. The team is just careful not to get hit by media hype/outrage by talking about large scale human neocortex experiments. An example of a small scale human cortical neuron experiment: https://live-papers.brainsimulation.eu/#2016-eyal-et-al

 

How much compute?

So I would argue that all of the main contenders are very training data efficient compared to artificial neural nets. I'm not going to go into detail on that argument, unless people let me know that that seems cruxy to them and they'd like more detail.

One of the things these contenders fall short on though is compute efficiency. For the sake of Daniel's thought experiment, I'd like to give some rough estimates on how much compute I think would be necessary to get a half-brain of compute for each of these. 

For artificial neural networks, the meaning of a 'neuron' or 'parameter' is less directly analogous to a neocortex neuron. For these emulations, the analogy holds together much better. The rough average number of neurons in the human neocortex is around 26 billion. So let's say 13 billion for the half-neocortex case.

Numenta training compute estimate

Ok, I just give up for now on finding benchmarks to accurately estimate this one. I give a rough guess at 'somewhere between the other two, closer to the Spiking Neural Nets'.

Here's the best summary I can give: they break the artificial neurons down into collections of artificial dendrites, which then have a very sparse activation and very sparse weights. This seems to help learn more from a given dataset, and to have an extended amount of information that can be 'fit' into the network without 'overwriting' previous info. The downside is that it's substantially less efficient to 'get' the information into the network in the first place. Like, it needs maybe 10x more epochs over the same dataset before it starts doing better than the feed forward multilayer perceptron was doing a while ago. But its learning doesn't plateau as soon, so it can eventually surpass the roughly-equivalent MLP.

 

Spiking Neural Net training compute estimate

my estimate: 3.82e24 flops

about 1 OOM over GPT-3

 less than an OOM over PaLM

For this category, I would add an additional OOM for the fact that the abstraction may be lossy/inefficient in capturing what actual brain neurons do. For instance, I noticed that the benchmark they were using in the papers had undershot the number of synapses for human pre-frontal cortex by an order of magnitude. Could be other things like that as well.

This is unlike Numenta, where the abstraction is very well thought out and I think it will either totally work or not, depending on whether they are as correct as they think they are about their abstraction. It is also unlike Blue Brain, where there is so much accuracy and so little abstraction that I feel quite confident it'll work as expected on an emulated-neuron == real-neuron basis.

 

Blue Brain training compute estimate

my estimate:  2.37e33 FLOPs 

10 OOMs over GPT-3

9 OOMs over PaLM

 

from https://blog.heim.xyz/palm-training-cost/ :

ML Model | Training Compute [FLOPs] | x GPT-3
GPT-3 (2020) | 3.1e23 | 1x
Gopher (2021) | 6.3e23 | 2x
Chinchilla (2022) | 5.8e23 | 2x
PaLM (2022) | 2.5e24 | 10x
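As a quick sanity check on the "OOMs over GPT-3 / PaLM" figures, here is a minimal Python sketch that takes the base-10 log of the ratio between each estimate and the baseline training compute from the table. The function name `ooms_over` and the dictionary names are mine, just for illustration.

```python
import math

# Training compute in FLOPs, copied from the table above.
BASELINES = {"GPT-3": 3.1e23, "PaLM": 2.5e24}

# Training compute estimates quoted in this post.
ESTIMATES = {"Spiking NN": 3.82e24, "Blue Brain": 2.37e33}


def ooms_over(estimate_flops: float, baseline_flops: float) -> float:
    """Orders of magnitude by which an estimate exceeds a baseline."""
    return math.log10(estimate_flops / baseline_flops)


for name, estimate in ESTIMATES.items():
    for model, baseline in BASELINES.items():
        print(f"{name} vs {model}: {ooms_over(estimate, baseline):.2f} OOMs")
```

The spiking estimate comes out at roughly 1.1 OOMs over GPT-3 and about 0.2 over PaLM, and the Blue Brain estimate at roughly 9.9 and 9.0, which is where the rounded figures come from.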

 

Sources:

Numenta paper 1

https://numenta.com/assets/pdf/research-publications/papers/Sparsity-Enables-100x-Performance-Acceleration-Deep-Learning-Networks.pdf

Using 8 bit compression of values via a unique mapping scheme, and running on FPGAs... hard to compare. Their mapping scheme pre-estimates the range of all variables, splits large numbers into lossy quantized representations spread across multiple 8 bit (INT8) numbers during encoding. So to get the equivalent of a FLOP, a floating point operation, you need to do several fixed-point 8 bit operations (FP-8bit-OPs). On average, maybe 4 FP-8bit-OPs per single precision FLOP? 

 

https://semiengineering.com/tops-memory-throughput-and-inference-efficiency/

What is TOPS? It means Trillions or Tera Operations per Second. It is primarily a measure of the maximum achievable throughput but not a measure of actual throughput. Most operations are MACs (multiply/accumulates), so TOPS = (number of MAC units) x (frequency of MAC operations) x 2

 

Alveo U250 datasheet says it gets 33.3 INT8 TOPs at peak.

 

Rough guess: divide TOPS by 4 to get a teraFLOPS equivalent for Numenta's specific use case, based on studying their encoding.

33.3 TOPS / 4 = 8.325 pseudo-teraFLOPS = 8.325e12 pseudoFLOPs / second
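The TOPS definition and the divide-by-4 conversion can be sketched in a few lines of Python. The factor of 4 INT8 operations per single-precision FLOP is my rough guess from above, not a measured value, and the function names are made up for this example. Note that tera means 1e12.

```python
TERA = 1e12  # SI prefix: one tera-operation is 1e12 operations


def peak_tops(mac_units: int, mac_hz: float) -> float:
    """TOPS = (number of MAC units) x (MAC frequency) x 2.

    The factor of 2 counts the multiply and the accumulate separately.
    """
    return mac_units * mac_hz * 2 / TERA


def pseudo_flops_per_second(int8_tops: float, int8_ops_per_flop: float = 4.0) -> float:
    """Convert peak INT8 TOPS into a rough FLOPs/second equivalent.

    int8_ops_per_flop is an assumed exchange rate (the rough guess of 4).
    """
    return int8_tops * TERA / int8_ops_per_flop


# Alveo U250 datasheet peak: 33.3 INT8 TOPS.
alveo_u250 = pseudo_flops_per_second(33.3)
print(f"{alveo_u250:.3e} pseudo-FLOPs/second")
```

Since TOPS is a peak figure, anything derived from it is an upper bound on throughput rather than what a real workload would achieve.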

 

 

? bio_seconds took ? wall clock seconds

 

flops / neuron

 flops / neurons =  flp/n

flp/n per bio_second

 flp/n / ? bio_second =  flp/n/s

 

So, for 1.3e9 neurons of the Cortex+Plasticity simulation type, for 15 bio_years of 'training time':

flops per second of biological time:

15 years of bio time needed for training? = 3.154e7 sec/year * 15 years =  4.73e8 seconds of bio time

total compute needed for training = ? flp/n/s * 4.73e8 bio_seconds * 1.3e9 neurons = ? flops

 

Numenta paper 2

Avoiding Catastrophe: Active Dendrites Enable Multi-Task Learning in Dynamic Environments

https://arxiv.org/abs/2201.00042 

This paper separates the neurons into collections of artificial dendrites in sparse matrices. Because it's not using FPGAs here, and because it does task comparisons against standard multi-layer perceptron feed-forward networks, the compute is easier to compare. They give numbers for the estimated 'effective number of parameters' because the sparse nature of the networks means that the number of parameters looks huge but is effectively small for the amount of compute required to train and infer using them. Several experiments are listed in the paper.

 

When employing the prototype method described in Section 4.2.1 to select context signals at test time only, we train an Active Dendrites Network with 2 hidden layers that comprise Active Dendrites Neurons. For all training, we use the Adam optimizer [Kingma and Ba, 2015] and a batch size of 256 samples. Table 3 gives the exact hyperparameters and model architecture for each model we train and evaluate on permutedMNIST. Note that hyperparameters were optimized individually for each setting.

To combine Active Dendrites Network with SI, and to compare against XdG, we reduce the number of units in each hidden layer from 2,048 to 2,000 as to exactly match the architectures (with the exception of dendritic segments) used in the SI and XdG papers. (See Appendix for a discussion on the number of parameters.) In addition, the SI-and-Active-Dendrites network is trained for 20 epochs per task instead of just 3 as this significantly improves results. We fix the learning rate to be 5 × 10^-4 for all numbers of tasks, and we use SI regularization strength c = 0.1 and damping coefficient ξ = 0.1. Both a) training for 20 epochs per task and b) the c, ξ values that we use here align with the training setups of Zenke et al. [2017] and Masse et al. [2018].

 

 

SpikingNN paper 1

https://www.sciencedirect.com/science/article/abs/pii/S0925231221003969

full text manuscript: https://www.sciencedirect.com/science/article/am/pii/S0925231221003969 

Ubuntu 18.04 LTS with Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.1 GHz and 32 GB RAM

 

SpikingNN paper 2

https://www.nature.com/articles/s41598-019-54957-7

For illustration we have used the data from the TITAN Xp card and Intel Core i9-7920X CPU

Caption for graph

Overview of the components that make up the total runtime of a simulation for the Mbody (left) and the COBAHH benchmark (right). The top panels show the time spent in the simulation itself which scales with the biological runtime of the model (shown at the right) and dominates the overall runtime for big networks and/or long simulations. Simulation times were measured for biological runtimes of 10 s (middle line), while the times for runs of 1 s (bottom line) and 100 s (top line) were extrapolated. The bottom panels show the time spent for code generation and compilation (blue), general overhead such as copying data between the CPU and the GPU (orange), and the time for synapse creation and the initialization of state variables before the start of the simulation (green). The details shown here are for single-precision simulations run on the Titan Xp GPU.

 

10 bio_seconds took 10^4 wall clock seconds

so 1 bio_second to 1000 wall clock seconds for 2.05e7 neurons

flops = cores  *   (cycles/second)  *  (flops/cycle)

flops = (1 node * 3840 cores)    *  ( 1.6e9 cycles / second)  *  ( 2 flops / cycle) *  1e3 seconds = 1.229e16
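Here is the same peak-throughput arithmetic as a small Python sketch, in case anyone wants to plug in different hardware. It assumes every core does its maximum floating point work on every cycle, so it is an upper bound on the operations actually performed, not a measurement. The function name `peak_flop` is just for illustration.

```python
def peak_flop(cores: int, clock_hz: float, flops_per_cycle: float,
              wall_seconds: float) -> float:
    """Upper bound on floating point operations performed in wall_seconds.

    Assumes full utilization: every core busy on every cycle.
    """
    return cores * clock_hz * flops_per_cycle * wall_seconds


# 10 bio_seconds took about 1e4 wall clock seconds, i.e. a ~1000x slowdown.
wall_seconds_per_bio_second = 1e4 / 10

# Titan Xp figures used above: 3840 cores, 1.6e9 cycles/second, 2 flops/cycle.
titan_xp_flop = peak_flop(
    cores=3840,
    clock_hz=1.6e9,
    flops_per_cycle=2,
    wall_seconds=wall_seconds_per_bio_second,
)
print(f"{titan_xp_flop:.3e} flops per bio_second (peak)")
```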

flops / neuron

 flops / 2.05e7 neurons =  6.14e6 flp/n

flp/n per bio_second

 flp/n / 1 bio_second = 6.14e6  flp/n/s

 

So, for 1.3e9 neurons of the Cortex+Plasticity simulation type, for 15 bio_years of 'training time':

https://en.wikipedia.org/wiki/FLOPS says 2 flops per cycle per core for single-precision simulations run on the Titan Xp GPU (3840 cores)

flops per second of biological time:

15 years of bio time need for training? = 3.154e7 sec/year * 15 years =  4.73e8 seconds of bio time

total compute needed for training = 6.14e6 flp/n/s * 4.78e8 bio_seconds * 1.3e9 neurons = 3.82e24 flops
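The last step can be written as a small Python function, which makes it easy to vary the assumptions. This is a minimal sketch: it takes the per-neuron rate as an input rather than re-deriving it, and the 15 years of biological training time is my assumption from above. The neuron count is left as a parameter because the total scales linearly with it.

```python
SECONDS_PER_YEAR = 3.154e7


def training_flop(flop_per_neuron_per_bio_second: float,
                  neurons: float,
                  bio_years: float = 15.0) -> float:
    """Total compute = per-neuron rate x biological seconds x neuron count."""
    bio_seconds = SECONDS_PER_YEAR * bio_years
    return flop_per_neuron_per_bio_second * bio_seconds * neurons


# 1.3e9 is the neuron count used in the calculation above. The 13 billion
# half-neocortex figure quoted earlier would be 1.3e10, which would scale
# the total by 10x.
snn_total = training_flop(6.14e6, neurons=1.3e9)
print(f"{snn_total:.2e} flops")
```

With these inputs the total comes out at around 3.8e24 flops, with small differences from the figure above depending on how the intermediate values are rounded.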

 

 

https://github.com/BindsNET/bindsnet

 

Blue Brain paper 1

Large-Scale Simulation of Brain Tissue, Blue Brain Project, EPFL 

Technical Report for the ALCF Theta Early Science Program

Blue Brain paper 2

 CoreNEURON : An Optimized Compute Engine for the NEURON Simulator

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6763692/

From abstract:

We describe how CoreNEURON can be used as a library with NEURON and then compare performance of different network models on multiple architectures including IBM BlueGene/Q, Intel Skylake, Intel MIC and NVIDIA GPU.

From intro:

In the model of Markram et al. (2015) each neuron averages to about 20,000 differential equations to represent its electrophysiology and connectivity. To simulate the microcircuit of 31,000 neurons, it is necessary to solve over 600 million equations every 25 ms of biological time...
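The "over 600 million equations" figure in that quote follows directly from the two numbers given with it. A tiny Python check, with variable names of my own choosing:

```python
# Figures from the quoted passage (Markram et al. 2015 microcircuit model).
neurons_in_microcircuit = 31_000
equations_per_neuron = 20_000  # "about 20,000" per the quote, so approximate

# Equations that must be solved at every simulation time step.
equations_per_step = neurons_in_microcircuit * equations_per_neuron
print(f"{equations_per_step:,} equations per time step")
```

That works out to 620 million equations per step, consistent with the quote.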

In general, this paper describes the journey to making the Blue Brain NEURON model more efficient and able to work with GPUs. And then doing benchmarking comparisons.

The benchmarking systems with hardware details, compiler toolchains and network fabrics are summarized in Table 3. The Blue Brain IV (BB4) and Blue Brain V (BB5) systems are based on IBM BlueGene/Q (Haring et al., 2012) and HPE SGI 8600 (Hewlett Packard Enterprise, 2019) platforms respectively, hosted at the Swiss National Computing Center (CSCS) in Lugano, Switzerland. The BB4 system has 4,096 nodes comprising 65,536 PowerPC A2 cores. The BB5 system has three different compute nodes: Intel KNLs with low clock rate but high bandwidth MCDRAM, Intel Skylakes with high clock rate, and NVIDIA Volta GPUs. Vendor provided compilers and MPI libraries are used on both systems. The BB4 system is used for strong scaling benchmarks (see Figure 8) as it has a large core count compared to the BB5 system. All benchmarks were executed in pure MPI mode by pinning one MPI rank per core.

Strong scaling of CoreNEURON on the BB4 system (BlueGene/Q IBM PowerPC A2, 16 cores @ 1.6 GHz, 16 GB DRAM ) for two large scale models listed in Table 1: the Cortex+Plasticity model with 219 k neurons. [nathan note: blue line is actual measurement, black line is theoretical optimum]

 

Relevant part of the Table 1 discussed above:

Model name | Summary | #Neurons | #Compartments | #Synapses
Cortex + Plasticity | Somatosensory cortex model with synaptic plasticity | 2.19e5 | 9.95e7 | 8.72e8

Note: one major parameter change in human neocortex vs rodent is that human neocortex has more synaptic connections per neuron. This hurts scaling somewhat because of the additional complexity. I'm not able to give a precise estimate for this additional compute based on the data I've found so far on their work. My guess is somewhat less than 2 OOMs of extra cost in the worst case.

 

Note for anyone trying to read this paper: a comprehension-gotcha is that they confusingly talk about both 'compute nodes' (the computers or virtual computers used), and 'neuron nodes' (the component parts of a neuron which are each individually simulated each timestep) using just the term 'nodes'. You have to keep the context of the paragraph straight to know which one they mean at any given time.

 

So, from these two papers, although they don't quite lay out all the parameters together in an easy-to-interpret way...

 

bbp paper1: 27 seconds of compute time for 0.1 seconds of biological time for 1? neuron(s) on a single compute node? (GPU system)

flops per second of biological time:

 

bbp paper2: 2.19e5 rodent cortex neurons requires 2e3 seconds of 2048 nodes, each node 16 cores @ 1.6GHz for 0.001? seconds of biological time (abbr: bio_second). (supercomputer baseline, not GPU measurement)

 

flops = cores  *   (cycles/second)  *  (flops/cycle)

flops = (2048 nodes * 16 cores)    *  ( 1.6e9 cycles / second)  *  ( 8 flops / cycle) *  2e3 seconds = 8.39e17

flops / neuron

8.39e17 flops / 2.19e5 neurons =  3.83e12 flp/n

flp/n per bio_second

3.83e12 flp/n / 0.001 bio_second = 3.83e15 flp/n/s
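The chain from hardware specs to a per-neuron rate can be sketched in Python as follows. It uses the BB4 numbers from bbp paper2 as I read them, including the 0.001 seconds of biological time that I marked as uncertain, and it assumes peak throughput on every core, so treat the result as an upper bound. The function name is hypothetical.

```python
def flop_per_neuron_per_bio_second(nodes: int, cores_per_node: int,
                                   clock_hz: float, flops_per_cycle: float,
                                   wall_seconds: float, neurons: float,
                                   bio_seconds: float) -> float:
    """Peak flops spent per simulated neuron per second of biological time."""
    total_flop = nodes * cores_per_node * clock_hz * flops_per_cycle * wall_seconds
    return total_flop / neurons / bio_seconds


# BB4 (BlueGene/Q) run of the Cortex + Plasticity model.
bbp_rate = flop_per_neuron_per_bio_second(
    nodes=2048,
    cores_per_node=16,
    clock_hz=1.6e9,
    flops_per_cycle=8,    # 64-bit flops per core per cycle on PowerPC A2
    wall_seconds=2e3,
    neurons=2.19e5,
    bio_seconds=0.001,    # uncertain: my reading of the benchmark
)
print(f"{bbp_rate:.2e} flp/n/s")
```

If the biological time per benchmark run turns out to be different, the rate changes in inverse proportion, so that one parameter dominates the uncertainty here.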

 

So, for 1.3e9 neurons of the Cortex+Plasticity simulation type, for 15 bio_years of 'training time':

 

 

 

https://en.wikipedia.org/wiki/FLOPS says that IBM PowerPC A2 (Blue Gene/Q) gets 8 64bit flops per core per cycle

(The Blue Brain project was so named because it was designed in cooperation with IBM specifically to work with the Blue Gene supercomputer)

flops per second of biological time:

15 years of bio time need for training? = 3.154e7 sec/year * 15 years =  4.73e8 seconds of bio time

total compute needed for training = 3.82e15 flp/n/s * 4.78e8 bio_seconds * 1.3e9 neurons = 2.37e33 flops = 2.37e18 petaFLOPs

 

other Blue Brain papers:

 

In-Memory Compression for Neuroscience Applications - Bayly

https://github.com/DevinBayly/gsoc_report/blob/master/report.pdf

 

 

Reconstruction and Simulation of Neocortical Microcircuitry

https://www.cell.com/cell/fulltext/S0092-8674(15)01191-5

Side note: Why half-brain? 

Because there are multiple sources of evidence for half a human brain being sufficient to instantiate a general reasoning agent. 

One of these is the case of hemispherectomy. People with severe seizures have had portions of their brain removed to stop the seizures. This operation can be as extreme as an entire hemisphere of the brain. If this happens in childhood while the brain connections are still highly plastic, then close-to-normal function can be regained.

Another case I know of involved a birth defect resulting in a missing hemisphere. 

And yet another way significant brain tissue loss can happen is an ischemic event (oxygen deprivation and sudden harmful return). This tends to be quite bad for older adults who commonly experience this via strokes, because the brain is set in its ways by then and has a hard time regaining enough plasticity to rewire around the damage. But if it happens to a child, (e.g. a partial drowning), recovery is usually quite good (depending on exactly which bits are affected).

I think you could make do with even less than 50% if you were thoughtful about what you cut. Maybe as little as 30%. That's not a necessary condition for this thought experiment though.


If I'm reading this correctly, you think Numenta is comparable to regular neural nets, and you think spiking neural nets are more compute efficient than Numenta? Do spiking neural nets actually outperform neural nets? Can we run this comparison?

(This seems so unlikely because I'd have expected people to use spiking neural nets instead of regular neural nets if this were true)

I think spiking neural nets are at least 1, probably more like 2 OOMs more compute intensive to train, similarly effective, somewhat more efficient at learning from data. I think Numenta is probably even harder to train and even more data efficient. I can certainly test these hypotheses at small scale. I'll let you know what I find.

If you think spiking neural nets are more compute intensive, then why does this matter? It seems like we'd just get AGI faster with regular neural nets? (I think compute is more likely to be the bottleneck than data, so the data efficiency doesn't seem that relevant.)

Perhaps you think that if we use spiking neural nets, then we only need to train it for 15 human-years-equivalent to get AGI (similar to the Lifetime anchor in the bio anchors report), but that wouldn't be true if we used regular neural nets? Seems kinda surprising.

Maybe you think that the Lifetime anchor in bio anchors is the best anchor to use and so you have shorter timelines?

I wrote this specifically aimed at the case of "in the thought experiment where humanity got 12 orders of magnitude more compute this year, what would happen in the next 12 months?" I liked the post that Daniel wrote about that and wanted to expand on it. My claim is that even if everything that was mentioned in that post was tried and failed, there would still be these things to try. They are algorithms which already exist, which could be scaled up if we suddenly had an absurd amount of compute. Not all arguments about why standard approaches like Transformers fail also apply to these alternate approaches.

Right, but I don't yet understand what you predict happens. Let's say we got 12 OOMs of compute and tried these things. Do we now have AGI? I predict no.

Ah, gotcha. I predict yes, with quite high confidence (like 95%), for 12 OOMs and using the Blue Brain Project. The others I place only small confidence in (maybe 5% each). I really think the BBP has enough detail in its model to make something very like a human neocortex, and capable of being an AGI, if scaled up.

Thanks for doing this! This sort of discussion is fairly important to AGI timelines estimates, I think, because e.g. if we conclude that +12 OOMs would be 80% likely to work given today's ideas... etc. (BTW I think you linked to the wrong post at the beginning, I think you meant to link to this.)

I'm not convinced yet. Rohin makes some good objections below and then also it would help if you explained how these compute estimates convert into probability-that-it-would-work estimates. What are the ways you can think of that Blue Brain with +12 OOMs wouldn't work? Having enumerated those ways, how likely are each of them? And how likely is it that there are other ways you haven't thought of?

Yes, I'll fix that link [edit: fixed]. I have not yet thought hard about failure modes and probabilities for these cases. I can work on that and let you know what I come up with.

total compute needed for training = 3.82e12 flp/n/s * 4.78e8 bio_seconds * 1.3e9 neurons = 2.37e30 flops = 2.37e15 petaFLOPs

Should that be 3.82e15 flp/n/s, based on the numbers right above?

Yes, thanks for catching that.

 edit: fixed

So I would argue that all of the main contenders are very training data efficient compared to artificial neural nets. I'm not going to go into detail on that argument, unless people let me know that that seems cruxy to them and they'd like more detail.

I'm not sure I get this enough for it to even be a crux, but what's the intuition behind this?

My guess for your argument is that you see it as analogous to the way a CNN beats out a fully-connected one at image recognition, because it cuts down massively on the number of possible models, compatibly with the known structure of the problem.

But that raises the question, why are these biology-inspired networks more likely to be better representations of general intelligence than something like transformers? Genuinely curious what you'll say here.

(Wisdom of evolution only carries so much weight for me, because the human brain is under constraints like collocation of neurons that prevent evolution from building things that artificial architectures can do.)

I don't think they are better representations of general intelligence. I'm quite confident that much better representations of general intelligence exist and just have yet to be discovered. I'm just saying that these are closer to a proven path, and although they are inefficient and unwise, somebody would likely follow these paths if suddenly given huge amounts of compute this year. And in that imaginary scenario, I predict they'd be pretty effective.

My reasoning for saying this for the Blue Brain Project is that I've read a lot of their research papers, and understand their methodology pretty well, and I believe they've got really good coverage of a lot of details. I'm like 97% confident that whatever 'special sauce' allows the human brain to be an effective general intelligence, BBP has already captured that in their model. I think they've captured every detail they could justify as being possibly slightly important, so I think they've also captured a lot of unnecessary detail. I think this is bad for interpretability and compute efficiency. I don't recommend this path, I just believe it fulfills the requisites of the thought experiment on 12 OOMs of compute magically appearing.

What's the difference between flp and flop? Or is that a typo / abbreviation?

How do your numbers compare to the numbers in Joe Carlsmith's report? For example, the number I've had in my head comes from that report, namely “a real-time human brain simulation might require something like 1e15 FLOP/s, plus or minus a few orders of magnitude”. (See here.)

I was using flp as an abbreviation. And I'll read Joe Carlsmith's report and then let you know what I think.

edit: Oh yeah, and one thing to keep in mind is these are estimates for if we suddenly had a shockingly big jump in amount of compute (12 orders of magnitude) but no time to develop or improve existing algorithms. So my estimates for 'what could a reasonably well engineered algorithm, that had been tested and iterated on at scale, do?' would be much much lower. This is a stupidly wasteful upper bound.