I commented at the time (1,2,3) in the form of skepticism about the usefulness of the "Genome Anchor" section of the report. Later I fleshed out those thoughts in my post Against Evolution as an Analogy for how Humans Will Create AGI, see especially the "genome=code" analogy table near the top.
In this post I want to talk about a different section of the report: the "Lifetime Anchor".
1. Assumptions for this post
Here are some assumptions. I don’t exactly believe them—let alone with 100% confidence—but for the purpose of this post let’s say I do. I’m not going to present any evidence for or against them here. Think of it as the Jeff Hawkins perspective or something.
ASSUMPTION 1: There’s a “secret sauce” of human intelligence, and it looks like a learning algorithm (and associated inference algorithm).
ASSUMPTION 2: It’s a fundamentally different learning algorithm from deep neural networks. I don’t just mean a different neural network architecture, regularizer, etc. I mean really different, like “involving probabilistic program inference algorithms” or whatever.
ASSUMPTION 3: The algorithm is human-legible, but nobody knows how it works yet.
ASSUMPTION 4: We'll eventually figure out this “secret sauce” and get Transformative AI (TAI). [Note added for clarification: To simplify the discussion, I'm assuming that when this is all happening, we don't already have TAI independently via some unrelated R&D path.]
If you think these assumptions are all absolutely 100% wrong, well, I guess you might not find this post very interesting.
To be clear, Ajeya pretty much explicitly rejected these assumptions when writing her report (cf. discussion of “algorithmic breakthroughs” here), so there's no surprise that I wind up disagreeing with what she wrote. Maybe I shouldn't even be using the word "disagree" in this post. Oh well; her report is still a good starting point / foil for present purposes.
2. Thesis and outline
I will argue that under those assumptions, once we understand that “secret sauce”, it’s plausible that we will then be <10 years away from optimized, tested, well-understood, widely-used, industrial-scale systems for training these models all the way to TAI.
I’ll also argue that training these models from scratch will plausibly be easily affordable, as in <$10M—i.e., a massive hardware overhang.
(By “plausible” I mean >25% probability I guess? Sorry, I’m not at the point where I can offer a probability distribution that isn’t pulled out of my ass.)
Outline of the rest of this post: First I’ll summarize and respond to Ajeya’s discussion of the “Lifetime Anchor” (which is not exactly the scenario I’m talking about here, but close). Then I’ll talk (somewhat speculatively) about time and cost involved in refactoring and optimizing and parallelizing and hardware-accelerating and scaling the new algorithm, and in doing training runs.
3. Background: The “Lifetime Anchor” in Ajeya Cotra's draft report
In Ajeya's draft report, one of the four bases for estimating TAI timelines is the so-called “Lifetime Anchor”.
She put it in the report but puts very little stock in it: she only gives it 5% weight.
What is the “Lifetime Anchor”? Ajeya starts by estimating that simulating a brain from birth to adulthood would involve a median estimate of 1e24 floating-point operations (FLOP). This comes from 1e24 FLOP ≈ 1e15 FLOP/s × 30 years, with the former being roughly the median estimate in Joe Carlsmith’s report, and 30 years being roughly the age of human adulthood (which also rounds to a nice even 1e9 seconds). Actually, she uses the term “30 subjective years” to convey the idea that if we do a 10×-sped-up simulation of the brain, then the same training would take 3 years of wall-clock time, for example.
A 1e24 FLOP computation would cost about $10M in 2019, she says, and existing ML projects (like training AlphaStar at 1e23 FLOP) are already kinda in that ballpark. So 1e24 FLOP is ridiculously cheap for a transformative world-changing AI. (Memory requirements are also relevant, but I don’t think they change that picture, see footnote.)
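Here is a minimal sketch of that arithmetic, using only the figures quoted above (1e15 FLOP/s, 30 subjective years, and a $10M price tag for 1e24 FLOP). The function names are my own, and the "implied dollars per FLOP" is just the ratio of those two quoted numbers, not a figure taken from the report.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # about 3.16e7 seconds

BRAIN_FLOP_PER_S = 1e15      # rough median brain-compute estimate quoted above
SUBJECTIVE_YEARS = 30        # "30 subjective years" of training
COST_OF_1E24_FLOP_USD = 1e7  # "about $10M in 2019", as quoted above


def lifetime_flop(flop_per_s=BRAIN_FLOP_PER_S, years=SUBJECTIVE_YEARS):
    """Total FLOP to run the brain-equivalent computation for `years`."""
    return flop_per_s * years * SECONDS_PER_YEAR


def wall_clock_years(subjective_years, speedup):
    """Wall-clock duration if the simulation runs `speedup` times real time."""
    return subjective_years / speedup


def implied_usd_per_flop(cost_usd=COST_OF_1E24_FLOP_USD, flop=1e24):
    """Price per FLOP implied by the quoted cost (a derived, illustrative number)."""
    return cost_usd / flop


if __name__ == "__main__":
    print(f"lifetime FLOP: {lifetime_flop():.2e}")             # a bit under 1e24
    print(f"wall-clock at 10x: {wall_clock_years(30, 10)} years")
    print(f"implied $/FLOP: {implied_usd_per_flop():.0e}")
```

The exact product comes out slightly under 1e24 because 30 years is a bit less than 1e9 seconds; the report's figure is the rounded one.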
OK, so far she has a probability distribution centered at 1e24 FLOP, proportional to the distribution she derived from Joe Carlsmith’s report. She then multiplies by a, let’s call it, “computer-vs-brain inefficiency factor” that she represents as a distribution centered at 1000. (I’ll get back to that.) Then there’s one more step of ruling out extremely-low-compute scenarios. (She rules them out for reasons that wouldn't apply to the scenario of Section 1 that I'm talking about here.) She combines this with estimates of investment and incremental algorithmic improvements and Moore's law and so on, and she winds up with a probability distribution for what year we'll get TAI. That's her “lifetime anchor”.
4. Why Ajeya puts very little weight on the Lifetime Anchor, and why I disagree
Ajeya cites two reasons she doesn’t like the lifetime anchor.
First, it doesn’t seem compatible with the empirical model size and training estimates for current deep neural networks:
I think the most plausible way for this hypothesis to be true would be if a) it turns out we need a smaller model than I previously assumed, e.g. ~1e11 or ~1e12 FLOP / subj sec with a similar number of parameters, and b) that model could be trained on a very short horizon ML problem, e.g. 1 to 10 seconds per data point. Condition a) seems quite unlikely to me because it implies our architectures are much more efficient than brain architectures discovered by natural selection; I don’t think we have strong reason to expect this on priors and it doesn’t seem consistent with evidence from other technological domains. Condition b) seems somewhat unlikely to me because it seems likely by default that transformative ML problems have naturally long horizon lengths because we may need to select for abilities that evolution optimized for, and possible measures to get around that may or may not work.
Why I disagree: As in Section 1, the premise of this post is that the human brain algorithm is a fundamentally different type of learning algorithm than a deep neural network. Thus I see no reason to expect that they would have the same scaling laws for model size, training data, etc.
Second, the implication is that training TAI is so inexpensive that we could have been doing it years ago. As she writes:
Another major reason for skepticism is that (even with a median ~3 OOM larger than the human lifetime) this hypothesis implies a substantial probability that we could have trained a transformative model using less computation than the amount used in the most compute intensive training run of 2019 (AlphaStar at ~1e23 FLOP), and a large probability that we could have done so by spending only a few OOMs more money (e.g. $30M to $1B). I consider this to be a major point of evidence against it, because there are many well-resourced companies who could have afforded this kind of investment already if it would produce a transformative model, and they have not done so. See below for the update I execute against it.
Why I disagree: Again as in Section 1, the premise of this post is that nobody knows how the algorithm works. People can’t use an algorithm that doesn’t yet exist.
5. Why Ajeya thinks the computer-vs-brain inefficiency factor should be >>1, and why I disagree
Ajeya mentions a few reasons she wants to center her computer-vs-brain-inefficiency-factor distribution at 1000. I won’t respond to all of these, since some would involve a deep-dive into neuroscience that I don’t want to get into here. But I can respond to a couple.
First, deep neural network data requirements:
Many models we are training currently already require orders of magnitude more data than a human sees in one lifetime.
Why I disagree: Again under the assumptions of Section 1, “many models we are training” are very different from human brain learning algorithms. Presumably human brain-like learning algorithms will have similar sample efficiency to actual human brain learning algorithms, for obvious reasons.
Second, she makes a reference-class argument using other comparisons between biological and human artifacts:
Brain FLOP/s seems to me to be somewhat more analogous to “ongoing energy consumption of a biological artifact” while lifetime FLOP seems to be more analogous to “energy required to manufacture a biological artifact”; Paul’s brief investigation comparing human technologies to natural counterparts, which I discussed in Part 1, found that the manufacturing cost of human-created artifacts tend to be more like ~3-5 OOM worse than their natural counterparts, whereas energy consumption tends to be more like ~1-3 OOM worse.
Why I disagree: Ajeya mentions two reference class arguments here: (1) “human-vs-brain FLOP/s ratio” is hypothesized to fit into the reference class of “human-artifact-vs-biological-artifact ongoing energy consumption ratio”; and (2) “human-vs-brain lifetime FLOP” is hypothesized to fit into the reference class of “human-artifact-vs-biological-artifact manufacturing energy”.
Under my assumptions here, the sample efficiency of brains and silicon should be similar—i.e., if you run similar learning algorithms on similar data, you should get similarly-capable trained models at the end. So from this perspective, the two ratios have to agree—i.e., these are two reference classes for the very same quantity. That’s fine; in fact, Ajeya’s median estimate of 3 OOM is nicely centered between the ~1-3 OOM reference class and the ~3-5 OOM reference class.
But I actually want to reject both of those numbers, because I think Joe Carlsmith’s report has already “priced in” human inefficiency by translating from neuron-centric metrics (number of neurons, synapses etc.) to silicon-centric metrics (FLOPs). (And then we estimated costs based on known $/FLOP of human ML projects.) So when we talk about FLOPs, we’ve already crossed over into human-artifact-world! It would be double-counting to add extra OOMs for human inefficiency.
Here's another way to make this same point: think about energy usage. Joe Carlsmith’s report says we need (median) 1e15 FLOP/s to simulate a brain. Based on existing hardware (maybe 5e9 FLOP/joule? EDIT: …or maybe much lower; see comment), that implies (median) 200kW to simulate a brain. (Hey, $20/hour electricity bills, not bad!) Actual brains are maybe 20W, so we’re expecting our brain simulation to be about 4 OOM less energy-efficient than a brain. OK, fine.
…But now suppose I declare that in general, human artifacts are 3 OOM less efficient than biological artifacts. So we should really expect 4+3=7 OOM less energy efficiency, i.e. 200MW! I think you would say: that doesn’t make sense, it’s double-counting! That’s what I would say, anyway! And I’m suggesting that the above draft report excerpt is double-counting in an analogous way.
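To make the double-counting concrete, here is a small sketch of the energy arithmetic above. The hardware efficiency (5e9 FLOP/joule) and brain power (20 W) are the rough figures quoted in the text; the $0.10/kWh electricity price is my own assumption, chosen because it reproduces the "$20/hour" remark.

```python
import math

BRAIN_FLOP_PER_S = 1e15          # median brain-compute estimate quoted above
HARDWARE_FLOP_PER_JOULE = 5e9    # rough hardware efficiency quoted above
BRAIN_WATTS = 20.0               # rough power draw of an actual brain
USD_PER_KWH = 0.10               # assumed electricity price (not from the text)


def simulation_watts():
    """Power needed to run 1e15 FLOP/s on the assumed hardware."""
    return BRAIN_FLOP_PER_S / HARDWARE_FLOP_PER_JOULE


def oom_gap(sim_watts, brain_watts=BRAIN_WATTS):
    """Orders of magnitude by which the simulation is less energy-efficient."""
    return math.log10(sim_watts / brain_watts)


def double_counted_watts(extra_oom=3):
    """What you get if you add a generic 'artifacts are 3 OOM worse' penalty
    on top of a FLOP estimate that already reflects silicon inefficiency."""
    return simulation_watts() * 10 ** extra_oom


if __name__ == "__main__":
    watts = simulation_watts()
    print(f"simulation power: {watts / 1e3:.0f} kW")
    print(f"electricity: ${watts / 1e3 * USD_PER_KWH:.0f}/hour")
    print(f"gap vs. brain: {oom_gap(watts):.0f} OOM")
    print(f"double-counted: {double_counted_watts() / 1e6:.0f} MW")
```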
5.1 …And indeed why the computer-vs-brain inefficiency factor should be <<1!
My best guess for the inefficiency factor is actually <<1! (…At least, that’s my guess after a few years of people using these algorithms and picking the low-hanging fruit of implementing them efficiently.)
Why? Compare the following two possibilities:
- We understand the operating principles of the brain-like learning algorithms, and then implement those same learning algorithms on our silicon chips, versus
- We use our silicon chips to simulate biological neurons which in turn are running those brain-like learning algorithms.
Doing the second bullet point gets us an inefficiency factor of 1, by definition. But the second bullet point is bound to be far more inefficient than the first.
By analogy: If I want to multiply two numbers with my laptop, I can do it in nanoseconds directly, or I can do it dramatically slower by using my laptop to run a transistor-by-transistor simulation of a pocket calculator microcontroller chip.
Or here’s a more direct example: There’s a type of neuron circuit called a “central pattern generator”. (Fun topic by the way, see here.) A simple version might involve, for example, 30 neurons wired up in a particular way so as to send a wave of activation around and around in a loop forever. Let’s say (hypothetically) that this kind of simple central pattern generator is playing a role in an AGI-relevant algorithm. The second bullet point above would be like doing a simulation of those 30 neurons and all their interconnections. The first bullet point above would be like writing the one line of source code, “y = sin(ωt+φ)”, and then compiling that source code into assembly language. I think it’s obvious which one would require less compute!
(Silicon chips are maybe 7 OOM faster than brains. A faster but less parallel processor can emulate a slower but more parallel processor, but not vice-versa. So there’s a whole world of possible algorithm implementation strategies that brains cannot take advantage of but that we can—directly calculating sin(ωt+φ) is just one example.)
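As a toy illustration of the two bullet points, the sketch below compares a crude "simulate the circuit" approach with the "just compute the function" approach. Everything here is hypothetical: the ring model (each unit simply copies its predecessor) is far simpler than any real neuron simulation, and the names are mine. The point is only that the emulation pays a per-unit cost on every tick, whereas the direct version is a single function call.

```python
import math

N_NEURONS = 30  # size of the hypothetical central pattern generator


def ring_step(state):
    """Advance the toy ring by one tick: each unit copies its predecessor."""
    return [state[i - 1] for i in range(len(state))]


def simulate_ring(n_steps, n_neurons=N_NEURONS):
    """'Second bullet point': emulate every unit on every tick.

    Returns the activity of unit 0 over time. Cost is roughly
    n_steps * n_neurons unit updates, even in this stripped-down model.
    """
    state = [1.0] + [0.0] * (n_neurons - 1)  # one bump of activation at unit 0
    outputs = []
    for _ in range(n_steps):
        outputs.append(state[0])
        state = ring_step(state)
    return outputs


def direct_wave(t, omega, phi=0.0):
    """'First bullet point': compute the periodic signal directly."""
    return math.sin(omega * t + phi)


if __name__ == "__main__":
    ring = simulate_ring(60)
    print("ring bump returns to unit 0 at ticks:",
          [t for t, v in enumerate(ring) if v == 1.0])
    omega = 2 * math.pi / N_NEURONS  # same period as the ring, in ticks
    print("direct signal at t=0 (phase pi/2):", direct_wave(0, omega, math.pi / 2))
```

The two outputs are not identical waveforms (a circulating bump versus a sinusoid); what they share is the period, which is the functionally relevant quantity in this toy setup.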
The scenario I’m talking about (see assumptions in Section 1) is the first bullet point above, not the second. So I consider an inefficiency factor <<1 to be a default expectation, again leaving aside the very earliest thrown-together implementations.
6. Some other timeline-relevant considerations
6.1 How long does it take to get from janky grad-student code to polished, scalable, parallelized, hardware-accelerated, turn-key learning algorithms?
On the assumptions of Section 1, a brain-like learning algorithm would be sufficiently different from DNNs that some of the existing DNN-specific infrastructure would need to be re-built (things like PyTorch, TPU chips, pedagogical materials, a trained workforce, etc.).
How much time would that add?
Well I’ll try to draw an analogy with the history of DNNs (warning: I’m not terribly familiar with the history of DNNs).
AlexNet was 2012, DeepMind patented deep Q learning in 2014, the first TensorFlow release was 2015, the first PyTorch release was 2016, the first TPU was 2016, and by 2019 we had billion-parameter GPT-2.
So, maybe 7 years?
But that may be an overestimate. I think a lot of the deep neural net infrastructure will carry over to even quite different future ML algorithms. For example, the building up of people and money in ML, the building up of GPU servers and the tools to use them, the normalization of the idea that it’s reasonable to invest millions of dollars to train one model and to fab ML ASICs, the proliferation of expertise related to parallelization and hardware-acceleration, etc.—all these things would transfer directly to future human-brain-like learning algorithms. So maybe they’ll be able to develop in less time than it took DNNs to develop in the 2010s.
So, maybe the median guess should be somewhere in the range of 3-6 years?
6.2 How long (wall-clock time) does it take to train one of these models?
Should we expect engineers to be twiddling their thumbs for years and years, as their training runs run? If so, that would obviously add to the timeline.
The relevant factor here is limits to parallelization. If there weren’t limits to parallelization, you could make wall-clock time arbitrarily low by buying more processing power. For example, AlphaStar training took 14 days and totaled 1e23 FLOP, so it’s presumably feasible to squeeze a 1e24-FLOP, 30-subjective-year, training run into 14×10=140 days—i.e., 80 subjective seconds per wall-clock second. With more money, and another decade or two of technological progress, and a brain-vs-computer inefficiency factor <<1 as above, it would be even faster. But that case study only works if our future brain-like algorithms are at least as parallelizable as AlphaStar was.
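The AlphaStar extrapolation can be sketched as follows. It assumes, as the paragraph above does, that the new run sustains the same FLOP per day as AlphaStar did and parallelizes equally well; the function names are illustrative.

```python
ALPHASTAR_FLOP = 1e23   # total training compute quoted above
ALPHASTAR_DAYS = 14     # wall-clock training time quoted above
DAYS_PER_YEAR = 365.25


def wall_clock_days(total_flop):
    """Days needed for `total_flop` at AlphaStar's sustained throughput."""
    flop_per_day = ALPHASTAR_FLOP / ALPHASTAR_DAYS
    return total_flop / flop_per_day


def subjective_speedup(subjective_years, days):
    """Subjective seconds simulated per wall-clock second."""
    return subjective_years * DAYS_PER_YEAR / days


if __name__ == "__main__":
    days = wall_clock_days(1e24)
    print(f"wall-clock time for 1e24 FLOP: {days:.0f} days")
    print(f"speedup over real time: {subjective_speedup(30, days):.0f}x")
```

This gives about 78×, which the text rounds to 80.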
Maybe my starting point should be AI Impacts’s Brain Performance in TEPS writeup? This comparison implies that existing supercomputers (as of the 2015 writeup) were not quite capable of real-time brain simulations (1 subjective second per wall-clock second), but they were within an order of magnitude. This makes it seem unlikely that we can get orders of magnitude faster than real-time. So, maybe we’ll be running our training algorithms for decades after all??
I’m not so sure. I still think it might well be much faster.
The most important thing is: I’m not a parallelization expert, but I assume that chip-to-chip connections are the bottleneck for the TEPS benchmark, not within-chip connections. (Someone please tell me if I’m wrong!) If I understand correctly, TEPS assumes that data is sent from an arbitrary node in the graph to a randomly-chosen different arbitrary node in the graph. So for a large calculation (more than a few chips), TEPS implicitly assumes that almost all connections are chip-to-chip. However, I think that in a brain simulation, data transmission events would be disproportionately likely to be within-chip.
For example, with adult brain volume of 1e6 mm³, and an AlphaStar-like 400 silicon chips, naively each chip might cover about (13.5mm)³ of brain volume. So any neuron-to-neuron connection much shorter than 13.5mm is likely to translate to within-chip communication, not chip-to-chip. Then the figures at this AI Impacts page imply that almost all unmyelinated fiber transmission would involve within-chip communication, and thus, chip-to-chip communication would mainly consist of:
- Information carried by long-range myelinated fibers. Using the AI Impacts figure of 160,000km of myelinated fibers, let’s guess that they're firing at 0.1-2 Hz and typically 5cm long, then I get (3-60)e8 chip-to-chip TEPS from this source;
- Information carried by short-range fibers that happen to be near the boundary between the simulation zones of two chips. If you make a planar slice through the brain, I guess you would cut through on average ~3.5e11 axons and dendrites per m² of slice (from 850,000km of axons and dendrites in a brain). (Warning: a different estimation method gave 6e12 per m² instead. Part of the discrepancy is probably that the latter is cortex and the former is the whole brain, including white matter which is presumably much more spaced out. Or maybe the AI Impacts 850,000km figure is wrong. Anyway, take all this with a large grain of salt.) So again if we imagine 400 chips each simulating a little (13.5mm)³ cube of brain, we get ~0.22 m² of total “virtual slices”, and if they’re firing at 0.1-2 Hz, we get something like (0.8-16)e10 chip-to-chip TEPS from this source.
Recall the headline figure of “brain performance in TEPS” was (1.8-64)e13. So the above is ~3 OOM less! If I didn’t mess up, I infer a combination of (1) disproportionate numbers of short connections which turn into within-chip communications, and (2) a single long-range myelinated axon that connects to a bunch of neurons near its terminal, which from a chip-to-chip-communications perspective would look like just one connection.
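The estimate above can be reproduced with a short script. This is a sketch under stated assumptions: I take the brain volume to be in mm³ and the slice densities and areas to be in m² (the units that make the quoted numbers come out), I read the headline range as 1.8e13 to 6.4e14 TEPS, and I count three shared faces per cube, which slightly overcounts by ignoring the brain's outer surface. All names are illustrative.

```python
import math

BRAIN_VOLUME_MM3 = 1e6            # adult brain volume used in the text
N_CHIPS = 400                     # AlphaStar-like chip count
MYELINATED_LENGTH_M = 1.6e8       # 160,000 km of myelinated fibers
TYPICAL_FIBER_LENGTH_M = 0.05     # guessed typical length: 5 cm
CROSSINGS_PER_M2 = 3.5e11         # fibers cut per m^2 of planar slice
FIRING_RATES_HZ = (0.1, 2.0)      # guessed firing-rate range
HEADLINE_TEPS = (1.8e13, 6.4e14)  # my reading of the headline range


def chip_cube_side_mm():
    """Side of the cube of brain that each chip would simulate."""
    return (BRAIN_VOLUME_MM3 / N_CHIPS) ** (1 / 3)


def myelinated_teps(rate_hz):
    """Chip-to-chip traffic from long-range fibers: fiber count times rate."""
    n_fibers = MYELINATED_LENGTH_M / TYPICAL_FIBER_LENGTH_M
    return n_fibers * rate_hz


def virtual_slice_area_m2():
    """Total boundary area between cubes: 6 faces each, each shared by 2 cubes."""
    side_m = chip_cube_side_mm() / 1000
    return N_CHIPS * 3 * side_m ** 2


def boundary_teps(rate_hz):
    """Chip-to-chip traffic from short fibers that straddle a cube boundary."""
    return CROSSINGS_PER_M2 * virtual_slice_area_m2() * rate_hz


if __name__ == "__main__":
    print(f"cube side: {chip_cube_side_mm():.1f} mm")
    print(f"virtual slice area: {virtual_slice_area_m2():.2f} m^2")
    for rate, headline in zip(FIRING_RATES_HZ, HEADLINE_TEPS):
        total = myelinated_teps(rate) + boundary_teps(rate)
        gap = math.log10(headline / total)
        print(f"rate {rate} Hz: {total:.1e} chip-to-chip TEPS, "
              f"{gap:.1f} OOM below headline")
```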
Some other considerations that seem to point in the direction of “wall-clock training time probably won’t be years and years”:
- Technology is presumably improving, especially around processor-to-processor communications, and presumably it will continue to do so. For example, it looks like the highest-TEPS supercomputer increased from 2.4e13 TEPS to 1.0e14 TEPS between 2014 and 2021, if I’m reading this right. (The second highest is still 2.4e13 though!)
- Again I’m not a parallelization expert, so maybe this is super-naive, but: whatever algorithms the brain is using, they’ve gotta be extremely parallelizable, right? Remember, we’re working with silicon chips that are ~7 OOM faster than the brain; even if we’re a whopping 100,000× less skillful at parallelizing brain algorithms than the brain itself, we’d still be able to simulate a brain at 100× speedup. So I guess I’d be pretty surprised if wall-clock time winds up being a showstopper, just on general principles.
- As mentioned above, I’m expecting the computer-vs-brain inefficiency factor to be <<1. I was talking about FLOPs there, but I think the same argument applies to TEPS.
- This is probably a <1 OOM effect, but I’ll say it anyway: I bet the “30 subjective years” figure is way overkill for TAI. Like, the smartest 15-year-old humans are much better programmers etc. than most adults, and even those smart 15-year-olds sure didn’t spend every minute of those 15 years doing optimally-efficient learning!!
- Update: See this comment about the possibility of "parallel experiences".
Update to add: Here’s another possible objection: training requires both compute and data. Even if we can muster enough compute, what if data is a bottleneck? In particular, suppose for the sake of argument that the only way to train a model to AGI involves having the model control a real-world robot which spends tens of thousands of hours of serial time manipulating human-sized objects and chatting with humans. (And suppose also that “parallel experiences” wind up being impossible.) Then that would limit model training speed, even if we had infinitely fast computers. However, I view that possibility as highly unlikely; see my discussion of “embodiment” in this post (Section 1.5). My strong expectation is that future programmers will be able to make AGI just fine by feeding it YouTube videos, books, VR environments, and other such easily-sped-up data sources, with comparatively little real-world-manipulation experience thrown in at the very end. (After all, going in the opposite direction, humans can learn very quickly to get around in a VR environment after a lifetime in the real world.)
6.3 How many full-length training runs do we need?
If a “full-length training run” is the 30 subjective years or whatever, then an additional question is: how many such runs will we need to get TAI? I’m inclined to say: as few as 1 or 2, plus lots and lots of smaller-scale studies. For example, I believe there was one and only one full training run of GPT-3—all the hyperparameters were extrapolated from smaller-scale studies, and it worked well enough the first time.
Note also that imperfect training runs don’t necessarily need to be restarted from scratch; the partially-trained model may well be salvageable, I’d assume. And it’s possible to run multiple experiments in parallel, especially when there’s a human in the loop contextualizing the results.
So anyway, combining this and the previous subsection, I think it’s at least plausible for “wall-clock time spent running training” to be a minor contributor to TAI timelines (say, adding <5 years). That’s not guaranteed, just plausible. (As above, "plausible" = ">25% probability I guess").
I’ll just repeat what I said in Section 2 above: if you accept the assumptions in Section 1, I think we get the following kind of story:
We can’t train a lifetime-anchor model today because we haven’t pinned down the brain-like learning algorithms that would be needed for it. But when we understand the secret sauce, we could plausibly be <10 years away from optimized, tested, well-understood, widely-used, industrial-scale systems for training these models all the way to TAI. And this training could plausibly be easily affordable, as in <$10M—i.e., a MASSIVE hardware overhang.
(Thanks Dan Kokotajlo & Logan Smith for critical comments on drafts.)
Warning: FLOP is only one of several inputs to an algorithm. Another input worth keeping in mind is memory. In particular, the human neocortex has ≈1e14 synapses. How this number translates into (for example) GB of GPU memory is complicated, and I have some uncertainty, but I think my Section 6.2 scenario (involving an AlphaStar-like 400 chips) does seem to be in the right general ballpark for not only FLOP but also memory storage.
I assumed the axons and dendrites are locally isotropic (equally likely to go any direction); that gives a factor of 2 from averaging cos θ over a hemisphere.
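For readers who want the isotropy factor spelled out, here is a sketch of the derivation. The symbols are my own: \(L\) is total fiber length, \(V\) is brain volume, \(\rho_L = L/V\) is length density, and \(\theta\) is the angle between a fiber segment and the normal to the slice. A segment's chance of being cut by a random plane is proportional to its extent along the normal, \(|\cos\theta|\), so the number of fibers cut per unit area is the length density times the average of \(|\cos\theta|\) over directions.

```latex
\[
  n_{\text{cross}} \;=\; \rho_L \,\langle |\cos\theta| \rangle,
  \qquad \rho_L = \frac{L}{V},
\]
\[
  \langle |\cos\theta| \rangle
  \;=\; \frac{\int_0^{\pi/2} \cos\theta \,\sin\theta \, d\theta}
             {\int_0^{\pi/2} \sin\theta \, d\theta}
  \;=\; \frac{1/2}{1} \;=\; \frac{1}{2},
\]
\[
  n_{\text{cross}} \;=\; \frac{L}{2V}
  \;\approx\; \frac{8.5\times 10^{8}\ \text{m}}{2 \times 10^{-3}\ \text{m}^{3}}
  \;\approx\; 4\times 10^{11}\ \text{m}^{-2}.
\]
```

The last line plugs in the 850,000km figure and a rounded brain volume of 1e6 mm³ (1e-3 m³), which lands in the same ballpark as the ~3.5e11 per m² used in Section 6.2; a slightly larger assumed volume would account for the difference.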