If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running roughly 1,000,000 concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed".  A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

For practical reasons, the compute required to train an LLM is several orders of magnitude larger than what is needed to run a single inference instance.  In particular, a single NVIDIA H100 GPU can run inference at a throughput of about 2,000 tokens/s, while Meta trained Llama3 70B on a GPU cluster[1] of about 24,000 GPUs.  Assuming we require a performance of 40 tokens/s, the training cluster can run roughly 1.6 million concurrent instances of the resulting 70B model.
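A quick sketch of the arithmetic, treating the quoted throughput and cluster size as order-of-magnitude inputs and the required per-instance speed as the main free parameter (shown here at both 30 and 40 tokens/s):

```python
# Back-of-the-envelope: how many concurrent inference instances could the
# training cluster host once training is complete?

training_gpus = 24_000          # approximate Llama3 70B training cluster size
tokens_per_sec_per_gpu = 2_000  # approximate H100 inference throughput for a 70B model

for required_tokens_per_sec in (30, 40):
    instances = training_gpus * tokens_per_sec_per_gpu / required_tokens_per_sec
    print(f"{required_tokens_per_sec} tokens/s per instance -> {instances:,.0f} instances")
# 30 tokens/s -> 1,600,000 instances; 40 tokens/s -> 1,200,000 instances
```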

I will assume that the above ratios hold for an AGI level model.  Considering the amount of data children absorb via the vision pathway, the amount of training data for LLMs may not be that much higher than the data humans are trained on, and so the current ratios are a useful anchor.  This is explored further in the appendix.

Given the above ratios, we will have the capacity for ~1e6 AGI instances at the moment that training is complete.  This will likely lead to superintelligence via a "collective superintelligence" approach.  Additional speed may then be available via accelerators such as GroqChip, which produces 300 tokens/s for a single instance of a 70B model. This would result in a "speed superintelligence" or a combined "speed+collective superintelligence".

From AGI to ASI

With 1e6 AGIs, we may be able to construct an ASI, with the AGIs collaborating in a "collective superintelligence".  Similar to groups of collaborating humans, a collective superintelligence divides tasks among its members for concurrent execution.

AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical.  Any fine-tune can be applied to all members, and text produced by one can be understood by all members.

Tasks that are inherently serial would benefit more from a speedup than from a division of tasks.  An accelerator such as GroqChip will be able to accelerate serial thought speed by a factor of 10x or more.
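One way to make the serial-vs-parallel tradeoff concrete is Amdahl's law: if a fraction s of a task is inherently serial, N collaborating instances can speed it up by at most 1/(s + (1-s)/N), whereas a faster accelerator speeds up the serial part directly. A minimal sketch (the serial fractions are illustrative assumptions, not measurements):

```python
# Amdahl's-law sketch: the cap on speedup from dividing a task among N workers.
# Heavily serial tasks gain more from a faster single instance (e.g. a 10x
# token-rate accelerator) than from adding more copies.

def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

for s in (0.01, 0.1, 0.5):  # illustrative serial fractions
    cap = amdahl_speedup(s, n_workers=1_000_000)
    print(f"serial fraction {s:.0%}: cap with 1e6 workers ~ {cap:,.0f}x")
# 1% serial -> ~100x cap; 10% -> ~10x cap; 50% -> ~2x cap
```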

Counterpoints

  • It may be the case that a collective of sub-AGI models can reach AGI capability.  It would be advantageous if we could achieve AGI earlier, with sub-AGI components, at a higher hardware cost per instance.  This would reduce the compute overhang at the critical point in time.
  • There may be a paradigm change on the path to AGI that results in smaller training clusters, reducing the overhang at the critical point.

Conclusion

A single AGI may be able to replace one human worker, presenting minimal risk.  A fleet of 1,000,000 AGIs may give rise to a collective superintelligence.  This capability is likely to be available immediately upon training the AGI model.

We may be able to mitigate the overhang by achieving AGI with a cluster of sub-AGI components.

Appendix - Training Data Volume

A calculation of training data processed by humans during development:

  • time: ~20 years, or 6e8 seconds
  • raw data input: ~10 Mbit/s = 1e7 bits/s
  • total for human training data: 6e15 bits
  • Llama3 training size: 1.5e13 tokens × 16 bits ≈ 2e14 bits

The amount of data used for training current generation LLMs seems comparable to the amount processed by humans during childhood.
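The same comparison as a short script (the 10 Mbit/s sensory-bandwidth figure and 16 bits/token are the rough assumptions listed above):

```python
# Rough comparison of data "seen" by a human during development vs. the
# Llama3 text corpus, using the assumptions from the list above.

human_seconds = 20 * 365 * 24 * 3600                 # ~20 years ≈ 6.3e8 s
human_bits_per_second = 1e7                          # assumed ~10 Mbit/s of raw sensory input
human_bits = human_seconds * human_bits_per_second   # ≈ 6e15 bits

llama3_tokens = 1.5e13
bits_per_token = 16                                  # rough encoding assumption
llama3_bits = llama3_tokens * bits_per_token         # ≈ 2.4e14 bits

print(f"human ≈ {human_bits:.1e} bits, Llama3 ≈ {llama3_bits:.1e} bits, "
      f"ratio ≈ {human_bits / llama3_bits:.0f}x")    # within ~1.5 orders of magnitude
```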

References

 

  1. ^ Two clusters are actually in production, and a 400B model is still being trained.

Comments

AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.

I think this only holds if fine tunes are composable, which as far as I can tell they aren't (fine tuning on one task subtly degrades performance on a bunch of other tasks, which isn't a big deal if you fine tune a little for performance on a few tasks but does mean you probably can't take a million independently-fine-tuned models and merge them into a single super model of the same size with the same performance on all million tasks).

Also there are sometimes mornings where I can't understand code I wrote the previous night when I had all of the necessary context fresh to me, despite being the same person. I expect that LLMs will exhibit the same behavior of some things being hard to understand when examined out of the context which generated them.

That's not to say a world in which there are a billion copies of GPT-5 running concurrently will have no major changes, but I don't think a single coherent ASI falls out of that world.

gwern:

I think this only holds if fine tunes are composable, which as far as I can tell they aren't

You know 'finetunes are composable', because a finetune is just a gradient descent step on a batch of data and a parameter update, and if you train on more than one GPU and share updates, DL training still works {{citation needed}}.

If you can train asynchronously on a thousand, or 20,000, or 100,000 GPUs, that is what you are doing; this is especially true in DRL, where you might be, say, training across 170,000 CPU-cores. This works because you don't insist on everything being up to date every moment and you accept that there will be degrees of inconsistency/outdatedness. (You are certainly not accumulating the gradient across the entire cluster by waiting for every single node, pausing everything, calculating a single global step, and pushing it out, and only then resuming, as if it were a single GPU! Really, you don't even want to do that on a single GPU for DRL if you gotta go fast.) This works so well that people will casually talk about training "an" AlphaZero, even though they actually mean something more like "the 512 separate instances of AlphaZero we are composing finetunes of" (or more).*

You do have issues with stale gradients and off-policyness of updates and how to best optimize throughput of all of the actors vs training nodes and push out model updates efficiently so nodes stop executing outdated parameters as quickly as possible, and DeepMind & OpenAI etc have done a lot of work on that - but at that point, as in the joke, you have conceded that finetunes are composable and you can keep a very large number of replicas in sync, and it is merely a matter of haggling over how much efficiency you lose.

Also note that it takes a lot less compute to keep a model up to date doing simple online learning on new data than it does to train it from scratch on all historical data summed together (obviously), so what devrandom is talking about is actually a lot easier than creating the model in the first place.

A better model to imagine is not "somehow finetunes from millions of independent models magically compose" (although actually they would compose pretty well), but more like, "millions of independent actors do their ordinary business, while spending their spare bandwidth downloading the latest binary delta from peer nodes (which due to sparsity & not falling too far out of sync, is always on the order of megabytes, not terabytes), and once every tens of thousands of forward passes, discover a novel or hard piece of data, and mail back a few kilobytes of text to the central training node of a few thousand GPUs, who are continually learning on the hard samples being passed back to them by the main fleet, and who keep pushing out an immediately updated model to all of the actor models, and so 'the model' is always up to date and no instance is more than hours out of date with 'the model' (aside from the usual long tail of stragglers or unhealthy nodes which will get reaped)".
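A toy sketch of that delta-sharing picture, with NumPy arrays standing in for model weights (the "tasks", targets, and update rule here are purely illustrative, not a real training setup):

```python
import numpy as np

# Toy illustration: a finetune is just a parameter delta, and deltas computed
# by different replicas can be averaged and broadcast so that every copy stays
# in sync. Arrays stand in for weights; the "tasks" are made up.

rng = np.random.default_rng(0)
base_weights = rng.normal(size=1000)             # stand-in for a shared base model

def local_finetune_delta(weights, task_seed, lr=0.01, steps=10):
    """Pretend gradient steps on one replica's local data; returns a delta."""
    local = weights.copy()
    target = np.random.default_rng(task_seed).normal(size=weights.shape)
    for _ in range(steps):
        local -= lr * (local - target)           # toy "gradient" toward a task optimum
    return local - weights

# Each replica sends back a small delta; the center averages and rebroadcasts.
deltas = [local_finetune_delta(base_weights, seed) for seed in range(8)]
updated = base_weights + np.mean(deltas, axis=0)
print("per-replica delta norms:", [round(float(np.linalg.norm(d)), 2) for d in deltas[:3]])
print("every copy now runs the same updated weights, norm:",
      round(float(np.linalg.norm(updated)), 2))
```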

* I fear this is one of those cases where our casual reification of entities leads to poor intuitions, akin to asking 'how many computers are in your computer you are using right now?'; usually, the answer is just '1', because really, who cares how exactly your 'smartphone' or 'laptop' or 'desktop' or 'server' is made up of a bunch of different pieces of silicon - unless you're discussing something like device performance or security, in which case it may matter quite a lot and you'd better not think of yourself as owning 'a' smartphone.

I think we may be using words differently. By "task" I mean something more like "predict the next token in a nucleotide sequence" and less like "predict the next token in this one batch of training data that is drawn from the same distribution as all the other batches of training data that the parallel instances are currently training on".

It's not an argument that you can't train a little bit on a whole bunch of different data sources, it's an argument that running 1.2M identical instances of the same model is leaving a lot of predictive power on the table as compared to having those models specialize. For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained both on the 20TB GenBank dataset and on all 14TB of code on Github.

Once you have a bunch of specialized models "the weights are identical" and "a fine tune can be applied to all members" no longer holds.

For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained both on the 20TB GenBank dataset and on all 14TB of code on Github.

I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).

Once you have a bunch of specialized models "the weights are identical" and "a fine tune can be applied to all members" no longer holds.

Nor do I see how this is relevant to your original claim. If you have lots of task-specialist models, how does this refute the claim that those will be able to coordinate? Of course they will. They will just share weight updates in exactly the way I just outlined, which works so well in practice. You may not be able to share parameter-updates across your protein-only and your Python-only LLMs, but they will be able to share updates within that model family and the original claim ("AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.") remains true, no matter how you swap out your definition of 'model'.

DL models are fantastically good at collaborating and updating each other, in many ways completely impossible for humans, whether you are talking about AGI models or narrow specialist models.

I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).

Man, that Li et al paper has pretty wild implications if it generalizes. I'm not sure how to square those results with the Chinchilla paper, though (I'm assuming it wasn't something dumb like "wall-clock time was better with larger models because training was constrained by memory bandwidth, not compute").

In any case, my point was more "I expect dumb throw-even-more-compute-at-it approaches like MoE, which can improve their performance quite a bit at the cost of requiring ever more storage space and ever-increasing inference costs, to outperform clever attempts to squeeze more performance out of single giant models". If models just keep getting bigger while staying monolithic, I'd count that as pretty definitive evidence that my expectations were wrong.

Edit: For clarity, I specifically expect that MoE-flavored approaches will do better because, to a first approximation, sequence modelers will learn heuristics in order of most to least predictive of the next token. That depends on the strength of the pattern and the frequency with which it comes up.

As a concrete example, the word "literally" occurs with a frequency of approximately 1/100,000. About 1/6,000 times it occurs, the word "literally" is followed by the word "crying", while about 1/40,000 of occurrences of the word "literally" are followed by "sobbing". If you just multiply it out, you should assume that if you saw the word "literally", the word "crying" should be about 7x more likely to occur than the word "sobbing". One of the things a language model could learn, though, is that if your text is similar to text from the early 1900s, that ratio should be more like 4:1, whereas if it's more like text from the mid 1900s it should be more like 50:1. Learning the conditional effect of the year of authorship on the relative frequencies of those 2-grams will improve overall model loss by about 3e-10 bits per word, if I'm calculating correctly (source: google ngrams).
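A rough reconstruction of that estimate (only the marginal frequencies come from the paragraph above; the 4:1 and 50:1 per-era splits and the 50/50 era mix are illustrative assumptions):

```python
import math

# Order-of-magnitude check of the "~3e-10 bits per word" figure.

p_literally = 1e-5                         # P(word == "literally")
p_followed = 1/6000 + 1/40000              # P(next word is crying or sobbing | "literally")

era_ratios = {"early-1900s": 4, "mid-1900s": 50}   # assumed crying:sobbing ratio per era
era_weight = 0.5                                   # assume an even mix of the two eras

# Era-conditional probabilities of each continuation, given "literally".
cond = {era: {"crying": r / (r + 1) * p_followed,
              "sobbing": 1 / (r + 1) * p_followed}
        for era, r in era_ratios.items()}
marginal = {w: era_weight * sum(cond[e][w] for e in cond) for w in ("crying", "sobbing")}

# Expected log-loss reduction from conditioning on era (a mutual-information term),
# per occurrence of "literally", then scaled by how often "literally" appears.
gain = sum(era_weight * cond[e][w] * math.log2(cond[e][w] / marginal[w])
           for e in cond for w in ("crying", "sobbing"))
print(f"~{p_literally * gain:.1e} bits per word")  # ~1e-10, same ballpark as above
```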

If there's some important fact about one specific unexpected nucleotide which occurs in half of mammalian genomes, but nucleotide sequence data is only 1% of your overall data and the other data you're feeding the model includes text, your model will prefer to learn a gajillion little linguistic facts on the level of the above over learning this cool tidbit of information about genomes. Whereas if you separate out the models learning linguistic tidbits from the ones predicting nucleotide sequences, learning little linguistic tricks will trade off against learning other little linguistic tricks, and learning little genetics facts will trade off against learning other little genetics facts.

And if someone accidentally dumps some database dumps containing a bunch of password hashes into the training dataset then only one of your experts will decide that memorizing a few hundred million md5 digests is the most valuable thing it could be doing, while the rest of your experts continue chipping happily away at discovering marginal patterns in their own little domains.

I think this only holds if fine tunes are composable, which as far as I can tell they aren't (fine tuning on one task subtly degrades performance on a bunch of other tasks, which isn't a big deal if you fine tune a little for performance on a few tasks but does mean you probably can't take a million independently-fine-tuned models and merge them into a single super model of the same size with the same performance on all million tasks).

I don't think I've ever heard of any evidence for this being the case. 

Probably the best search terms are "catastrophic interference" or "catastrophic forgetting". Basically, the issue is that if you take some model that is tuned on some task, and then fine-tune it on a different, unrelated task, performance on the first task will tend to degrade.

From a certain perspective, it's not particularly surprising that this happens. If you have a language model with 7B 32-bit parameters, that language model can contain at most 28 GB of compressed information. If the model is "full", any new information you push into it must necessarily "push" some other information out of it.

There are a number of ways to mitigate this issue, and in fact there's a whole field of research into ways to mitigate this issue. Examples:

  • Multitask Learning: Instead of training on a bunch of examples of task A, and then a bunch of examples of task B, interleave the examples of A and B. The model trained on A and B will perform better on both tasks A and B than the pretrained base model on both tasks A and B, though it will not perform as well as (the base model trained only on A) or (the base model trained only on B).
  • Knowledge Distillation: Like multitask learning, except that instead of directly fine-tuning a model on both tasks A and B, you instead do separate fine-tunes on A and on B and use knowledge distillation to train a third model to imitate the outputs of the fine-tuned-on-A or fine-tuned-on-B model, as appropriate for the training datapoint.
  • Mixture of Experts: Fine tune one model on A, and another on B, and then train a third model to predict which model should be used to make a prediction for each input (or more accurately, how the predictions of each expert model should be weighted in determining the output). This can scale to an almost arbitrary number of tasks, but the cost scales linearly with the number of experts (or better-than-linearly if you're clever about it, though the storage requirements still scale linearly with the number of experts). A minimal gating sketch of this idea follows the list.
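Here is that gating sketch (hand-written stand-ins for the experts and gate, with random projections instead of trained parameters, just to show the routing idea):

```python
import numpy as np

# Toy mixture-of-experts routing: two "experts" and a softmax gate that decides
# how much each expert contributes per input. Nothing here is trained.

rng = np.random.default_rng(0)
W_expert_a = rng.normal(size=(4, 1))   # pretend "task A" specialist parameters
W_expert_b = rng.normal(size=(4, 1))   # pretend "task B" specialist parameters
W_gate = rng.normal(size=(4, 2))       # gate produces one score per expert

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

x = rng.normal(size=(3, 4))                                 # a small batch of inputs
gate_weights = softmax(x @ W_gate)                          # (batch, n_experts), rows sum to 1
expert_out = np.stack([np.tanh(x @ W_expert_a),
                       np.tanh(x @ W_expert_b)], axis=1)    # (batch, 2, 1)
mixed = (gate_weights[..., None] * expert_out).sum(axis=1)  # weighted combination

print("gate weights per input:\n", gate_weights.round(2))
print("mixed output shape:", mixed.shape)
```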

I think this only holds if fine tunes are composable [...] you probably can't take a million independently-fine-tuned models and merge them [...]

 

The purpose of a fine-tune is to "internalize" some knowledge - either because it is important to have implicit knowledge of it, or because you want to develop a skill.

Although you may have a million instances executing tasks, the knowledge you want to internalize is likely much more sparse.  For example, if an instance is tasked with exploring a portion of a search space, and it doesn't find a solution in that portion, it can just summarize its findings in a few words.  There might not even be a reason to internalize this summary - it might be merged with other summaries for a more global view of the search landscape.

So I don't see the need for millions of fine-tunes.  It seems more likely that you'd have periodic fine-tunes to internalize recent progress - maybe once an hour.

The main point is that the single periodic fine-tune can be copied to all instances.  This ability to copy the fine-tune is the main advantage of instances being identical clones.

Thank you, I missed it while looking for prior art.

Assuming we require a performance of 40 tokens/s, the training cluster can run  concurrent instances of the resulting 70B model

Nit: you mixed up 30 and 40 here (should both be 30 or both be 40).

I will assume that the above ratios hold for an AGI level model.

If you train a model with 10x as many parameters, but use the same training data, then it will cost 10x as much to train and 10x as much to operate, so the ratios will hold.
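This can be made concrete with the standard rough FLOP approximations (≈6ND to train a model with N parameters on D tokens, ≈2N per generated token at inference); under those assumptions the train-to-inference ratio depends only on D, so scaling N alone leaves it unchanged:

```python
# Rough check: training ≈ 6*N*D FLOPs, inference ≈ 2*N FLOPs per token, so the
# ratio of training compute to per-token inference compute is ≈ 3*D for any N.

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def inference_flops_per_token(n_params):
    return 2 * n_params

D = 1.5e13  # Llama3-scale token count, held fixed in this thought experiment
for N in (7e10, 7e11):  # a 70B model and a hypothetical 10x larger one
    ratio = train_flops(N, D) / inference_flops_per_token(N)
    print(f"N = {N:.0e}: train / inference-per-token = {ratio:.1e} (= 3*D)")
```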

In practice, I believe it is universal to use more training data when training larger models? Implying that the ratio would actually increase (which further supports your thesis).

On the other hand, the world already contains over 8 billion human intelligences. So I think you are assuming that a few million AGIs, possibly running at several times human speed (and able to work 24/7, exchange information electronically, etc.), will be able to significantly "outcompete" (in some fashion) 8 billion humans? This seems worth further exploration / justification.

Having 1.6 million identical twins seems like a pretty huge advantage though.

Can you elaborate? This might be true but I don't think it's self-evidently obvious.

In fact it could in some ways be a disadvantage; as Cole Wyeth notes in a separate top-level comment, "There are probably substantial gains from diversity among humans". 1.6 million identical twins might all share certain weaknesses or blind spots.

The main advantage is that you can immediately distribute fine-tunes to all of the copies.  This is much higher bandwidth compared to our own low-bandwidth/high-effort knowledge dissemination methods.

The monolithic aspect may potentially be a disadvantage, but there are a couple of mitigations:

  • AGIs are by definition generalists
  • you can segment the population into specialists (see also this comment about MoE)

On the other hand, the world already contains over 8 billion human intelligences. So I think you are assuming that a few million AGIs, possibly running at several times human speed (and able to work 24/7, exchange information electronically, etc.), will be able to significantly "outcompete" (in some fashion) 8 billion humans? This seems worth further exploration / justification.

 

Good point, but a couple of thoughts:

  • the operational definition of AGI referred to in the article is significantly stronger than the average human
  • the humans are poorly organized
  • the 8 billion humans are supporting a civilization, while the AGIs can focus on AI research and self-improvement

All of this is plausible, but I'd encourage you to go through the exercise of working out these ideas in more detail. It'd be interesting reading and you might encounter some surprises / discover some things along the way.

Note, for example, that the AGIs would be unlikely to focus on AI research and self-improvement if there were more economically valuable things for them to be doing, and if (very plausibly!) there were not more economically valuable things for them to be doing, why wouldn't a big chunk of the 8 billion humans have been working on AI research already (such that an additional 1.6 million agents working on this might not be an immediate game changer)? There might be good arguments to be made that the AGIs would make an important difference, but I think it's worth spelling them out.

This seems correct and important to me.

For a population of AGI copies, the obvious first step towards 'taking over the world' is to try to improve itself.

I expect that the described workforce could find improvements within a week of clock time, including one or more of:

  • Improvements to peak intelligence without needing to fully retrain.
  • Improvements to inference efficiency.
  • Improvements to ability to cooperate and share knowledge.

I have no reason to question your evidence but I don't agree with your arguments. It is not clear that a million LLMs coordinate better than a million humans. There are probably substantial gains from diversity among humans, so the identical weights you mentioned could cut in either direction. An additional million human-level intelligences would have a large economic impact, but not necessarily a transformative one. Also, your argument for speed superintelligence is probably flawed; since you're discussing what happens immediately after the first human-level AGI is created, gains from any speedup in thinking should already be factored in and will not lead to superintelligence in the short term.

The big question here, it seems like, is: does intelligence stack? Does a hundred thousand instances of GPT4 working together make an intelligence as smart as GPT7?

Thus far the answer seems to be no. There are some intelligence improvements from combining multiple calls in tree-of-thought-type setups, but not much. And those need carefully hand-structured algorithms.

So I think the limitation is in scaffolding techniques, not the sheer number of instances you can run. I do expect scaffolding LLMs into cognitive architectures to achieve human level fully general AGI, but how and when we get there is tricky to predict.

When we have that, I expect it to stack a lot like human organizations. They can do a lot more work at once, but they're not much smarter than a single individual because it's really hard to coordinate and stack all of that cognitive work.

Currently, groups of LLM agents can collaborate using frameworks such as ChatDev, which simulates a virtual software company using LLM agents with different roles. Though I think human organizations are still more effective for now.  For example, corporations such as Microsoft have over 200,000 employees and can work on multi-year projects. But it's conceivable that in the future there could be virtual companies composed of millions of AIs that can coordinate effectively and can work continuously at superhuman speed for long periods of time.