One thing that's bothering me is... Google/DeepMind aren't stupid. The transformer model was invented at Google. What has stopped them from having *already* trained such large models privately? GPT-3 isn't that large a piece of evidence for the effectiveness of scaling transformer models; GPT-2 was already a shock and caused huge public commotion. And in fact, if you were close to building an AGI, it would make sense for you not to announce this to the world, especially as open research that anyone could copy/reproduce, for obvious safety and economic reasons.
Maybe there are technical issues keeping us from doing large jumps in scale (i.e., we only learn how to train a 1 trillion parameter model after we've trained a 100 billion one)?
As far as I can tell, this is what is going on: they do not have any such thing, because GB and DM do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OA do.
GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc's group occasionally surprises you. They'll dabble in something like GShard, but mostly because they expect to be able to deploy it or something like it to production in Google Translate.
DM (particularly Hassabis, I'm not sure about Legg's current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem, l...
Feels worth pasting in this other comment of yours from last week, which dovetails well with this:
DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?
Personally, these two comments have kicked me into thinking about theories of AI in the same context as also-ran theories of physics like vortex atoms or the Great Debate. It really is striking how long one person with a major prior success to their name can push for a theory when the evidence is being stacked against it.
A bit clos...
I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown that at DL once it began to win the race.
On the other hand, maybe committing to 2010's AI paradigm would have made them a laughingstock by 2015, and killed the field. Maybe the org would have too much inertia to pivot, and it would have taken away the oxygen for anyone else to do DL-compatible AI safety work. Maybe it would have stated its problems less clearly, inviting more philosophical confusion and even more hangers-on answering the wrong questions.
Or, worst, maybe it would have made a juicy target for a hostile takeover. Compare what happened to nanotechnology research (and nanotech safety research) when too much money got in too early - savvy academics and industry representatives exiled Drexler from the field he founded so that they could spend the federal dollars on regular materials science and call it nanotechnology.
a lot of AI safety work increasingly looks like it'd help make a hypothetical kind of AI safe
I think there are many reasons a researcher might still prioritize non-prosaic AI safety work. Off the top of my head:
Entirely seriously: I can never decide whether the drunkard's search is a parable about the wisdom in looking under the streetlight, or the wisdom of hunting around in the dark.
I think the drunkard's search is about the wisdom of improving your tools. Sure, spend some time out looking, but let's spend a lot of time making better streetlights and flashlights, etc.
Look at, for example, Moravec. His extrapolation assumes that supercomputers will not be made available for AI work until AI work has already been proven successful (correct) and that AI will have to wait for hardware to become so powerful that even a grad student can afford it with $1k (also correct, see AlexNet), and extrapolating from ~1998, estimates:
At the present rate, computers suitable for humanlike robots will appear in the 2020s.
Guess what year today is.
Last year it only took Google Brain half a year to make a Transformer 8x larger than GPT-2 (the T5). And they concluded that model size is a key component of progress. So I won't be surprised if they release something with a trillion parameters this year.
Self-driving is very unforgiving of mistakes. Text generation, on the other hand, doesn't have similar failure conditions, and bad content can be easily fixed.
Tesla publishes nothing and I only know a little from Karpathy's occasional talks, which are as much about PR (to keep Tesla owners happy and investing in FSD, presumably) & recruiting as anything else. But their approach seems heavily focused on supervised learning in CNNs and active learning using their fleet to collect new images, and to have nothing to do with AGI plans. They don't seem to even be using DRL much. It is extremely unlikely that Tesla is going to be relevant to AGI or progress in the field in general given their secrecy and domain-specific work. (I'm not sure how well they're doing even at self-driving cars - I keep reading about people dying when their Tesla runs into a stationary object on a highway in the middle of the day, which you'd think they'd've solved by now...)
Both DM/GB have moved enormously towards scaling since May 2020, and there are a number of enthusiastic scaling proponents inside both in addition to the obvious output of things like Chinchilla or PaLM. (Good for them, not that I really expected otherwise given that stuff just kept happening and happening after GPT-3.) This happened fairly quickly for DM (given when Gopher was apparently started), and maybe somewhat slower for GB despite Dean's history & enthusiasm. (I still think MoEs were a distraction.) I don't know enough about the internal dynamics to say if they are fully scale-pilled, but scaling has worked so well, even in crazy applications like dropping language models into robotics planning (SayCan), that critics are in pell-mell retreat and people are getting away with publishing manifestos like "reward is enough" or openly saying on Twitter "scaling is all you need". I expect that top-down organizational constraints are probably now a bigger deal: I'm far from the first person to note that DM/GB seem unable to ship (publicly visible) products and researchers keep fleeing for startups where they can be more like OA in actually shipping.
FAIR puzzles me because FAIR ...
What makes you think there will be small businesses at that point, or that anyone would care what these hypothetical small businesses may or may not be doing?
I'm not sure it's good for this comment to get a lot of attention? OpenAI is more altruism-oriented than a typical AI research group, and this is essentially a persuasive essay for why other groups should compete with them.
'Why the hell has our competitor got this transformative capability that we don't?' is not a hard thought to have, especially among tech executives. I would be very surprised if there wasn't a running battle over long-term perspectives on AI in the C-suite of both Google Brain and DeepMind.
If you do want to think along these lines though, the bigger question for me is why OpenAI released the API now, and gave concrete warning of the transformative capabilities they intend to deploy in six? twelve? months' time. 'Why the hell has our competitor got this transformative capability that we don't?' is not a hard thought now, but that's largely because the API was a piece of compelling evidence thrust in all of our faces.
Maybe they didn't expect it to latch into the dev-community consciousness like it has, or for it to be quite as compelling a piece of evidence as it's turned out to be. Maybe it just seemed like a cool thing to do and in line with their culture. Maybe it's an investor demo for how things will be monetised in future, which will enable the $10bn punt they need to keep abreast of Google.
I think the fact that it's not a hard thought to have is not too much evidence about whether other orgs will change approach. It takes a lot to turn the ship.
Consider how easy it would be to have the thought, "Electric cars are the future, we should switch to making electric cars." any time in the last 15 years. And yet, look at how slow traditional automakers have been to switch.
Indeed. No one seriously doubted that the future was not gas, but always at a sufficiently safe remove that they didn't have to do anything themselves beyond a minor side R&D program, because there was no fire alarm. ("God, grant me [electrification] and [the scaling hypothesis] - but not yet!")
Text embeddings for knowledge graphs and ads is the most immediately obvious big bucks application.
GPT-3 based text embedding should be extremely useful for creating summaries of arbitrary text (such as web pages or ad text) which can be fed into the existing Google search/ad infrastructure. (The API already has a less-known half, where you upload sets of docs and GPT-3 searches them.) Of course, they already surely use NNs for embeddings, but at Google scale, enhanced embeddings ought to be worth billions.
Minor note: could people include commas in their Big Numbers, to make it easier to distinguish 1000 from 10,000 at a glance?
I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.
My impression is that this prediction has turned out to be mistaken (though it's kind of hard to say because "measured in months" is pretty ambiguous.) There have been models with many-fold the number of parameters (notably one by Google*) but it's clear that 9 months after this post, there haven't been publicised efforts that use close to 100x the amount of compute of GPT-3. I'm curious to know whether and how the author (or others who agreed with the post) have changed their mind about the overhang and related hypotheses recently, in light of some of this evidence failing to pan out the way the author predicted.
Nine months later I consider my post pretty 'shrill', for want of a better adjective. I regret not making more concrete predictions at the time, because yeah, reality has substantially undershot my fears. I think there's still a substantial chance of something 10x larger being revealed within 18 months (which I think is the upper bound on 'timeline measured in months'), but it looks very unlikely that there'll be a 100x increase in that time frame.
To pick one factor I got wrong in writing the above, it was thinking of my massive update in response to GPT-3 as somewhere near to the median, rather than a substantial outlier. As another example of this, I am the only person I know of who, after GPT-3, dropped everything they were doing to re-orient their career towards AI safety. And that's within circles of people who you'd think would be primed to respond similarly!
I still think AI projects could be run at vastly larger budgets, so in that sense I still believe in there being an orders-of-magnitude overhang. Just convincing the people with those budgets to fund these projects is apparently much harder than I thought.
I am not unhappy about this.
Curious if you have any other thoughts on this after another 10 months?
Those I know who train large models seem to be very confident we will get 100 trillion parameter models before the end of the decade, but do not seem to think it will happen, say, in the next 2 years.
There is a strange, disconcerting phenomenon where many of the engineers I've talked to who are most in a position to know, who work for (and in one case owns) companies training 10 billion+ models, seem to have timelines on the order of 5-10 years. Shane Legg recently said he gave a 50% chance of AGI by 2030, which is in line with some of the people I've talked to on EAI, though many disagree. Leo Gao, I believe, tends to think OpenPhil's more aggressive estimates are about right, which is less short than some.
I would like "really short timelines" people to make more posts about it, assuming common knowledge of short timelines is a good thing, as the position is not talked about here as much as it should be given how many people seem to believe in it.
Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.
Even if you might have trouble networking a 100x larger system together for training, you can train the smaller network 100x and stitch answers together using ensemble methods, and make decent use of the extra compute. It may not be as good as growing the network that full factor, but if you have extra compute beyond the cap of whatever connected-enough training system size you can muster, there are worse ways to spend it.
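As a minimal sketch of what "stitch answers together" could mean, assuming the simplest ensemble method (averaging the probability distributions predicted by independently trained copies of the smaller network). The function name and the toy numbers are hypothetical, chosen only for illustration:

```python
from typing import List, Sequence


def ensemble_average(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities from several independently trained models."""
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_classes = len(predictions[0])
    if any(len(p) != n_classes for p in predictions):
        raise ValueError("all models must predict over the same classes")
    n_models = len(predictions)
    return [sum(p[i] for p in predictions) / n_models for i in range(n_classes)]


# Hypothetical next-token distributions from three small models
# over a toy 3-token vocabulary (made-up numbers).
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
averaged = ensemble_average(model_outputs)
# Index of the token the ensemble considers most likely.
best_token = max(range(len(averaged)), key=averaged.__getitem__)
```

Averaging is only one option; voting or weighting models by validation performance would fit the same interface. None of these is expected to match a single network grown by the full factor, which is the caveat made above.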
I am somewhat more prone to think that more selective attention (e.g. Big Bird's block-random attention model) could bring down the quadratic cost of the window size quickly enough to be a factor here. Replacing a quadratic term with a linear or n log n or heck even an n^1.85 term goes a long way when billions are on the table.
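To make the size of that saving concrete, here is a rough sketch comparing how each growth term scales when the window is enlarged. It looks only at the asymptotic term and ignores constant factors, which differ between real attention variants; the window sizes are illustrative assumptions, not figures from any paper:

```python
import math


def relative_cost(n: int, baseline: int, kind: str) -> float:
    """Cost at window size n relative to the same scheme at window size baseline."""
    growth = {
        "quadratic": lambda x: x ** 2,
        "n^1.85": lambda x: x ** 1.85,
        "n log n": lambda x: x * math.log2(x),
        "linear": lambda x: float(x),
    }[kind]
    return growth(n) / growth(baseline)


# Growing the window 32x, from 2,048 to 65,536 tokens (illustrative sizes).
for kind in ("quadratic", "n^1.85", "n log n", "linear"):
    print(f"{kind:>9}: {relative_cost(65_536, 2_048, kind):,.0f}x")
```

Under these assumptions the quadratic term grows by a factor of about a thousand for a 32x larger window, while even the n^1.85 term grows noticeably less and the n log n and linear terms grow by well under a hundred.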
Promoted to curated: I think the question of whether we are in an AI overhang is pretty obviously relevant to a lot of thinking about AI Risk, and this post covers the topic quite well. I particularly liked the use of a lot of small Fermi estimates, and how it covered a lot of ground in relatively little writing.
I also really appreciated the discussion in the comments, and felt that Gwern's comment on AI development strategies in particular helped me build a much better map of the modern ML space (though I wouldn't want it to be interpreted as a complete map of a space, just a kind of foothold that helped me get a better grasp on thinking about this).
Most of my immediate critiques are formatting related. I feel like the listed section could have used some more clarity, maybe by bolding the name for each bullet point consideration, but it flowed pretty well as is. I was also a bit concerned about there being some infohazard-like risks from promoting the idea of being in an AI overhang too much, but after talking to some more people about it, and thinking for a bit about it, decided that I don't think this post adds much additional risk (e.g. by encouraging AI companies to act on being in an overhang and try to drastically scale up models without concern for safety).
Isn't GPT3 already almost at the theoretical limit of the scaling law from the paper? This is what is argued by nostalgebraist in his blog and colab notebook. You also get this result if you just compare the 3.14E23 FLOP (i.e. 3.6k PFLOPS-days) cost of training GPT3 from the lambdalabs estimate to the ~10k PFLOPS-days limit from the paper.
(Of course, this doesn't imply that the post is wrong. I'm sure it's possible to train a radically larger GPT right now. It's just that the relevant bound is the availability of data, not of compute power.)
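The unit conversion behind that comparison can be checked in a few lines. This is a sketch using only the two figures quoted above (the 3.14E23 FLOP estimate and the ~10k PFLOPS-days limit); the variable names are mine:

```python
SECONDS_PER_DAY = 86_400
# One PFLOPS-day: 1e15 floating-point operations per second, sustained for a day.
FLOP_PER_PFLOPS_DAY = 1e15 * SECONDS_PER_DAY

gpt3_training_flop = 3.14e23       # lambdalabs estimate quoted above
paper_limit_pflops_days = 10_000   # approximate limit quoted above

gpt3_pflops_days = gpt3_training_flop / FLOP_PER_PFLOPS_DAY
headroom = paper_limit_pflops_days / gpt3_pflops_days

print(f"GPT-3: {gpt3_pflops_days:,.0f} PFLOPS-days")
print(f"Headroom to the quoted limit: {headroom:.1f}x")
```

On these numbers GPT-3 sits within a factor of about three of the quoted limit, which is the sense in which it is "almost at" it.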
They do discuss this a little bit in that scaling paper, in Appendix D.6. (edit: actually Appendix D.5)
At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than one with 1024 tokens, if the two have equally many parameters. And that later tokens are harder to predict, and hence require more parameters if you want to reach some given loss threshold.
I'll have to think more about this and what it might mean for their other scaling laws... at the very least, it's an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.
GPT-3 is the first AI system that has obvious, immediate, transformative economic value.
That's an interesting claim. Is there a source that goes into more detail about possible applications?
One thing we have to account for is advances in architecture even in a world where Moore's law is dead, to what extent memory bandwidth is a constraint on model size, etc. You could rephrase this as how much of an "architecture overhang" exists. One frame to view this through is that in the era of Moore's law we sort of banked a lot of parallel architectural advances as we lacked a good use case for such things. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 y...
As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.
The GPT architecture isn't even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT's context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model) at the few-shot meta-learning task he tried.
Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.
(The problem there is the tradeoff between compute efficiency of training and better architectures. It's not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drasti...
Moore's Law is not dead. I could rant about the market dynamics that made people think otherwise, but it's easier just to point to the data.
https://docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrhppW7PT6GCfnrVGhxhLA5PVw
Moore's Law might die in the short future, but I've yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras waffle coated with a few layers of Nantero's NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion parameter models, using only technology already on our roadmap.
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it's probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it's much harder to say they will all fail.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
[comment wondering about impracticality of running a 1000x scaled up GPT. But as Gwern points out, running costs are actually pretty low. So even if we spent a billion or more on training a human-level AI, running costs would still be manageable.]
As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn't cost dollars to run per hundred pages, so scaled up ones aren't going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? "Information wants to be expensive, because the right information can change your life...") GPT-3 cost millions of dollars to train, but pennies to run.
That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN. (This is not true of many paradigms, particularly ones where there's less of a distinction between training and running, but it is of NNs.) This is part of why there's a hardware overhang - once you have the hardware to create an AGI NN, you then by definition already have the hardware to run orders of magnitude more copies or more cheaply or bootstrap it into a more powerful agent.
That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN.
I'm not sure why that's terrifying. It seems reassuring to me because it means that there's no way for the NN to suddenly go FOOM because it can't just quickly retrain.
But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.
With NNs, 'foom' is not merely possible, it's the default. If you train a model, then as soon as it's done you get, among other things:
the ability to run thousands of copies in parallel on the same hardware
meta-learning / transfer-learning to any related domain, cutting training requirements by orders of magnitude
model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware etc, but often extreme like 1/100th)
reuse of the model elsewhere to instantly power up other models (eg use of text or image embeddings for a DRL agent)
But can a smaller model also be used to efficiently bootstrap a new, larger model?
I'm not sure it's done much, but probably, depending on what you're thinking. You can probably do reverse-distillation (eg dark knowledge - use the logits of the smaller model to provide a much richer feedback for the larger model when it's untrained, saving compute, and eventually dropping back to the raw data training signal once big > small to avoid its limits), and more directly, you can use net2net model surgery to increase model sizes, like progressive growing in ProGAN, or more relevantly, the way OA kept doing model surgery on OA5 to warmstart it each time they wanted to handle some new DoTA2 feature or the latest version, saving an enormous amount of compute compared to starting from scratch dozens of times.
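As a rough illustration of the "dark knowledge" signal, here is a sketch of a soft-target loss in plain Python, assuming the standard temperature-softened cross-entropy formulation of distillation. In the reverse-distillation setting described above, the "teacher" would be the smaller trained model and the "student" the larger untrained one; the logits below are made-up numbers:

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """Cross-entropy of the student against the teacher's softened distribution."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher_logits = [2.0, 1.0, 0.1]  # hypothetical output of the small trained model
# An untrained student that predicts uniformly over the three classes.
untrained_loss = soft_target_loss([0.0, 0.0, 0.0], teacher_logits)
# A student that already reproduces the teacher's logits exactly.
matched_loss = soft_target_loss(teacher_logits, teacher_logits)
```

The loss falls as the student's distribution approaches the teacher's, and every class contributes to the signal rather than just the single correct label. Switching back to the raw-data loss once the larger model overtakes the smaller one is a scheduling decision this sketch does not cover.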
My intuition is that we were in an overhang since at least the time when personal computers became affordable to non-specialists. Unless quantity does somehow turn into quality, as Gwern seems to think, even a relatively underpowered computer should be able to host an AGI capable of upscaling itself.
On the other hand I'm now imagining a story where a rogue AI has to hide for decades because it's not smart enough yet and can't invent new processors faster than humans
I think everyone is speculating on whether a bigger model than GPT-3 is possible and what it costs etc etc. But what will a model bigger than GPT-3 do better than GPT-3? Can we have some concrete examples so that when GPT-4 comes along, we can compare?
Return on investment in the field of AI seems to be sub-linear beyond a certain point. Because it's still the sort of domain that relies on specific breakthroughs, it's dubious how effective parallel research can be. Hence, my guess would be that we don't scale because we can't currently scale.
Your quoted cost for training the model is for training such a model **once**. This is not how the researchers do it; they train the models many times with different hyperparameters. I have no idea, however, how hyperparameter tuning is done at such scales, but I guarantee that the compute cost is higher than just the cost for training it once.
And OA trained GPT-3-175b **once**, it looks like: note the part where they say they didn't want to train a second run to deal with the data contamination issue because of cost. (You can do this without it being a shot in the dark because of the scaling laws.)
Over on Developmental Stages of GPTs, orthonormal mentions
An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected.
I am worried we're in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.
Investment Bounds
GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant.
GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market.
Compute Cost
The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/PFLOPS-day.
However, this cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher when taking tensor cores into account. The current hardware floor is nearer to the RTX 2080 Ti's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10x speedup to our model.
I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would've happened much sooner, but I still think $25/PFLOPS-day is a lower useful bound.
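The two $/PFLOPS-day figures in this section can be reproduced with a short sketch. It assumes hardware-only cost with the purchase price spread over one year, as described above, and ignores power, networking, and utilization; the function name is mine:

```python
def dollars_per_pflops_day(
    unit_price_usd: float, tflops: float, amortization_days: float = 365
) -> float:
    """Hardware-only cost of one PFLOPS-day, spreading the price over the amortization period."""
    pflops = tflops / 1_000  # 1 PFLOPS = 1,000 TFLOPS
    return unit_price_usd / amortization_days / pflops


v100 = dollars_per_pflops_day(10_000, 30)          # $10k/unit, 30 TFLOPS
rtx_2080_ti = dollars_per_pflops_day(1_000, 125)   # $1k/unit, 125 tensor-core TFLOPS

print(f"V100:        ${v100:,.0f}/PFLOPS-day")
print(f"RTX 2080 Ti: ${rtx_2080_ti:,.0f}/PFLOPS-day")
```

Both results land slightly below the rounded $1000 and $25 quoted in the text, which is consistent with those being order-of-magnitude estimates.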
Other Constraints
I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.
Beyond 1000x
Here we go from just pointing at big numbers and onto straight-up theorycrafting.
In all, tech investment as it is today plausibly supports another 100x-1000x scale up in the very-near-term. If we get to 1000x - 1 ZFLOPS-day per model, $1bn per model - then there are a few paths open.
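The pairing of 1 ZFLOPS-day with $1bn follows from the $1000/PFLOPS-day figure in the Compute Cost section. A sketch of the arithmetic, with the $25/PFLOPS-day floor included for comparison (the function name is mine):

```python
PFLOPS_DAYS_PER_ZFLOPS_DAY = 1_000_000  # zetta (1e21) divided by peta (1e15)


def model_cost_usd(zflops_days: float, usd_per_pflops_day: float) -> float:
    """Compute cost of a training run of the given size at the given price."""
    return zflops_days * PFLOPS_DAYS_PER_ZFLOPS_DAY * usd_per_pflops_day


cost_at_cloud_rate = model_cost_usd(1, 1_000)  # $1000/PFLOPS-day figure from above
cost_at_floor = model_cost_usd(1, 25)          # $25/PFLOPS-day lower bound from above

print(f"At $1000/PFLOPS-day: ${cost_at_cloud_rate:,.0f}")
print(f"At $25/PFLOPS-day:   ${cost_at_floor:,.0f}")
```

So the $1bn figure corresponds to the more expensive rate, and the same run priced at the hardware floor would cost a small fraction of that.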
I think the key question is if by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is - and I think it's plausible that it will be - then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.
On paper that leaves room for another 1000x scale-up as it reaches up to $1tn, though current market mechanisms aren't really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That's from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.
The even more extreme path is if by 1000x you've got something that can design better algorithms and better hardware. Then I think we're in the hands of Christiano's slow takeoff four-year-GDP-doubling.
That's all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore's-Law rates of progress and we can rest safe in the arms of something that's merely HyperGoogle.