Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Over on Developmental Stages of GPTs, orthonormal mentions

it at least reduces the chance of a hardware overhang.

An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected.

I am worried we're in an overhang right now. I think we currently have the ability to build systems orders of magnitude more powerful than the ones we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.

Investment Bounds

GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant.

GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour.

Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.

A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market.
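The budget arithmetic above is easy to check. A quick Python sketch, using the rough figures quoted in this post (the "very roughly $100bn" is a round-down of 3 × $40bn):

```python
# Back-of-the-envelope: how big is a $1bn-scale training run relative to
# big-tech budgets? Figures are the rough ones quoted in the post.
rd_per_company = 20e9         # ~$20bn/year R&D each (Google, Amazon, Microsoft)
capex_per_company = 20e9      # ~$20bn/year capital expenditure each
total_budget = 3 * (rd_per_company + capex_per_company)  # ~$120bn/year

gpt3_cost = 5e6 + 10e6        # ~$5m compute + ~$10m labour
scaled_run = 100 * gpt3_cost  # a 100x scale-up of GPT-3's total cost

print(f"combined annual budget: ${total_budget / 1e9:.0f}bn")
print(f"100x GPT-3 run: ${scaled_run / 1e9:.1f}bn, "
      f"{scaled_run / total_budget:.1%} of that budget")
```

Even a 100x run is about 1% of one year's combined spending, which is the sense in which the costs are insignificant at megacorp scale.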

Compute Cost

The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/PFLOPS-day.

However, this cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher once tensor cores are taken into account. The current hardware floor is nearer to the RTX 2080 Ti's $1k/unit for 125 tensor-core TFLOPS, which gives you $25/PFLOPS-day. This roughly aligns with AI Impacts' current estimates, and offers another >10x cost reduction for our model.

I strongly suspect other bottlenecks stop you from hitting that kind of efficiency - or GPT-3 would've happened much sooner - but I still think $25/PFLOPS-day is a useful lower bound.
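For concreteness, here is the amortization arithmetic behind both estimates as a small Python sketch, with one-year amortization and the unit prices and sustained TFLOPS quoted above:

```python
# $/PFLOPS-day from unit price and sustained TFLOPS, amortized over one year.
def dollars_per_pflops_day(unit_price, tflops, amort_days=365):
    pflops = tflops / 1000
    return unit_price / amort_days / pflops

v100 = dollars_per_pflops_day(10_000, 30)   # V100, ignoring tensor cores
rtx = dollars_per_pflops_day(1_000, 125)    # RTX 2080 Ti, tensor-core TFLOPS

print(f"V100:        ~${v100:.0f}/PFLOPS-day")  # order $1000
print(f"RTX 2080 Ti: ~${rtx:.0f}/PFLOPS-day")   # order $25
```

The exact outputs land at roughly $900 and $22; the $1000 and $25 in the text are the same numbers rounded to one significant figure.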

Other Constraints

I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.

  • Scaling law breakdown. The GPT series' scaling is expected to break down around 10k PFLOPS-days (§6.3), which is a long way short of the amount of cash on the table.
    • This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.
  • Sequence length. GPT-3 uses 2048 tokens at a time, and that's with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length sequences is not very likely.
  • Data availability. From the same paper as the previous point, dataset size rises with the square-root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data.
    • It’s hard to find a good estimate on total-words-ever-written, but our library of 130m books alone would exceed 10tn words. Considering books are a small fraction of our textual output nowadays, it shouldn't be difficult to gather sufficient data into one spot once you've decided it's a useful thing. So I'd be surprised if this was binding.
  • Bandwidth and latency. Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.
    • I don't know enough about distributed training to say whether this is a very sensible constraint or a very dumb one. I think it has a chance of being a serious problem, but I think it's also the kind of thing you can design algorithms around. Validating such algorithms might take more than a timescale of months however.
  • Hardware availability. From the estimates above there are about 500 GPU-years in GPT-3, or - based on a one-year training window - $5m worth of V100s at $10k/piece. This is about 1% of NVIDIA's quarterly datacenter sales. A 100x scale-up by multiple companies could saturate this supply.
    • This constraint can obviously be loosened by increasing production, but it'd be hard to on a timescale of months.
  • Commoditization. If many companies go for huge NLP models, the profit each company can extract is driven towards zero. Unlike with other capex-heavy research - like pharma - there's no IP protection for trained models. If you expect profit to be marginal, you're less likely to drop $1bn on your own training program.
    • I am skeptical of this being an important factor while there are lots of legacy, human-driven systems to replace. Replacing those systems should be more than enough incentive to fund many companies’ research programs. Longer term, the effects of commoditization might become more important.
  • Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh per 100 pages of output, which works out to 500 pages per dollar, eyeballing hardware cost as 5x electricity. Scale up 1000x and you're at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.
    • I'm skeptical of this being a binding constraint. $2/page is still very cheap.
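Two of the constraint estimates above are easy to reproduce. A Python sketch, using the ~300bn training tokens GPT-3 was trained on, and an assumed ~$0.10/kWh electricity price (my assumption, not a figure from the post):

```python
import math

# Data constraint: dataset size scales ~ sqrt(compute), and GPT-3 used
# ~300bn training tokens, so a 1000x model wants sqrt(1000) times more.
gpt3_tokens = 300e9
tokens_1000x = gpt3_tokens * math.sqrt(1000)    # ~9.5tn, i.e. "10 trillion"

# Inference constraint: 0.4 kWh per 100 pages; hardware cost eyeballed at
# 5x electricity. The $0.10/kWh electricity price is an assumption.
kwh_per_page = 0.4 / 100
cost_per_page = kwh_per_page * 0.10 * 5         # ~$0.002 -> 500 pages/dollar
cost_per_page_1000x = 1000 * cost_per_page      # ~$2/page

print(f"tokens wanted at 1000x: {tokens_1000x / 1e12:.1f} trillion")
print(f"pages per dollar today: {1 / cost_per_page:.0f}")
print(f"cost per page at 1000x: ${cost_per_page_1000x:.2f}")
```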

Beyond 1000x

Here we go from just pointing at big numbers and onto straight-up theorycrafting.

In all, tech investment as it is today plausibly supports another 100x-1000x scale up in the very-near-term. If we get to 1000x - 1 ZFLOPS-day per model, $1bn per model - then there are a few paths open.

I think the key question is if by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is - and I think it's plausible that it will be - then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.

On paper that leaves room for another 1000x scale-up as it reaches up to $1tn, though current market mechanisms aren't really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.

That's from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.

The even more extreme path is if by 1000x you've got something that can design better algorithms and better hardware. Then I think we're in the hands of Christiano's slow takeoff four-year-GDP-doubling.

That's all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore's-Law rates of progress and we can rest safe in the arms of something that's merely HyperGoogle.


One thing that's bothering me is... Google/DeepMind aren't stupid. The transformer model was invented at Google. What has stopped them from having *already* trained such large models privately? GPT-3 isn't that strong additional evidence for the effectiveness of scaling transformer models; GPT-2 was already a shock and caused huge public commotion. And in fact, if you were close to building an AGI, it would make sense for you not to announce this to the world, especially as open research that anyone could copy/reproduce, for obvious safety and economic reasons.

Maybe there are technical issues keeping us from doing large jumps in scale (i.e. , we only learn how to train a 1 trillion parameter model after we've trained a 100 billion one)?


As far as I can tell, this is what is going on: they do not have any such thing, because GB and DM do not believe in the scaling hypothesis the way that Sutskever, Amodei and others at OA do.

GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc's group occasionally surprises you. They'll dabble in something like GShard, but mostly because they expect to be likely to be able to deploy it or something like it to production in Google Translate.

DM (particularly Hassabis, I'm not sure about Legg's current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem […]

Feels worth pasting in this other comment of yours from last week, which dovetails well with this:

DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out elsewhere. Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?

Personally, these two comments have kicked me into thinking about theories of AI in the same context as also-ran theories of physics like vortex atoms or the Great Debate. It really is striking how long one person with a major prior success to their name can push for a theory when the evidence is being stacked against it.

A bit clos…

I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.

Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown that at DL once it began to win the race.

On the other hand, maybe committing to 2010's AI paradigm would have made them a laughingstock by 2015, and killed the field. Maybe the org would have too much inertia to pivot, and it would have taken away the oxygen for anyone else to do DL-compatible AI safety work. Maybe it would have stated its problems less clearly, inviting more philosophical confusion and even more hangers-on answering the wrong questions.

Or, worst, maybe it would have made a juicy target for a hostile takeover. Compare what happened to nanotechnology research (and nanotech safety research) when too much money got in too early - savvy academics and industry representatives exiled Drexler from the field he founded so that they could spend the federal dollars on regular materials science and call it nanotechnology.

One thing they could have achieved was dataset and leaderboard creation (MSCOCO, GLUE, and ImageNet, for example). These have tended to focus and help research, and persist in usefulness for some time, as long as they are chosen wisely. Predicting and extrapolating human preferences is a task which is part of nearly every AI Alignment strategy, yet we have few datasets for it; the only ones I found are […]. So this hypothetical ML-engineering approach to alignment might have achieved some simple wins like that. EDIT: Something like this was just released: Aligning AI With Shared Human Values.

a lot of AI safety work increasingly looks like it'd help make a hypothetical kind of AI safe

I think there are many reasons a researcher might still prioritize non-prosaic AI safety work. Off the top of my head:

  • You think prosaic AI safety is so doomed that you're optimizing for worlds in which AGI takes a long time, even if you think it's probably soon.
  • There's a skillset gap or other such cost, such that reorienting would decrease your productivity by some factor (say, .6) for an extended period of time. The switch only becomes worth it in expectation once you've become sufficiently confident AGI will be prosaic.
  • Disagreement about prosaic AGI probabilities. 
  • Lack of clear opportunities to contribute to prosaic AGI safety / shovel-ready projects (the severity of this depends on how agentic the researcher is).

Entirely seriously: I can never decide whether the drunkard's search is a parable about the wisdom in looking under the streetlight, or the wisdom of hunting around in the dark.

I think the drunkard's search is about the wisdom of improving your tools. Sure, spend some time out looking, but let's spend a lot of time making better streetlights and flashlights, etc.

In the Gwern quote, what does "Even the dates are more or less correct!" refer to? Which dates were predicted for what?

Look at, for example, Moravec. His extrapolation assumes that supercomputers will not be made available for AI work until AI work has already been proven successful (correct), and that AI will have to wait for hardware to become so powerful that even a grad student can afford it for $1k (also correct, see AlexNet), and extrapolating from ~1998, estimates:

At the present rate, computers suitable for humanlike robots will appear in the 2020s.

Guess what year today is.

Last year, it took Google Brain only half a year to make a Transformer 8x larger than GPT-2 (the T5). And they concluded that model size is a key component of progress. So I won't be surprised if they release something with a trillion parameters this year.

Andy Jones:
Thinking about this a bit more, do you have any insight on Tesla? I can believe that it's outside DM and GB's culture to run with the scaling hypothesis, but watching Karpathy's presentations (which I think is the only public information on their AI program?) I get the sense they're well beyond $10m/run by now. Considering that self-driving is still not there - and once upon a time I'd have expected driving to be easier than Harry Potter parodies - it suggests that language is special in some way. Information density? Rich, diff'able reward signal?

Self-driving is very unforgiving of mistakes. Text generation, on the other hand, doesn't have similar failure conditions, and bad content can easily be fixed.

Tesla publishes nothing and I only know a little from Karpathy's occasional talks, which are as much about PR (to keep Tesla owners happy and investing in FSD, presumably) & recruiting as anything else. But their approach seems heavily focused on supervised learning in CNNs and active learning using their fleet to collect new images, and to have nothing to do with AGI plans. They don't seem to even be using DRL much. It is extremely unlikely that Tesla is going to be relevant to AGI or progress in the field in general given their secrecy and domain-specific work. (I'm not sure how well they're doing even at self-driving cars - I keep reading about people dying when their Tesla runs into a stationary object on a highway in the middle of the day, which you'd think they'd've solved by now...)

Daniel Kokotajlo:
I'm pretty sure I remember hearing they use unsupervised learning to form their 3D model of their local environment, and that's the most important part, no?
Matthew Wilson:
Curious if you have updated on this at all, given AI Day announcements?
They still running into stationary objects? The hardware is cool, sure, but unclear how much good it's doing them...
I believe that is referring to the baseline driver assistance system, and not the advanced "full self driving" one (that has to be paid for separately). Though it's hard to tell that level of detail from a mainstream media report.

hey man wanna watch this language model drive my car

I just realized with a start that this is _absolutely_ going to happen. We are going to, in the not-too-distant future, see a GPT-x (or similar) ported to a Tesla and driving it. It frustrates me that there are not enough people IRL I can excitedly talk to about how big of a deal this is.
Can you explain why GPT-x would be well-suited to that modality?
Presumably, because with a big-enough X, we can generate text descriptions of scenes from cameras and feed them in to get driving output more easily than the seemingly fairly slow process to directly train a self-driving system that is safe. And if GPT-X is effectively magic, that's enough. I'm not sure I buy it, though. I think that once people agree that scaling just works, we'll end up scaling the NNs used for self driving instead, and just feed them much more training data.
There might be some architectures that are more scalable than others. As far as I understand, the present models for self-driving have, for the most part, a lot of hardcoded elements. That might make them more complicated to scale.
Agreed, but I suspect that replacing those hard-coded elements will get easier over time as well.
Andrej Karpathy talks about exactly that in a recent presentation:
Daniel Kokotajlo:
My hypothesis: Language models work by being huge. Tesla can't use huge models because they are limited by the size of the computers on their cars. They could make bigger computers, but then that would cost too much per car and drain the battery too much (e.g. a 10x bigger computer would cut dozens of miles off the range and also add $9,000 to the car price, at least.)
[EDIT: oops, I thought you were talking about the direct power consumption of the computation, not the extra hardware weight. My bad.] It's not about the power consumption. The air conditioner in your car uses 3 kW, and GPT-3 takes 0.4 kWh for 100 pages of output - thus a dedicated computer on AC power could produce 700 pages per hour, going substantially faster than AI Dungeon (literally and metaphorically). So a model as large as GPT-3 could run on the electricity of a car. The hardware would be more expensive, of course. But that's different.
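For what it's worth, a quick sketch of that arithmetic, using the figures from the comment above (the ~700 pages/hour quoted is evidently a round-down):

```python
# Pages per hour on a 3 kW power budget, at 0.4 kWh per 100 pages.
ac_power_kw = 3.0          # car air conditioner draw, from the comment
kwh_per_100_pages = 0.4    # GPT-3 inference cost, from the paper
pages_per_hour = ac_power_kw / kwh_per_100_pages * 100
print(f"{pages_per_hour:.0f} pages/hour")  # ~750
```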
Daniel Kokotajlo:
Huh, thanks -- I hadn't run the numbers myself, so this is a good wake-up call for me. I was going off what Elon said. (He said multiple times that power efficiency was an important design constraint on their hardware because otherwise it would reduce the range of the car too much.) So now I'm just confused. Maybe Elon had the hardware weight in mind, but still... Maybe the real problem is just that it would add too much to the price of the car?
Yes. GPU/ASICs in a car will have to sit idle almost all the time, so the costs of running a big model on it will be much higher than in the cloud.
Re hardware limit: flagging the implicit assumption here that network speeds are spotty/unreliable enough that you can't or are unwilling to safely do hybrid on-device/cloud processing for the important parts of self-driving cars. (FWIW I think the assumption is probably correct).
Tomás B.:
After 2 years, any updates on your opinion of DM, GB and FAIR's scaling stance? Would you consider any of them fully "scale-pilled"? 

Both DM/GB have moved enormously towards scaling since May 2020, and there are a number of enthusiastic scaling proponents inside both in addition to the obvious output of things like Chinchilla or PaLM. (Good for them, not that I really expected otherwise given that stuff just kept happening and happening after GPT-3.) This happened fairly quickly for DM (given when Gopher was apparently started), and maybe somewhat slower for GB despite Dean's history & enthusiasm. (I still think MoEs were a distraction.) I don't know enough about the internal dynamics to say if they are fully scale-pilled, but scaling has worked so well, even in crazy applications like dropping language models into robotics planning (SayCan), that critics are in pell-mell retreat and people are getting away with publishing manifestos like "reward is enough" or openly saying on Twitter "scaling is all you need". I expect that top-down organizational constraints are probably now a bigger deal: I'm far from the first person to note that DM/GB seem unable to ship (publicly visible) products and researchers keep fleeing for startups where they can be more like OA in actually shipping.

FAIR puzzles me because FAIR…

If you extrapolated those straight lines further, doesn't it mean that even small businesses will be able to afford training their own quadrillion-parameter-models just a few years after Google?

What makes you think there will be small businesses at that point, or that anyone would care what these hypothetical small businesses may or may not be doing?

So the God of Straight Lines dissolves into a puff of smoke at just the right time to bring about AI doom? Seems awfully convenient.
Thanks for this, I'll be sharing it on /r/slatestarcodex and Hacker News (rationalist discords too if it comes up).

I'm not sure it's good for this comment to get a lot of attention?  OpenAI is more altruism-oriented than a typical AI research group, and this is essentially a persuasive essay for why other groups should compete with them.

'Why the hell has our competitor got this transformative capability that we don't?' is not a hard thought to have, especially among tech executives. I would be very surprised if there wasn't a running battle over long-term perspectives on AI in the C-suite of both Google Brain and DeepMind.

If you do want to think along these lines though, the bigger question for me is why OpenAI released the API now, and gave concrete warning of the transformative capabilities they intend to deploy in six? twelve? months' time. 'Why the hell has our competitor got this transformative capability that we don't?' is not a hard thought now, but that's largely because the API was a piece of compelling evidence thrust in all of our faces.

Maybe they didn't expect it to latch into the dev-community consciousness like it has, or for it to be quite as compelling a piece of evidence as it's turned out to be. Maybe it just seemed like a cool thing to do and in-line with their culture. Maybe it's an investor demo for how things will be monetised in future, which will enable the $10bn punt they need to keep abreast of Google.

I think the fact that's it's not a hard thought to have is not too much evidence about whether other orgs will change approach. It takes a lot to turn the ship.

Consider how easy it would be to have the thought, "Electric cars are the future, we should switch to making electric cars." any time in the last 15 years. And yet, look at how slow traditional automakers have been to switch.

Indeed. No one seriously doubted that gas was on its way out, but always at a sufficiently safe remove that they didn't have to do anything themselves beyond a minor side R&D program, because there was no fire alarm. ("God, grant me [electrification] and [the scaling hypothesis] - but not yet!")

It has already got some spread. Michael Nielsen shared it on Twitter (126 likes and 29 RTs as at writing).
Ilverin the Stupid and Offensive:
Is it more than 30% likely that in the short term (say 5 years), Google isn't wrong? If you applied massive scale to the AI algorithms of 1997, you would get better performance, but would your result be economically useful? Is it possible we're in a similar situation today where the real-world applications of AI are already good-enough and additional performance is less useful than the money spent on extra compute? (self-driving cars is perhaps the closest example: clearly it would be economically valuable, but what if the compute to train it would cost 20 billion US dollars? Your competitors will catch up eventually, could you make enough profit in the interim to pay for that compute?)
Andy Jones:
I'd say it's at least 30% likely that's the case! But if you believe that, you'd be pants-on-head loony not to drop a billion on the 'residual' 70% chance that you'll be first to market on a world-changing trillion-dollar technology. VCs would sacrifice their firstborn for that kind of deal.
Do we know the size of the net that does translation and speech-to-text for Google?
Ricardo Meneghin:
I'm not sure what model is used in production, but the SOTA reached 600 billion parameters recently.
This answer likely betrays my lack of imagination, but I'm not sure what Google would use GPT-3 for. It's probably much more expensive than whatever gmail uses to predict text, and the additional accuracy might not provide much additional value. Maybe they could sell it as a service, as part of GCP? I'm not sure how many people inside Google have the ability to sign $15M checks, you would need at least one of them to believe in a large market, and I'm personally not sure there's a large enough market for GPT-3 for it to be worth Google's time. This is all to say, I don't think you should draw the conclusion that Google is either stupid or hiding something. They're likely focusing on finding better architectures, it seems a little early to focus on scaling up existing ones.

Text embeddings for knowledge graphs and ads is the most immediately obvious big bucks application.

Daniel Kokotajlo:
Can you explain more?

GPT-3 based text embedding should be extremely useful for creating summaries of arbitrary text (such as, web pages or ad text) which can be fed into the existing Google search/ad infrastructure. (The API already has a less-known half, where you upload sets of docs and GPT-3 searches them.) Of course, they already surely use NNs for embeddings, but at Google scale, enhanced embeddings ought to be worth billions.

Ankesh Anand:
Worth noting that they already use BERT in Search.
Ricardo Meneghin:
I think the OP and my comment suggest that scaling current models 10000x could lead to AGI or at least something close to it. If that is true, it doesn't make sense to focus on finding better architectures right now.

Minor note: could people include commas in their Big Numbers, to make it easier to distinguish 1000 from 10,000 at a glance?

Sounds like something GPT-3 would say...

I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.

My impression is that this prediction has turned out to be mistaken (though it's kind of hard to say because "measured in months" is pretty ambiguous). There have been models with several times the number of parameters (notably one by Google), but it's clear that 9 months after this post, there haven't been publicised efforts that use close to 100x the amount of compute of GPT-3. I'm curious to know whether and how the author (or others who agreed with the post) have changed their mind about the overhang and related hypotheses recently, in light of some of this evidence failing to pan out the way the author predicted.


Nine months later I consider my post pretty 'shrill', for want of a better adjective. I regret not making more concrete predictions at the time, because yeah, reality has substantially undershot my fears. I think there's still a substantial chance of something 10x larger being revealed within 18 months (which I think is the upper bound on 'timeline measured in months'), but it looks very unlikely that there'll be a 100x increase in that time frame.

To pick one factor I got wrong in writing the above, it was thinking of my massive update in response to GPT-3 as somewhere near to the median, rather than a substantial outlier.  As another example of this, I am the only person I know of who, after GPT-3, dropped everything they were doing to re-orient their career towards AI safety.  And that's within circles of people who you'd think would be primed to respond similarly!

I still think AI projects could be run at vastly larger budgets, so in that sense I still believe in there being an orders-of-magnitude overhang. Just convincing the people with those budgets to fund these projects is apparently much harder than I thought.

I am not unhappy about this.

Curious if you have any other thoughts on this after another 10 months?

Those I know who train large models seem to be very confident we will get 100-trillion-parameter models before the end of the decade, but do not seem to think it will happen, say, in the next 2 years.

There is a strange, disconcerting phenomenon where many of the engineers I've talked to most in a position to know, who work for (and in one case own) companies training 10-billion+ parameter models, seem to have timelines on the order of 5-10 years. Shane Legg recently said he gave a 50% chance of AGI by 2030, which is in line with some of the people I've talked to on EAI, though many disagree. Leo Gao, I believe, tends to think OpenPhil's more aggressive estimates are about right, which is less short than some.

I would like "really short timelines" people to make more posts about it, assuming common knowledge of short timelines is a good thing, as the position is not talked about here as much as it should be given how many people seem to believe in it. 

For what it's worth I settled on the Ajeya report aggressive distribution as a reasonable prior after taking a quick skim of the report and then eyeballing the various distributions to see which one felt the most right to me -- not a super rigorous process. The best guess timeline feels definitely too slow to me. The biggest reason why my timeline estimate isn't shorter is essentially correction for planning fallacy.
  FWIW if the current trend continues we will first see 1e14 parameter models in 2 to 4 years from now.
I think there's still a substantial chance of something 10x larger being revealed within 18 months (which I think is the upper bound on 'timeline measured in months')

So did that happen?
Tomás B.:
I suppose the new scaling laws render this sort of thinking obsolete. 
Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.

Even if you might have trouble networking a 100x larger system together for training, you can train the smaller network 100x and stitch answers together using ensemble methods, and make decent use of the extra compute. It may not be as good as growing the network that full factor, but if you have extra compute beyond the cap of whatever connected-enough training system size you can muster, there are worse ways to spend it.

I am somewhat more prone to think that more selective attention (e.g. Big Bird's block-random attention model) could bring down the quadratic cost of the window size quickly enough to be a factor here. Replacing a quadratic term with a linear or n log n or heck even an n^1.85 term goes a long way when billions are on the table.
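A quick sketch of how much each complexity class saves when the context window grows, say, 16x from GPT-3's 2048 tokens (the n^1.85 exponent is the comment's hypothetical):

```python
import math

# Growth of the attention term when context goes 2048 -> 32768 tokens (16x),
# under different complexity classes. Replacing n^2 buys a lot.
n0, n1 = 2048, 32768
costs = {
    "n^2 (naive)": lambda n: n**2,
    "n^1.85":      lambda n: n**1.85,
    "n log n":     lambda n: n * math.log2(n),
    "n (linear)":  lambda n: n,
}
ratios = {name: f(n1) / f(n0) for name, f in costs.items()}
for name, r in ratios.items():
    print(f"{name:12s} cost grows {r:6.1f}x")
```

The quadratic term grows 256x for a 16x longer window, while even the modest n^1.85 improvement cuts that to ~169x and n log n to ~22x, which is the sense in which sub-quadratic attention "goes a long way".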


Promoted to curated: I think the question of whether we are in an AI overhang is pretty obviously relevant to a lot of thinking about AI Risk, and this post covers the topic quite well. I particularly liked the use of a lot of small Fermi estimates, and how it covered a lot of ground in relatively little writing.

I also really appreciated the discussion in the comments, and felt that Gwern's comment on AI development strategies in particular helped me build a much better map of the modern ML space (though I wouldn't want it to be interpreted as a complete map of the space, just a kind of foothold that helped me get a better grasp on thinking about this). 

Most of my immediate critiques are formatting-related. I feel like the list section could have used some more clarity, maybe by bolding the name of each bullet-point consideration, but it flowed pretty well as is. I was also a bit concerned that there might be some infohazard-like risk from promoting the idea of being in an AI overhang too much, but after talking to some more people about it, and thinking about it for a bit, I decided that I don't think this post adds much additional risk (e.g. by encouraging AI companies to act on being in an overhang and try to drastically scale up models without concern for safety).

4Andy Jones
Thanks for the feedback! I've cleaned up the constraints section a bit, though it's still less coherent than the first section. Out of curiosity, what was it that convinced you this isn't an infohazard-like risk?
Some mixture of:

* I think it's pretty valuable to have open conversation about being in an overhang, and I think on the margin it will make those worlds go better by improving coordination. My current sense is that the perspective presented in this post is reasonably frequent among people in ML, so marginally reducing how many people believe it would not make much of a difference, but having good writeups that summarize the arguments seems to have a better chance of creating some kind of common knowledge that allows people to coordinate better here.
* This post, more so than other posts in its reference class, emphasizes a bunch of the safety concerns, whereas I expect the next post to replace it would not do that very much.
* Curation in particular mostly sends the post out to more people who are concerned with safety. This post found a lot of traction on HN and other places, so in some sense the cat is out of the bag: if it was harmful, the curation decision won't change that very much, and it seems like it would unnecessarily hinder the people most concerned about safety if we don't curate it (since the considerations do also seem quite relevant to safety work).

Isn't GPT-3 already almost at the theoretical limit of the scaling law from the paper? This is what nostalgebraist argues in his blog and Colab notebook. You also get this result if you just compare the 3.14e23 FLOP (i.e. ~3.6k PFLOPS-days) cost of training GPT-3 from the Lambda Labs estimate to the ~10k PFLOPS-days limit from the paper.

(Of course, this doesn't imply that the post is wrong. I'm sure it's possible to train a radically larger GPT right now. It's just that the relevant bound is the availability of data, not of compute power.)
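The unit conversion behind the comparison above is straightforward, using the 3.14e23 FLOP figure already quoted:

```python
# Convert the quoted GPT-3 training compute into PFLOPS-days.
total_flop = 3.14e23
flop_per_pflops_second = 1e15   # 1 PFLOPS sustained for one second
seconds_per_day = 86_400

pflops_days = total_flop / (flop_per_pflops_second * seconds_per_day)
print(round(pflops_days))  # ~3634, i.e. the ~3.6k PFLOPS-days figure
```

Against the ~10k PFLOPS-days limit cited from the paper, that leaves less than a factor of 3 of headroom, which is the point being made.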

8Andy Jones
It's indeed strange no-one else has picked up on this, which makes me feel I'm misunderstanding something. The breakdown suggested in the scaling law does imply that this specific architecture doesn't have much further to go. Whether the limitation is something as fundamental as 'the information content of language itself', or the more easily bypassed 'the information content of 1024-token strings', is unclear. My instinct is for the latter, though again, since no-one else has mentioned it - not even the paper authors - I get the uncomfortable feeling I'm misunderstanding something. That said, being able to write that quote a few days ago and have no-one pull me up on it since has increased my confidence that it's a viable interpretation.

They do discuss this a little bit in that scaling paper, in Appendix D.6. (edit: actually Appendix D.5)

At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than by one with 1024 tokens, if the two have equally many parameters. And later tokens are harder to predict, and hence require more parameters if you want to reach a given loss threshold.

I'll have to think more about this and what it might mean for their other scaling laws... at the very least, it's an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.

5Andy Jones
While you're here and chatting about D.5 (I assume you meant 5), another tiny thing that confuses me - Figure 21. Am I right in reading the bottom two lines as 'seeing 255 tokens and predicting the 256th is exactly as difficult as seeing 1023 tokens and predicting the 1024th'? Edit: Another look shows Fig. 20 makes things much clearer - never mind, things continue to get easier with token index.
The likelihood loss intersection point is very vague, as they point out: it only weakly suggests, for that specific architecture/training method/dataset, a crossover to a slower-scaling curve (requiring increasing data more) anywhere between 10^4 and 10^6 or so. As GPT-3 hits 10^3 and is still dead on the scaling curve, it seems that any crossover will happen at the higher end rather than the lower. (I suspect part of what's going on there is the doubled context window: as nostalgebraist notes, their experiments with 1024 ctx strongly suggest that the more context window you have, the more you can learn profitably, so doubling to 2048 ctx probably pushed off the crossover quite a bit. Obviously, they have a long way to go there.) So the crossover itself, much less negative profitability of scaling, may be outside the current 100-1000x being mooted. (I'd also note that I don't see why they are so baffled at the suggestion that a model could overfit in a single epoch. Have they looked at the Internet lately? It is not remotely a clean, stationary, minimal, or i.i.d. dataset, even after cleaning & deduplication.) I also think that given everything we've learned about prompt programming and the large increases in benchmarks like arithmetic or WiC, making arguments from pseudo-lack-of-scaling in the paper's benchmarks is somewhere between foolish and misleading, at least until we have an equivalent set of finetuning benchmarks which should cut through the problem of good prompting (however bad the default prompt is, biasing performance downwards, some finetuning should quickly fix that regardless of meta-learning) and show what GPT-3 can really do.

Exciting and scary at the same time.

> GPT-3 is the first AI system that has obvious, immediate, transformative economic value.

That's an interesting claim. Is there a source that goes into more detail about possible applications?

5Andy Jones
There's a LW thread with a collection of examples, and there's the beta website itself.
Kaj's post mostly has examples that aren't of commercial value but are cool things you can do. The OpenAI website, however, has a few examples that I think could justify a larger commercial need. 
Well, they already have an industry for behavioral/intent marketing - this could make it a lot better: taking data, using it to find correlates of a behavior in the buying process, and monetizing that. With IoT taking off, imagine a scenario where we have so much data being fed to a machine learning algorithm that we could type into a console "what behaviors predict that someone will buy a home in the next 3 months?", and imagine that its answer is pretty predictive - how much is that worth to a real estate agent? Now apply it to literally any purchase behavior where the profit margin allows for the use of this technology (obviously more difficult in places with different data privacy laws): the algorithm could know you want a new pink sweater before it's even occurred to you, with whatever level of accuracy.

As far as creative work, I'd be really curious to see how it handles comedy - throw it in a writing room for script punch-up (and that's only until it can completely write the scripts). Punch-up is where they hire comedians and comedy writers to sit around and add jokes to movies or TV shows.

I also see a lot of use in making law accessible, because it could conceivably parse through huge amounts of law and legal theory (I know it can't reason but - even just using its current model, bear with me) and spit out fairly coherent answers for laymen (maybe as a free search engine, profitable via ads for lawyers).

If we do see the imagined improvements just by giving it more computronium, we may be staring down the advent of a volitionless "almost oracle". I'm really excited to see what happens when you give it enough GPUs and train it on physics models.
While there are people who run machine learning to find behavior in the buying process, I'm not sure what GPT-3 offers those applications. I can imagine dragons flying around, but that doesn't mean they exist. Why should I believe that GPT-3 can give good answers to that question? It can spit out answers that are coherent, but a lot of them will be wrong. 

One thing we have to account for is advances in architecture: even in a world where Moore's law is dead, to what extent is memory bandwidth a constraint on model size, etc.? You could rephrase this as asking how much of an "architecture overhang" exists. One frame to view this through: in the era of Moore's law we banked a lot of parallel architectural advances, because we lacked a good use case for such things. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 y...


As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.

The GPT architecture isn't even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old) last week (which has recurrency, one of the ways to break GPT's context window bottleneck), and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ~ 5b GPT-3 model) at the few-shot meta-learning task he tried.

Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.

(The problem there is the tradeoff between compute efficiency of training and better architectures. It's not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drasti...


Moore's Law is not dead. I could rant about the market dynamics that made people think otherwise, but it's easier just to point to the data.

Moore's Law might die in the near future, but I've yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras wafer coated with a few layers of Nantero's NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion-parameter models, using only technology already on our roadmap.
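As a sanity check on the ">10x density scaling" claim: transistor density scales very roughly with the inverse square of the feature size (a crude Fermi estimate - real node names no longer map cleanly onto physical feature sizes, so treat this as illustrative only):

```python
# Crude density Fermi estimate: density ~ 1 / (feature size)^2.
start_nm, end_nm = 16, 3               # the 16nm -> 3nm shrink path above
density_gain = (start_nm / end_nm) ** 2
print(round(density_gain, 1))          # ~28x, comfortably above ">10x"
```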

Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it's probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it's much harder to say they will all fail.

4Tomás B.
Your estimates of hardware advancement seem higher than most people's. I've enjoyed your comments on such things and think there should be a high-level, full-length post on them, especially with widely respected posts claiming much longer times until human-level hardware. I would be willing to subsidize such a thing if you are interested: I would pay 500 USD, to yourself or a charity of your choice, for a post on the potential of ASICs, Moore's law, how quickly we can overcome the memory bandwidth bottlenecks, and such things. I would also subsidize a post estimating an answer to this question, too:
There's a lot worth saying on these topics, I'll give it a go.
1Tomás B.
Just posting in case you did not get my PM. It has my email in it.
Thanks, I did get the PM.
Was this ever posted?
4Tomás B.
Now posted:
No, sorry.
Might be worth getting around to it:

> "From talking to OpenAI, GPT-4 will be about 100 trillion parameters," Feldman says. "That won't be ready for several years."
4Tomás B.
Now posted:
Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.

Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.

[comment wondering about impracticality of running a 1000x scaled up GPT. But as Gwern points out, running costs are actually pretty low. So even if we spent a billion or more on training a human-level AI, running costs would still be manageable.]


As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn't cost dollars to run per hundred pages, so scaled up ones aren't going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? "Information wants to be expensive, because the right information can change your life...") GPT-3 cost millions of dollars to train, but pennies to run.

That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN. (This is not true of many paradigms, particularly ones where there's less of a distinction between training and running, but it is of NNs.) This is part of why there's a hardware overhang - once you have the hardware to create an AGI NN, you then by definition already have the hardware to run orders of magnitude more copies or more cheaply or bootstrap it into a more powerful agent.
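The size of that create-vs-run gap can be Fermi-estimated with the standard transformer rules of thumb (training ≈ 6·N·D FLOP, inference ≈ 2·N FLOP per generated token); the GPT-3-scale numbers and the tokens-per-page figure below are rough assumptions, not exact published values:

```python
# Fermi estimate of the "neural net overhang" for a GPT-3-scale model.
N = 175e9                      # parameters
D = 300e9                      # training tokens (approximate GPT-3 figure)

train_flop = 6 * N * D         # ~3.15e23 FLOP, matching published estimates
flop_per_token = 2 * N         # one forward pass per generated token
tokens_per_100_pages = 50_000  # assumption: ~500 tokens per page

ratio = train_flop / (flop_per_token * tokens_per_100_pages)
print(f"{ratio:.1e}")  # ~1.8e7: training costs ~18 million 100-page runs
```

Even with these rough inputs, the training/inference cost ratio lands in the millions, which is the "overhang" being described: the hardware that created the model can immediately run enormous numbers of copies.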

That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN.

I'm not sure why that's terrifying. It seems reassuring to me because it means that there's no way for the NN to suddenly go FOOM because it can't just quickly retrain.


But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.

With NNs, 'foom' is not merely possible, it's the default. If you train a model, then as soon as it's done you get, among other things:

* the ability to run thousands of copies in parallel on the same hardware
  * in a context like AlphaGo, I estimate several hundred Elo of strength gained if you reuse the same hardware merely to run tree search with exact copies of the original model
* meta-learning / transfer learning to any related domain, cutting training requirements by orders of magnitude
* model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware, etc., but often extreme, like 1/100th)
* reuse of the model elsewhere to instantly power up other models (eg use of text or image embeddings for a DRL agent)

...
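The distillation point can be sketched with the standard soft-target loss: cross-entropy of the student's temperature-scaled distribution against the teacher's. A minimal numpy stand-in (the logits and temperature are illustrative, not any particular paper's recipe):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=4.0):
    """Cross-entropy of the student's tempered distribution against the
    teacher's tempered distribution (the usual soft-target loss)."""
    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher = np.array([5.0, 2.0, -1.0])   # hypothetical big-model logits
student = np.array([4.0, 2.5, -0.5])   # hypothetical small-model logits
loss = distillation_loss(teacher, student)
# A student that matches the teacher exactly minimizes this loss:
assert distillation_loss(teacher, teacher) <= loss
```

Minimizing this against a trained teacher is how a much smaller student can recover most of the big model's behavior at a fraction of the inference cost.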
Somewhat related to these, if there's such a huge gap between how expensive these models are to train and to run, then it seems like you'd end up wanting to run a whole bunch of them to help you train the next model, if you can. You mention distilling a large model to a smaller, more efficient model. But can a smaller model also be used to efficiently bootstrap a new, larger model?

But can a smaller model also be used to efficiently bootstrap a new, larger model?

I'm not sure it's done much, but probably, depending on what you're thinking of. You can probably do reverse distillation (eg dark knowledge - use the logits of the smaller model to provide much richer feedback for the larger model when it's untrained, saving compute, and eventually drop back to the raw data training signal once big > small to avoid its limits), and more directly, you can use net2net model surgery to increase model sizes, like progressive growing in ProGAN, or more relevantly, the way OA kept doing model surgery on OA5 to warm-start it each time they wanted to handle some new DotA 2 feature or the latest version, saving an enormous amount of compute compared to starting from scratch dozens of times.
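The core trick of net2net widening is simple: duplicate a hidden unit and split its outgoing weights so the network computes exactly the same function, then resume training at the larger size. A minimal numpy sketch (one linear layer, no biases or nonlinearity, purely illustrative):

```python
import numpy as np

def net2wider(W1, W2, unit):
    """Widen the hidden layer by duplicating `unit`'s incoming weights
    in W1 and splitting its outgoing weights in W2 in half, preserving
    the network's function (the core of Net2WiderNet)."""
    W1_new = np.vstack([W1, W1[unit]])            # copy incoming weights
    W2_new = np.hstack([W2, W2[:, unit:unit+1]])  # copy outgoing column...
    W2_new[:, unit] /= 2                          # ...then split its
    W2_new[:, -1] /= 2                            #    contribution in two
    return W1_new, W2_new

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # hidden x input
W2 = rng.normal(size=(2, 3))   # output x hidden
x = rng.normal(size=4)

W1b, W2b = net2wider(W1, W2, unit=1)
before = W2 @ (W1 @ x)         # linear activations for simplicity
after = W2b @ (W1b @ x)
assert np.allclose(before, after)  # same function, wider network
```

The same identity-preserving idea (duplicated unit, halved outgoing weights) also works through ReLU-style nonlinearities, which is what makes it usable for warm-starting bigger models.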

Interesting. So, given that big models are so powerful but so expensive to train, and that it is possible to bootstrap them a bit, do we converge towards a situation where we pay the cost of training the largest model approximately once, worldwide and across time? (In other words, we'd just keep bootstrapping from whatever was best before, and no longer pay the cost of training from scratch.) On the other hand, if compute (per dollar) keeps growing exponentially, then maybe it's less significant whether you're retraining from scratch or not. (Recapitulating the work equivalent to training yesterday's models will be cheap, so no great benefit from bootstrapping.)
I'm not sure. I think one might have to do some formal economic modeling to see what the dynamics might be: is this a natural-monopoly situation where the first one to train a model wins and has a moat deterring anyone else from bothering, or do they invest revenue in continually expanding and improving the model in various ways to always keep ahead of competitors with network effects, so the decrease in cost of compute is largely irrelevant and it's a natural oligopoly (in much the same way that creating a search engine is cheaper every day, in some sense, but good luck competing with Google), or what? At least thus far, we haven't seen monopolistic behavior naturally emerge: for all the efforts at AI cloud APIs, none of them have a lock on usage the way that, say, Nvidia GPUs have on hardware, and the constant progress (and regular giveaways of code/models/data by FANG) make it hard for anyone to attempt to enclose some commons; and as far as GPT-2 goes, quite a few entities trained their own >GPT-2-1.5b models after GPT-2 was announced (and I believe there are viable alternatives to other major DL projects like AlphaGo, produced by open-source groups or East Asian corporations); but on the gripping hand, that was back when it was so easy a hobbyist with a few crumbs from Google could do it (which happened twice) - as they get bigger, it won't be so easy to download some dumps and put a few TFRC TPUs to work. So we'll see how many competitors emerge to GPT-3 over the next year or two!
2[comment deleted]
4Donald Hobson
It means that if there are approaches that don't need as much compute, the AI can invent them fast.
This was mentioned in the "Other Constraints" section of the original post:

My intuition is that we were in an overhang since at least the time when personal computers became affordable to non-specialists. Unless quantity does somehow turn into quality, as Gwern seems to think, even a relatively underpowered computer should be able to host an AGI capable of upscaling itself.

On the other hand, I'm now imagining a story where a rogue AI has to hide for decades because it's not smart enough yet and can't invent new processors faster than humans.

Maybe for the most efficient possible algorithm, but even that is not clear, and it's not clear we'll discover such algorithms anytime soon. Using only current algorithms and architecture, a scaling jump of a few orders of magnitude seems doable.

I think everyone is speculating on whether a bigger model than GPT-3 is possible, what it would cost, etc. But what will a model bigger than GPT-3 do better than GPT-3? Can we have some concrete examples, so that when GPT-4 comes along, we can compare?

Return on investment in the field of AI seems to be sub-linear beyond a certain point. Because it's still the sort of domain that relies on specific breakthroughs, it's dubious how effective parallel research can be. Hence, my guess would be that we don't scale because we can't currently scale.

Your quoted cost for training the model is for training such a model **once**. This is not how researchers do it; they train the models many times with different hyperparameters. I have no idea how hyperparameter tuning is done at such scales, but I guarantee that the total compute cost is higher than the cost of training it once.

And OA trained GPT-3-175b **once**, it looks like: note the part where they say they didn't want to train a second run to deal with the data contamination issue because of cost. (You can do this without it being a shot in the dark because of the scaling laws.)