The pace of progress, 4 years later

Veedrac

It has been a long four years since I wrote Moore's Law, AI, and the pace of progress, a post about the room we have for AI compute to scale. Late 2021 had given us a year to absorb GPT-3, the Scaling Hypothesis, and Meena, but was a year before ChatGPT hit the stage, and, well, you probably know the rest.

While four years isn't quite enough time to test the everything I claimed, there's plenty to cover. Let's jump in.

Disclaimer: I'm going to use AI assistance liberally for scanning data and documents, and don't want to commit to noting this at every point.

0. Preface: Popular conception

In late 2021 I quoted from Jensen Huang, or as I so quaintly put it then, Jensen Huang, CEO of NVIDIA, a GPU company. “Moore's Law has finished.” Huang still claims Moore's Law is dead.

More interestingly than some guy from some company you might have heard of, the International Roadmap for Devices and Systems was quoted as warning that certain limits would be hit. What do they say now?

Roughly, several specifics from the IRDS roadmap have been significantly delayed, which is put down in part to greater demand for 'advanced packaging'. If I understand it, the claim is that DTCO and increased integration have lessened the need for pitch scaling.

Pitch scaling may be slowing down due to DTCO innovations and advanced packaging.

They give backside power routing as an example.

Moving power routing to the back side of the chip enables substantial reduction in the area of the unit cell without shrinking any critical dimensions.

The roadmap the originally projected 8nm half-pitch in 2028 now puts it in 2033, though it does extend down a little further, to 7nm the following year. Personally this reminds me a lot of the continued delays to EUV, where the hard version of the scaling task was repeatedly put off in favor of effective short-term alternatives, like scaling up multiple-patterning.

It's not obvious to me if this is unduly generous to the progress made. Whether this is evidence that scaling hit a plateau, or whether it's evidence that we've kept finding low-hanging fruit and the roadmaps keep stretching longer, seems unanswered.

I structured my argument into these points:

Current data shows much stronger current-day device scaling trends than I had expected before I saw the data.
Claimed physical limits to device scaling often greatly undersell the amount of scaling that could be available in theory, both in terms of device size and packing density.
Even if scaling down runs out, there are plausible paths to significant economic scaling, or if not, the capital and the motivation exists to scale anyway.
The potential size of AI systems is effectively unbound by physical limits.

1. What the (new) data shows

This broke into several sub-claims that I'll tackle individually.

“Transistor scaling seems surprisingly robust historically.”

My transistor density plot previously showed leading density from Apple's M1 at a density of transistors/mm².

Our timing isn't great, with TSMC 2nm in high volume production but no chips released, and M1 being above-trend. Nonetheless, there is a stark lack of progress.

TSMC's 3nm 2023 node, and their later 3E revision, were barely bumps, bringing about a 35% density bump to the densest chips we have today.
Much of the failure to scale in TSMC's 3nm was down to SRAM scaling, or the lack thereof. WikiChip Fuse said in 2023, “We now know that the N3E SRAM bitcell is identical to N5”, and they ask: Did We Just Witness The Death Of SRAM?
We are below-trend, albeit not significantly more below-trend than we've frequently been before major node releases in the past.
Both TSMC and Intel claim their 2nm / 18A nodes will increase SRAM density, to 38 Mb/mm², which is shy of a 20% jump.

I think this evidence is compatible with both takes, but certainly favours Moore hitting issues. Logic is still scaling, and hardware is most certainly getting better, but if we're only getting 20% SRAM scaling after such a long gap, that's still running head-first into a major bottleneck.

Overall I think my summary was fairly solid, but I absolutely should have clocked how critical the SRAM wall was going to be.

“Compute performance on AI workloads should increase with transistor scaling.”

This is hard to judge in part because accelerator performance has increased by an absurd amount in the last few years, in part due to a specialization around putting lots of transistors into systolic arrays. Performance has increased so fast that the claim being measured is lost in the noise.

“Related scaling trends are mostly also following transistor density.”

I covered a few things, most notably interconnect bandwidth. Again, performance has increased so fast that the claim being measured seems lost in the noise. The ambitious targets I posted about seem achieved, if not exceeded.

Scaling out with strong interconnects has mostly outpaced multi-chip architectures for compute, but multi-chip architectures have indeed become ever more popular, and HBM stacking has continued to climb.

“DRAM is expensive and no longer scaling.”

Quick measurements of DRAM suggest scaling has continued at its slowed pace, around 2x/decade. The situation has been well-detailed by SemiAnalysis in The Memory Wall: Past, Present, and Future of DRAM.

While this might seem like it matches what I said, this is actually more scaling than I was expecting over this time period! DRAM is also more likely to survive than I expected due to an increasingly strong plan to move to 3D DRAM, which has a separate and much more aggressive scaling trend. Note that 3D DRAM is a monolithic process involving significant changes to the cells, distinct from HBM, which stacks separate more-standard DRAM dies.

“When trends stop, they seem to do so suddenly, and because of physical constraints.”

I don't think it's clear how well this claim has lived! It seems still largely true, but poking at the details, the details seem more important than I predicted. For example, while DRAM scaling has moved suddenly to a much slower scaling regime, it is still scaling reasonably steadily. The sudden halt to SRAM scaling could count here, but it's much too early to call its long-term behaviour.

2. There (still) is Plenty [more] Room at the Bottom

Again, this broke into several sub-points.

“IRDS roadmaps already predict enough scaling for significant short-term growth.”

We've already discussed how the IRDS roadmap was delayed, while in part displaced by other gains. We've also already discussed how IRDS roadmaps continue to show significant opportunity for short-term scaling. I'll leave it to you to interpret how this looks in retrospect.

“3D stacking can unlock orders of magnitude of further effective scaling.”

Still too speculative to judge.

“Memory has a large potential for growth.”

Two major changes have occurred in this time period:

AI has scaled up with an unprecedented hunger for high density random access memory.
The explosion of nonvolatile RAM technologies that were around 2021 seem to have all gone quiet, particularly after 3D XPoint died.

I remain confident that the underlying argument is sound. Memory can scale in principle, and if it's going to bottleneck us then humanity will look for ways around it. I think I qualified that part of the argument fine.

But I definitely underweighted ‘the most boringly normal possible solution to your problem is the most likely one’, particularly with respect to 3D DRAM. I also think I overestimated how much help it is to have ten prospective solutions to one problem. Novel nanotechnologies just have a really, really bad base rate, and while I was intentional in accounting for it, I still didn't account for it enough.

“Integrated systems for training can get very large.”

I regret writing this section with so much emphasis on Tesla's D1, aka. their failed custom chip for Dojo. That said, I think the text holds up pretty well, and from what I've heard, D1 failed for funnier reasons than the ones you'd guess.

Broadly, everyone in the industry is scaling up, a lot, in basically compatible ways to what I wrote, albeit mostly incrementally rather than jumping the whole distance at once. I also mentioned stacking memory on top of chips — unsurprisingly, given the time frame, people are mostly just using larger HBM stacks.

3. How much die could a rich man buy?

Only two sub-points this time.

“There exist plausible prospective technologies for making fabrication cheaper.”

This section is hard to grade because it's so forward-looking, but Canon delivers first nanoimprint lithography tool, and IRDS mentions the technology is “rapidly improving its performances in terms of defectivity, throughput and alignment.”

“Funding could scale, and that scale could buy a lot more compute than we are used to.”

Yeah.

4. And I would think 1,000 miles

So I really didn't think this thought experiment would have gradable components but the richest man in the world basically tweeted it as an actual plan for actual reality a few weeks ago. So, uh. Yeah.

4.1 (Quantum parenthetical)

Quantum computers keep progressing at a clip, but there's still a ways before they become practical, and the tail end of timelines seem less and less relevant to any useful models of the actual future as it will actually happen. I've already paid my Bayes points for daring to assign meaningful tail probabilities, so I don't think there's much more for me to learn.

5. End note

Optical

Optical keeps happening but it's unsurprisingly taking its time.

Energy efficiency

jacob_cannell made some excellent points about energy efficiency in the comments, and then later wrote an exceptional post readers might find of interest, Brain Efficiency: Much More than You Wanted to Know (I caution readers that while technically excellent, it overstates the domain it applies to, and so claims more than it should).

I mostly visit this point because of jacob's concrete prediction:

I predict that for standard available GPUs/TPUs/etc (irreversible parallel von-neumann machines), about 65% chance we can squeeze about 10x more ops/J out by 2028 (Moravec's prediction of AGI), and only about a 10% chance we can squeeze out about 100x more ops/J.

I claimed to be more pessimistic than that. That seems gradable. Alas, this is not all that easy to grade given the chips have moved a larger fraction of compute to smaller bit widths.

The definitely-impartial not-at-all-gameable Claude estimates that the growth to today has been ~2.9x in FP16 and INT8, and claims flat extrapolation gives roughly 8x by 2028.

That's all, folks

If you found this interesting, go write your own retrospective on your own post.

[-]anaguma2mo10

That said, I think the text holds up pretty well, and from what I've heard, D1 failed for significantly funnier reasons than the ones you'd guess.

Can you say more?

[-]Veedrac2mo30

I edited out the word 'significantly', which in retrospect was misleading.

I'd prefer not to repeat what I've heard. In case I'm making this sound more mysterious than it is, I will note that you're not missing out on any juicy gossip. Nothing I heard in passing would be material to much.

LESSWRONG
LW