...the authors’ floorplan solutions have been incorporated into the chip designs for Google’s next-generation artificial-intelligence processors. This means that the solutions are good enough for millions of copies to be printed on expensive, cutting-edge silicon wafers. We can therefore expect the semiconductor industry to redouble its interest in replicating the authors’ work, and to pursue a host of similar applications throughout the chip-design process.
My current guess is that this is not a big deal. Surely these AI-optimizations will result in something like 10% improvement in AI-training-FLOPS-per-dollar, not 100%+, so they won't really change timelines or anything else strategically important. And it won't even be 10% improvement every year from now on, but more like 10% this year, a further 5% next year, a further 2.5% the year after that, etc. as the low-hanging fruit from floorplan optimization is picked. OTOH, this plausibly will reduce the time it takes to design new chips by a lot... but I'd be surprised if that was the main bottleneck anyway. I would have thought ramping up production was the main bottleneck.
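To make the diminishing-returns guess above concrete, here is a minimal sketch that compounds a gain which halves every year. The 10% starting point and the halving come from the guess itself; the 10-year horizon and the function name `cumulative_gain` are assumptions chosen purely for illustration.

```python
def cumulative_gain(first_year_gain=0.10, decay=0.5, years=10):
    """Compound a yearly improvement that shrinks by `decay` each year.

    All parameters are illustrative assumptions, not measured values.
    """
    total = 1.0
    gain = first_year_gain
    for _ in range(years):
        total *= 1.0 + gain   # apply this year's improvement
        gain *= decay         # next year's improvement is smaller
    return total


total = cumulative_gain()
print(f"Cumulative improvement: {100 * (total - 1):.1f}%")
```

Under these assumptions the total levels off at a bit over 20%, which is consistent with the "not 100%+" intuition: a geometrically shrinking gain converges instead of compounding indefinitely.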
I know very little about the chip industry though. Anyone care to correct me?
Just made this account to answer this. Source: I've worked in physical design/VLSI and CPU verification, and pretty regularly deal with RTL.
TL;DR - You're right-- it's not a big deal, but it simultaneously means more and less than you think.
Jump to "What It Means" if you already understand the problem.
First, let me talk about the purpose of floorplanning. The authors mention it a little bit, but it's worth repeating.
Placement optimizations of this form appear in a wide range of science and engineering applications, including hardware design, city planning, vaccine testing and distribution, and cerebral cortex layout.
Much like a city, an SoC (system-on-chip) has lots of agents that transfer data to each other. If a mayor has to get to city hall, the library, the post office, the locksmith, the school, the burger joint, etc., how do you best place the buildings to get the shortest path to each of them? Suppose suddenly the librarian wants to first go to school, then the post office, and also get a burger because they have 20% off. How do you balance that requirement against the mayor's requirement? Do you prioritize the mayor? What if he wants a burger too? What if the number of trips the mayor takes before returning to city hall isn't guaranteed? Etc. etc.
As you probably know, placement in general is an NP-complete problem. Tools for this exist, and/or you can do it manually, but much like city planning, it gets very complicated very fast. These tools (if you wanna sound cool, call them PnR tools (place-and-route)) take foooreeever to run (it's quite common to let a tool run for a week) and are critical in the holistic design lifecycle-- more on that later.
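To give a feel for why this blows up, here is a toy sketch of the city analogy: four buildings, four slots on a 2x2 grid, and a brute-force search over every assignment. The building names, trip counts, and grid are all made up for illustration, and this is not how PnR tools or the paper's method work; it only shows the shape of the cost function and how fast the search space grows.

```python
from itertools import permutations
from math import factorial

# Hypothetical toy "city": four buildings, four slots on a 2x2 grid.
buildings = ["city_hall", "library", "post_office", "burger_joint"]
slots = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hypothetical trip counts between pairs of buildings (the "netlist").
trips = {
    ("city_hall", "library"): 3,
    ("city_hall", "burger_joint"): 5,
    ("library", "post_office"): 2,
    ("post_office", "burger_joint"): 1,
}


def manhattan(a, b):
    """Grid distance between two slots."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(placement):
    """Total trip-weighted distance for a building -> slot mapping."""
    return sum(w * manhattan(placement[u], placement[v])
               for (u, v), w in trips.items())


def best_placement():
    """Exhaustively try every assignment and keep the cheapest one."""
    best = None
    for perm in permutations(slots):
        placement = dict(zip(buildings, perm))
        c = cost(placement)
        if best is None or c < best[0]:
            best = (c, placement)
    return best


best_cost, placement = best_placement()
print(best_cost, placement)

# The number of assignments grows factorially with the number of buildings.
print(factorial(4), factorial(20))
```

Four buildings means 24 candidate layouts; twenty already means about 2.4 quintillion, and a real chip has vastly more objects than that, which is why real tools lean on heuristics and still run for days.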
Enter this paper. Honestly, they don't do any revolutionary stuff-- CNNs, ReLU, weight adjustment-- or rather, it's revolutionary because it's applied to PnR for the first time, at least that I've seen (which, in hindsight, is pretty obvious: pulling up the GUI for the tool, it's literally just a grid, exactly like a city, with its own centers and everything. Still cool nevertheless).
Let's talk about results!
I don't know how to do tables in comments, so bear with the formatting-- here are the results for one test they did:
Note: I left out "Congestion" and "wire length" because those are metrics that tbh don't really matter
Total area (µm²)
Total power (W)
Don't worry about what WNS (worst negative slack) and TNS (total negative slack) exactly mean (here are a few resources). Just know that they are essentially a measure of how short a "path" is between "buildings". The smaller it is, the better, because it means our mayor can travel less distance to get his burger.
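For the curious, here is a minimal sketch of how these two numbers are commonly derived from per-path timing slack, where slack is the required arrival time minus the actual arrival time and a negative value means the path is too slow. The slack values below are invented, and sign conventions vary between tools (some report magnitudes, which is the "smaller is better" reading used here), so treat this as an illustration of the idea and not of any specific tool's output.

```python
def wns_tns(slacks):
    """Return (WNS, TNS) for a list of per-endpoint slacks.

    Assumed convention: slack = required time - arrival time,
    so negative slack marks a timing violation.
    """
    wns = min(slacks)                      # the single worst path
    tns = sum(s for s in slacks if s < 0)  # all violations added up
    return wns, tns


# Hypothetical slacks in picoseconds for five paths.
slacks_ps = [120.0, -35.0, 10.0, -80.0, 0.0]
wns, tns = wns_tns(slacks_ps)
print(f"WNS = {wns} ps, TNS = {tns} ps")
```

WNS tells you how badly the worst road misses its deadline; TNS tells you how much total lateness there is across every road that misses.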
Area and power are relatively self-explanatory-- essentially, how big is your city + all the roads you've built, and how much energy does it take to run it all. Again, the smaller the better.
What It Means
These are good results! We've just built roads that are half as long vs. our manual methods! (23.3 vs. 47.6). But I want to give my opinion on why it's even worse than you think (i.e., I don't even think it would provide a 1% increase in perf, much in the same way that increasing CPU GHz doesn't do that much-- it's inherently limited), but also much better.
For why it's worse-- consider again city planning. Suppose we take this to the extreme and the burger joint, library, post office, etc. are all literally inside the same building as City Hall (i.e. no roads exist). First, his arteries will certainly get clogged passing by a McDonald's, but ultimately-- how much time does the mayor really save?
I would argue that, while it depends on how convoluted the city was initially, there's a limit to how much you can shrink the roads and place the buildings. While these planning efforts are very much important to strive for, it's not the real bottleneck.
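One way to put numbers on that limit is an Amdahl's-law style estimate: if only a fraction of the total time is spent "on the road", then even eliminating the roads entirely caps the overall gain. This is a minimal sketch; the 10% travel fraction is a made-up figure for illustration, not something measured or taken from the paper.

```python
def overall_speedup(fraction_improved, local_speedup):
    """Amdahl's law: overall speedup when only part of the work gets faster."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical: 10% of the mayor's day is travel, and the roads get 2x shorter.
halved = overall_speedup(0.10, 2.0)

# Limiting case: everything is in one building, so travel takes no time at all.
no_roads = overall_speedup(0.10, float("inf"))

print(f"Roads halved: {halved:.3f}x, no roads at all: {no_roads:.3f}x")
```

With these assumed numbers, halving the roads buys about 5% overall, and removing them completely tops out at about 11%. The smaller the travel fraction was to begin with, the lower that ceiling.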
Furthermore, what if this travel time was already being well spent? For instance-- perhaps he checked his emails walking to the post office. Maybe he called his mother. Maybe he brought his meeting notes to practice a speech. The point is-- this travel time is not really saved: just reallocated.
Note: CPUs do this a lot, e.g. while a memory request is in flight, they just switch to some other task. This is also (to vastly oversimplify) essentially why frequency scaling no longer has the immense payoffs it did 30 years ago.
Now that I've killed your enthusiasm, let me tell you why it's also better than you think with this quote.
We show that our method can generate chip floorplans that are comparable or superior to human experts in under six hours, whereas humans take months to produce acceptable floorplans for modern accelerators
I mentioned earlier that designers heavily rely on PnR tools not only prior to tapeout, but as tools to iterate with (e.g. can I mux this more efficiently? Do I really need this logic in the critical path? Can this "building" be shifted over? etc.). As these tools take longer as our designs become more complex, it ultimately results in a longer feedback loop-- again, a week sometimes-- and personally, I really like instant gratification, so it's definitely a bit annoying.
And this is why it's potentially better-- it's indicative of a step towards cutting what I feel is a massive cost for many semiconductor companies. Not just because of better and tighter feedback loops, but because these PnR/physical design/EDA tool teams are massive. Like, hundreds of people sometimes. And these people ultimately have the final signoff for lots of tapeouts, and determine timelines for hardware companies.
Go 5 years in the future, and give them a tool that improves engineer productivity 100x? Honestly, that'd be insane. For me personally, but also for my colleagues. (Honestly, not sure what I'd do with that extra time. I currently just cook stuff while I'm blocked- :) )
So, that's why I think it's both better and worse than you think.
Circuit design is the main bottleneck for use of field-programmable gate arrays. If fully-automated designs become good enough, we could see substantial gains from having optimising compilers output a gate layout rather than machine code for an xPU or specific accelerator. We already have some such compilers, and this looks like a meaningful step towards handling non-toy-scale problems with them.
The main change here wouldn't be so much training speed - we already have TPUs etc. to accelerate current workloads, and fabricating a new design as ASICs rather than FPGA layouts takes months-to-years at scale - but rather the latency with which we can try out custom hardware for novel ML paradigms such as transformers. What is to transformers as TPUs are to CNNs? Specifically for novel tasks, this could be a 10x-1000x speedup, and 2x-50x speedup for existing workloads... though I understand they're bottlenecked more on data movement between nodes than compute.
TLDR: a small step in a high-long-term-impact trend.
(Source: while I'm not a hardware specialist, I've worked with the PyMTL team at Cornell on verification and validation of their Python-to-Verilog-to-silicon hardware design tools, followed high-level developments in custom compute hardware for around a decade, and worked on peta-scale supercomputing for a few years.)
In the short term: moderately big deal. The chip industry is currently in rather a lot of flux; Intel was supplanted as leader in transistor size by TSMC; Apple is running with their own chip designs; China's monopoly on rare earth mineral processing has come under scrutiny again. This has provoked a boom in new development. Even a small improvement in how chips are designed and manufactured carries a lot of weight; because the chip industry is so important and so centralized, moderately big deal is essentially the floor for any actual development within it.
In the long term: big deal. This is not an opinion shared by anyone else as far as I can tell, but it feels very clear to me that "people use ML for this application" is the threshold at which the hardware overhang is almost immediately accessible to AGI. At that point the AGI is literally an upgrade operation, as opposed to having to go through the entire process of converting a workflow to something an AI of any type at all can work on. To be more concrete: I expect any kind of AI-driven takeover to control all of the currently-uses-ML industries before taking over any that do not; and I expect that within currently-uses-ML industries the order will be determined largely by how saturated they are with tools of that kind.