Cross-posted here on the EA Forum.
Transformative AI and Compute - A holistic approach - Part 4 out of 4
This is the Appendix (part four) of the series Transformative AI and Compute - A holistic approach. You can find the sequence here and the summary here.
This appendix attempts to:
You can find the previous post "Compute Governance and Conclusions [3/4]" here.
This appendix lists research questions, some thoughts on metrics related to compute, and a list of AI hardware startups.
These paragraphs ended up in the Appendix, as I have spent less time on them and it is more of a draft. Nonetheless, I still think they can be potentially useful.
This is not a research agenda or anything close to it. This is me doing research and thinking “Oh, I’d be curious to learn more about this *scribbling it down*”.
The questions rank from specific to nonspecific, from important to “I’m just curious”. There are a couple of questions which I think are more important, I’ve marked them with a ★.
This list is also published as a Google Doc, so one can add comments to individual items.
In general, this domain of research requires more:
Figure A.1: Sketch of research domains for AI and Compute.
CSET is already doing most of the work within this domain. I expect my questions could be answered by reading through all of their material.
I have discussed various trends related to compute, and those trends often relied on metrics which we have then investigated over time. While metrics related to compute might initially seem more quantifiable than other AI inputs (such as talent, algorithms, data, etc.), I have encountered various caveats during my research. Nonetheless, I still think that compute has the most quantifiable metrics compared to the others.
In this Appendix B, I briefly want to present commonly used metrics, discuss some caveats, and propose some ideas to address those shortcomings.
The presented forecasting model for AI timelines is informed by one hardware-related metric: FLOPs/s per $. Algorithmic efficiency and money spent on compute are hardware independent. They are multiplicative with the FLOPs per $ (see Section 4.1).
We have the following options to make progress or acquire more compute. Assuming all the other metrics are constant, we can either:
While metrics, such as the price and energy usage, are more clearly defined and measurable, the computing performance is more than just FLOPs/s and depends on various factors which I will discuss.
I will present metrics connected to the three basic components: logic, memory, and interconnect.
Presented in Section 1, our atomic entity is the operation. FLOP refers to a floating point operation which is a type of operation on a specific number representation, a floating point number. Nowadays, this term is commonly used interchangeably, independent of the number representations.
However, it is of importance which number representation is used, as this defines the performance of the hardware. Commonly used in Machine Learning are float16, float32, bloat16, int8, and int16. For an introduction to number representations see here or here.
Also commonly used is Petaflop/s-day. It's also a quantity of operations. A petaflop/s is 1015 floating point operations per second for one day. A day has 84,400 seconds≈1015. That makes 1020 FLOPs.
A quantity of operations does not tell us how long it will take to process them. For a given quantity of operations (X FLOPs) executed in a given time, we have the metric: FLOPs per second (FLOPS or FLOPs/s). This is the most common metric used for the processing power of hardware or whole datacenters (e.g., see the Top 500 List).
If we now take the purchase price of the hardware into considerations, we can now divide the computing performance in FLOPs/s by the purchase price to learn about FLOPs/s per $. This puts different hardware into comparison and tells how many operations specific computing hardware can execute per second for a single dollar.
Note that this metric commonly ignores the operation costs of hardware that is partially defined by FLOPs/s per Watt.
FLOPs/s per Watt is similar to FLOPs/s per $. However, instead of the monetary costs it considers the energy costs. It describes the efficiency of the hardware.
Energy efficiency is a key metric in computer engineering, as hardware is often limited by its heat dissipation (see power wall). A higher energy efficiency often allows us to increase the clock frequency or other components which then allows us to compute more FLOPs/s.
The operation costs — here the energy cost — is often excluded in FLOPs/s per $ estimates, as the initial purchase price makes up most of the cost.
However, I think that considering all the other operational costs, such as cooling, system setup, system administration, etc, might change this picture and would be interested in estimates.
The memory capacity is described in Bytes or Bits and is familiar to most people. Most of the confusion is usually around the different types of memory: cache, on-board memory, RAM, HDD, SDD, etc.
One of the most important to concepts to understand is memory hierarchy. It separates different storage types in regards to their response time, bandwidth and capacity. The heuristic: low response time [in seconds] and high bandwidth [in Byte per second] leads to lower memory capacities.
Figure B.1: Memory hierarchy in computing systems (Source).
Why is the memory capacity of importance for the performance of the system? As already discussed in Section 1.1, your logic can be bottlenecked if it does not get supplied with enough data. Therefore, data locality is important. If you are training an AI system, you would like your weights and parameters to be as local as possible to enable fast read and write access. Therefore the onboard memory of your GPU is often important (as a minimum you should fit all of your parameters on it). Whereas it is theoretically possible to offload them to other memory systems, this comes with a significant increase in training time and makes it usually unfeasible.
We have introduced metrics related to the logic and the memory. To now move data from the memory to be processed in the logic element, we require an interconnect.
The memory bandwidth is described in Bytes per second, short B/s. As previously described, this memory bandwidth depends highly on the locality. Moving data from the on-board (e.g. called GPU memory) to be processed in the logic is orders of magnitude faster than, for example, if you need to communicate data from local on-board memory, to a different board via various interconnects.
To address the shortcomings of only measuring the computing capabilities of the logic elements (more in B.2), the measure “traversed edges per second (TEPS)” was introduced. It is a measure of the interconnect capabilities and computational performance.
This is an important measure, especially for high performance computing, as those datacenter consist of hundreds of individual computers and data needs to be moved around from computer to computer — communication through various levels of memory hierarchy.
The Graph500 is a rating based on this metric. AI Impacts also wrote a post on the cost of TEPS.
This is just a brief selection of relevant metrics. I’ve focused on those that I have seen commonly used in my references.
This section discusses some shortcoming of the presented metrics. I list the caveats in order of my felt importance (from more important to less important).
The FLOP/s on datasheets present peak performances of the hardware — what the hardware is able to process, assuming a perfectly balanced workload. Nonetheless, this is rarely the case in the real world. Our hardware is often only processing X FLOP/s of the theoretical achievable Y FLOPs/s. We refer to this as the utilization: XY.
The achieved utilization depends on various factors, I briefly discuss communication, parallelism, and software.
Also, we discuss some ideas around utilization in an upcoming piece on estimating compute (here it is).
"A supercomputer is a device for turning computational complexity into communication complexity."
"A supercomputer is a device for turning computational complexity into communication complexity."
— Beth's commented quote on Crox's post.
I have discussed in B.1 Interconnect, and Section 1, that the interconnect is crucial to the operations of the logic. If the logic is not supplied with enough data, or the interconnect cannot write the results fast enough, the logic is bottlenecked by the interconnect. This behavior is not captured by peak FLOPs/s.
How big of a deal is this? I think it’s a big deal. The memory-processor gap is well known within the computing community (see this, the roofline model, memory-bound, or compute-bound).
An example is the specifications from NVIDIA’s V100 to A100. The A100 is the succeeding generation. This generation steps focused next to doubling the tensor performance, also on doubling the on-board bandwidth but also the off-board interconnect. Whereas, not focusing on the FP32 or FP64 performance, only increasing it marginally.
AI hardware builds on parallelism on various layers. Parallelism in the chip architecture, but also up to having multiple computing systems where we distribute the workload.
However, for this we require a workload which is parallelizable. Most AI workloads are highly parallelizable and this is the reason why we switched from CPU to GPUs/TPUs. However, not every workload is perfectly adapted for the underlying hardware. For this reason the peak performance is rarely achieved, as all the components (such as network architecture, hyperparameters, etc.) would need to be matched to the underlying hardware.
This also means, 10 GPUs with each 10 TFLOPs/s do not equal 100 TFLOP/s. The parallelizability and the communication overhead reduces the achieved performance and tweaking the workload and distribution is necessary.
OpenAI discusses some ideas around parallelism in the piece: “An Empirical Model of Large-Batch Training”.
Software, such as compilers and libraries, helps us to translate our high-level languages into actual machine-readable code. This machine-readable code should then leverage the features of the hardware architecture, such as the parallel architecture. Examples of this are CUDA, Triton, or XLA. Ideally, this software helps with the above-outlined problem of parallelism and communication. However, this is hard and some problems are NP-hard or NP-complete. Consequently, the peak performance is rarely achieved.
Talking about the amount of FLOPs or FLOP/s without mentioning the number representation gives limited insights. Especially the bit width, such as float16, or float32, is important. Short bit-widths can be processed faster and also moved around faster round — as the corresponding in- and outputs are smaller in size. Already now, we see that integer processing comes with significant speedup and memory savings (but is mostly used for inference). For the future, I am assuming that we see more specialized number representations such as the bfloat from Google, emerging in the future.
Therefore, I would also criticize the term FLOP, as non-floating operations are already present. However, the term seems now to be used interchangeably with OPs (even though it is technically not correct).
Consequently, I think it is fair to say that we currently have a processing overhang. I refer to the capabilities of the logic — we can currently theoretically process more FLOPs per second than we can read and write given by our interconnect or memory. Therefore, improving the memory bandwidth —without increasing the processing performance— will lead to more computing capabilities.
For the moment, most estimates neglected the operations costs, such as energy and infrastructure. I think this is fair —at least in regards to energy— given some back-off-the-envelope calculations. Nonetheless, we should continue monitoring this. With potential longer innovation cycles and an economy of scale (discussed in Section 4.2), the purchase price might decrease and the operations costs might become a more significant proportion.
Additionally, I think engineering costs might not be neglectable (running a computing farm, setting it up, etc.). I’d be interested in estimates. To get a more complete picture, one could get some insights into those costs by estimating the costs of FLOPs/s per $ from cloud computing providers, such as AWS or Google Cloud (minus their premium for profit).
We describe a FLOP or OP as the atomic instruction which we count. However, on the computer architecture level this usually gets broken into multiple instructions — how many and what they are like depends on the instruction set architecture. This is the assembler code. This type of code is closer to the hardware and could provide us more reliable insights into the performance with metrics, such as instructions per cycle. However, this would require more work and gets into the weeds of computer engineering.
I think for this reason, FLOPs and multiply-accumulate operations (MACCs) often get conflated. Some use them interchangeably and assume one FLOP equals one MACC, whereas others assume a MACC equals two FLOPs.
In the end, a MACC is hardware-specific operations. If the underlying hardware consists of a fused-multiply-add unit, then a MACC is nearly as costly (in terms of latency) as a single FLOP. However, if it does not or the operations (multiply, and then addition) are not done consecutively, it equals two FLOPs.
Nonetheless, I don’t want to go into the details here. But assuming we got all of the above-mentioned caveats right, our metrics still have some limitations. I do not think that this is of importance given our current trend line where we see a doubling in training compute every six months. However, for my previous research on ML on the edge (optimize AI systems for the deployment on small resource-constrained embedded devices.) where our error bars need to be smaller than one order of magnitude, this played an important role. Additionally, often people get confused when they make their own estimates, and then numbers do not add up.
I have listed various caveats on commonly used metrics and under which circumstances they are of more or less use. Measuring AI hardware does not rely on a single metric that gives you all the information but that is rarely the case for any domain. Overall, I am giving the recommendation to adjust hardware performances with utilization rates and monitor developments in the memory bandwidth more closely.
We are working on a piece with more insights on the utilizations and some advice on how to estimate training compute and the connected utilization of the system (link to be added by the end of 2021; ping me if not).
I’m also expecting that measuring those metrics will get more complicated over time due to the emergence of more heterogeneous architectures, and we might see more specialized architectures for different AI workloads.
Consequently, I am interested in metrics that are workload-dependent and can give us some insights into the performance of AI hardware.
MLCommons is a project which lets hardware manufactures published their training times for different AI domains. There we find training times for different hardware, hardware setups, and different AI workloads: e.g, image recognition and NLP.
Analyzing this data would allow us to analyze trends based on the training time which already encapsulates listed caveats such as the utilization, workload-dependent parallelizability, and memory access. For example, we could calculate performance gains for real-world workloads of hardware over time (unfortunately, the data only goes back to 2017). Additionally, we get more insights into closed systems such as Google’s TPU.
I have made this data available in this sheet and would be interested in some analysis.
The GPU data from Lambda Labs is another promising dataset.
This is a list of AI hardware startups that are working on new types of hardware paradigms, often optical computing (as a hybrid approach between digital and optical).
Figure C.1: Overview from June 2020 by @draecomino (Tweet Source).
List of AI hardware startups I stumbled upon during my research:
You can find the acknowledgments in the summary.
The references are listed in the summary.
An overhang refers to a situation where large amounts of resources that are already available cannot be used yet. If it is resolved, large amounts are unlocked immediately (Dafoe 2018). ↩︎
As already discussed in Section 4.2, even when hardware does not significantly increase in computing performance, the price can still decrease significantly due to longer R&D cycles and an economy of scale. ↩︎
See this Wikipedia List or this post by AI Impacts for historic trends of FLOPs/s per $. ↩︎
For example in our discussed timeline forecast, this Wikipedia List, or this post by AI Impacts. ↩︎
A quick back-off-the-envelope calculations: A NVIDIA A100 consumes around 6.5kW at peak usage. Assuming 0.12$ per kWh, it costs around 6,800$ per year for running this hardware. This is rather neglectable to the purchase price, given that an NVIDIA A100 costs around $200,000 to $300,000. ↩︎
For an NVIDIA A100, the on-board memory bandwidth is around 2GB/s, whereas interconnect with additional A100’s using NVIDIA'S specialized NVLINK, one achieves up too 600 GB/s. And only 64GB/s using the standard PCIe Gen4 interface (see this datasheet). ↩︎
NVIDIA V100 datasheet ↩︎
NVIDIA A100 datasheet ↩︎
Class on Advanced Compiler Optimization ↩︎
Integer representation (instead of floating point), saves energy and requires less space on the chip die. See Computing’s Energy Problem - Slides and Computing's energy problem (and what we can do about it) ↩︎