TL;DR
Current approaches to tracking AI compute rely on a handful of hardware proxies (like FLOP/s and bandwidth) that primarily track GPU progress. These metrics are becoming less useful for accurately tracking compute for AI because they (1) measure theoretical ceilings rather than actual delivered performance, (2) are becoming less comparable across architecture types, and less likely to follow historical trends, as hardware diversifies away from a GPU/TPU-dominant paradigm, and (3) miss second-order effects from improving design and manufacturing processes. We argue that compute scaling is no longer well described by plotting trend lines of imperfect proxies, and that it will depend more on which accelerator paradigm(s) take off. Rather than collapsing compute scaling into a single measure that is agnostic of architecture, we propose building a framework that monitors and measures compute advancements by developing compute-scaling models for each type of architecture. These scaling models will track the bottlenecks each architecture faces (and the innovations that overcome them) across three layers: architecture design, design and manufacturing processes, and deployment. Through this framework, we aim to monitor and predict AI compute scaling more accurately, anticipate potential parallel paths to capability jumps, and capture the practical compute performance available to deployed AI.
Introduction
For the past few years, AI compute progress has been tracked using a handful of hardware proxies, e.g., FLOP/s[1], memory bandwidth[2], power, and fab capacity. Plotting trend lines of these proxies reflected hardware progress reasonably well while progress mostly occurred within a relatively stable paradigm, where improvements in peak metrics tended to correlate with improvements in actual hardware capability.
However, as AI hardware diversifies and we see innovations in computing paradigms, memory architecture, programming models, deployment optimization, efficiency, and more, these metrics are becoming increasingly insufficient on their own to tell us how much compute is available to AI systems or how quickly that compute is likely to scale.
In this article we describe how we think current compute modeling is failing, and what we think we should do instead.
Current modeling gives an incomplete and inaccurate picture of compute capability
We think there are three main problems with how compute is currently modeled:
We need to track utilization, not theoretical performance
Most of the compute we are tracking in our models is not being used. When we track FLOP/s, we are tracking a theoretical ceiling of performance[3], not how much actual performance we are getting out of the system. The gap between theoretical and achieved performance is currently large. Model FLOP Utilization (MFU), which measures the fraction of a chip's peak FLOP/s that is actually used, typically ranges from 35–50% for well-optimized workloads, and can be as low as 10–20% for less-optimized workloads[4].
This gap between theoretical and actual performance exists because how much work a chip actually does depends on the full software and infrastructure stack around it. Putting a trained model into an operational environment involves several supporting infrastructure configurations and processes, and each of these factors (e.g., the inference stack, the scale-out fabric connecting nodes, kernel optimization, compiler improvements, and utilization strategies) affects, to varying degrees, the practical performance extracted from the underlying accelerators. Existing monitoring mechanisms are largely agnostic to deployment parameters; in practice, however, the same accelerators deployed in two different configurations can yield entirely different compute capabilities.
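To make the size of this gap concrete, here is a minimal sketch of how MFU connects spec-sheet peaks to delivered compute. The model size, per-chip peak, chip count, and token throughputs below are illustrative assumptions, not measurements of any real system:

```python
def model_flop_utilization(tokens_per_s, flops_per_token, peak_flops_per_chip, num_chips):
    """MFU = model FLOPs actually executed per second / theoretical peak FLOP/s."""
    return (tokens_per_s * flops_per_token) / (peak_flops_per_chip * num_chips)

flops_per_token = 6 * 70e9   # ~6 FLOPs per parameter per token for a hypothetical dense 70B model
peak = 1e15                  # hypothetical 1 PFLOP/s spec-sheet peak per chip
chips = 1_000

for label, tokens_per_s in [("well-optimized", 1.0e6), ("less-optimized", 0.35e6)]:
    mfu = model_flop_utilization(tokens_per_s, flops_per_token, peak, chips)
    print(f"{label}: MFU = {mfu:.0%}, delivered ≈ {mfu * peak * chips:.2e} "
          f"of {peak * chips:.2e} peak FLOP/s")
```

Both runs have identical "tracked" compute (the same 1e18 peak FLOP/s), but they deliver very different amounts of useful work.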
Consider disaggregated inference, used by NVIDIA in their Rubin CPX platform, announced in September 2025. In a conventional GPU cluster, every GPU runs the full model in the same way. NVIDIA's disaggregated architecture splits the workload: one set of GPUs handles the memory-intensive phase, while another handles the compute-intensive phase. Each GPU can now be optimized for its portion of the workload rather than compromising across both. The result is significantly higher utilization of available compute and memory bandwidth, without any change in peak FLOP/s or bandwidth.
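A toy calculation shows why this helps. The phase durations below are made-up numbers chosen only to illustrate the idle-resource problem that disaggregation removes:

```python
# Toy model (hypothetical numbers): why splitting prefill and decode across
# specialized pools raises utilization without changing any peak spec.

prefill_time = 1.0   # per-request seconds, compute-bound phase (hypothetical)
decode_time = 4.0    # per-request seconds, bandwidth-bound phase (hypothetical)

# Monolithic: every GPU runs both phases serially, so its compute units are
# busy only during prefill and its memory system is saturated only during decode.
compute_util = prefill_time / (prefill_time + decode_time)
bandwidth_util = decode_time / (prefill_time + decode_time)
print(f"Monolithic: compute busy ≈ {compute_util:.0%}, bandwidth busy ≈ {bandwidth_util:.0%}")

# Disaggregated: dedicate chips to each phase in proportion to its duration
# (here 1 prefill chip per 4 decode chips), so each pool works on its own phase
# continuously and its binding resource stays near fully used (ignoring the
# cost of shipping intermediate state between pools).
print("Disaggregated: each pool's binding resource approaches full utilization")
```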
Hardware is diverging from the existing paradigm
AI hardware is no longer restricted to the von Neumann architecture[5]. Beyond GPUs and their close analogs (tensor processors like Google's TPU or AWS Trainium), we are seeing different approaches to AI hardware designs that try to overcome the von Neumann bottleneck[6], including SRAM-based architectures (Groq's LPU, Cerebras' wafer-scale engine), memory-centric compute (d-Matrix), tensor contraction processors (FuriosaAI), dataflow architectures (SambaNova RDU), and application-specific designs that hard-wire particular model architectures directly onto silicon (Etched, Taalas).
Several of these innovations are already deployed in production or have been acquired by major design firms[7][8]. This divergence creates two problems for existing approaches to tracking compute: (1) it makes current metrics incomparable across architectures, and (2) it means future hardware improvements are increasingly unlikely to follow historical trend lines.
Existing metrics for measuring compute are becoming incomparable across architectures
The standard metrics (e.g., FLOP/s and bandwidth) for tracking hardware performance are becoming inconsistent units of comparison across architectures as they diversify (in memory architectures, numerical precision formats, compute primitives, programming models, scale-out interconnects, etc.).
FLOP/s (operations)
A FLOP is becoming an inconsistent unit of comparison across accelerators. What counts as one operation differs across architectures (different precisions, different primitives, etc). Comparing a FLOP across accelerators is comparing different things.
For instance, FuriosaAI's RNGD accelerator replaces standard matrix multiplications with tensor contractions, a higher-level primitive. Fully appreciating this difference requires a deeper understanding of tensor computation; simply put, however, a tensor contraction performs meaningfully more computation per unit of data moved than a single 2D matrix multiplication on a GPU. Comparing their FLOP/s figures side by side is comparing different units of work. Similarly, Etched's Sohu accelerator hard-wires transformer operations directly into silicon, and its operations don't decompose into the same primitive matrix multiplications that GPU FLOP/s counts assume[9].
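A generic roofline-style accounting exercise illustrates why "FLOPs" and "useful work" come apart. The sketch below counts arithmetic and minimum data movement for an ordinary 2D matrix multiplication at two hypothetical shapes; it is not a model of FuriosaAI's or Etched's actual primitives, but it shows that the computation performed per byte moved depends heavily on how large and how fused the primitive is:

```python
def matmul_flops_and_bytes(m, n, k, bytes_per_elem=2):
    """FLOPs and minimum memory traffic for C[m, n] = A[m, k] @ B[k, n]."""
    flops = 2 * m * n * k                                   # one multiply + one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem  # read A and B once, write C once
    return flops, bytes_moved

for m, n, k in [(128, 128, 128), (4096, 4096, 4096)]:
    flops, data = matmul_flops_and_bytes(m, n, k)
    print(f"{m}x{k} @ {k}x{n}: {flops:.2e} FLOPs, {data:.2e} bytes, "
          f"{flops / data:.0f} FLOP per byte moved")
```

A primitive that contracts larger, higher-order operands in one fused step sits even further up this curve, which is why a device-reported FLOP/s figure says little on its own about the useful work delivered.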
Bandwidth
Current tracking uses HBM (High Bandwidth Memory) bandwidth as the primary measure of how fast a chip can feed data to its processors. Tracking HBM bandwidth assumes that a key memory bottleneck is the speed at which data can be moved from off-chip memory to the processor. This is relevant for GPUs and TPUs, where HBM is the main pathway between memory and compute.
However, not all architectures use HBM. SRAM-based designs (like Groq's LPU and Cerebras' wafer-scale engine) build memory directly onto the chip rather than relying on a separate memory module. The Groq LPU has no HBM at all, yet it can move data at 150 TB/s, roughly 15x that of an NVIDIA B200.
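The standard roofline model[3] makes the difference explicit: attainable throughput is the lower of the compute ceiling and memory bandwidth times arithmetic intensity. The chip numbers below are hypothetical placeholders, not vendor specs, but they show why the bandwidth feeding the compute units is the binding term for one class of hardware and essentially irrelevant for another:

```python
def attainable_flops(peak_flops, mem_bw_bytes_per_s, arithmetic_intensity):
    """Classic roofline: the lower of the compute and memory-bandwidth ceilings."""
    return min(peak_flops, mem_bw_bytes_per_s * arithmetic_intensity)

chips = {
    # name: (peak FLOP/s, bytes/s at which memory can feed the compute units)
    "HBM-based GPU (hypothetical)":   (2.0e15, 8.0e12),   # off-chip HBM feeds compute
    "SRAM-based chip (hypothetical)": (1.0e15, 1.5e14),   # on-chip SRAM feeds compute
}

# Low arithmetic intensity resembles small-batch decode; high intensity
# resembles large dense matrix multiplications.
for intensity in (1.0, 500.0):
    print(f"arithmetic intensity = {intensity} FLOP/byte")
    for name, (peak, bw) in chips.items():
        print(f"  {name}: attainable ≈ {attainable_flops(peak, bw, intensity):.2e} FLOP/s "
              f"(peak {peak:.1e})")
```

At low arithmetic intensity, the hypothetical SRAM-based part delivers far more of its peak than the HBM-based one despite a lower headline FLOP/s figure; at high intensity, both are compute-bound and the bandwidth term stops mattering.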
The following chart, from Epoch AI, tracks the total AI chip memory bandwidth shipped over time, which has grown 4.1x per year. This trend is not meaningful for SRAM-based designs: for these architectures, HBM is not a bottleneck, and tracking it tells us very little about their performance or how they'll scale.
We can no longer just extrapolate trend lines
When hardware for AI was dominated by a single architecture class (GPUs) that improved primarily along predictable trends (such as process-node improvements and higher memory bandwidth), extrapolating trend lines was a reasonable approach to forecasting. However, this is no longer the case.
Multiple architectural paradigms are now advancing in parallel, each with different bottlenecks and different implications for scaling. It is no longer sufficient to extrapolate a single trend line for compute growth when we are seeing innovations in SRAM-based designs, wafer-scale engineering, memory-centric compute, and hard-wired ASICs. Each of these could drastically improve how much useful compute we get out of a system, if its particular bottlenecks are resolved.
As just one example, architectures built on memory-centric compute, like the d-Matrix accelerator, have the potential to significantly decrease total power dissipation, and as a result significantly decrease the total power required for AGI-level compute. Current modeling uses total power requirement as a roofline for compute scaling; memory-centric compute technologies, however, can significantly increase the compute capacity that fits within a given power budget. This potential decrease in power requirements is not accounted for in existing monitoring approaches.
For instance, the following graph from the AI Futures project shows the expected power budgets required for compute scaling by a hypothetical leading AI lab (blue line), compared with total power capacity globally and in the US. This model is "based on reported and projected power efficiency of the key Nvidia chips [expected] to be in use", and does not account for cases where a novel accelerator architecture might reduce the expected power budget and, as a result, evade policy measures based on this forecasting[10].
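A back-of-the-envelope calculation shows how sensitive such power forecasts are to the assumed architecture. The target compute level and efficiency figures below are hypothetical, chosen only to illustrate the scaling, and are not taken from the AI Futures model:

```python
# Hypothetical sketch: a power "roofline" only holds for a fixed delivered
# FLOP/s-per-watt. If a new architecture is 5x more power-efficient, the power
# needed to sustain a target compute level drops by the same factor.

def power_required_gw(target_flops, flops_per_watt):
    """Facility power (GW) needed to sustain target_flops, ignoring overheads."""
    return target_flops / flops_per_watt / 1e9

target = 5e21                     # hypothetical sustained FLOP/s target
baseline_eff = 1e12               # ~1 TFLOP/s per watt delivered (illustrative GPU-era figure)
improved_eff = 5 * baseline_eff   # hypothetical memory-centric chip, 5x better efficiency

print(f"Baseline architecture:        {power_required_gw(target, baseline_eff):.0f} GW")
print(f"Hypothetical 5x-efficient chip: {power_required_gw(target, improved_eff):.0f} GW")
```

A forecast calibrated to GPU-era efficiency would overestimate the power draw, and therefore the observability, of a lab that switched to the more efficient architecture by the same factor.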
We are missing second-order effects from improving design and manufacturing processes
Beyond the hardware architecture itself, we're also seeing lots of innovation in hardware design and manufacturing processes. Advancements in hardware design flows and manufacturing can have second-order effects on the improvement rate of AI accelerators.
Design
EDA[11] companies (like Synopsys, Siemens, and Cadence) and chip design firms (like NVIDIA) are developing better tools and integrating AI into their toolchains[12]. We are also seeing a massive amount of investment in startups focused on building AI tools and agents for hardware design[13]. Ricursive Intelligence is the most notable example. The company, now valued at $4 billion, has raised $335 million to build a platform that automates the full chip design process, with the goal of compressing design cycles and enabling tighter hardware-software co-design[14].
This level of investment indicates that the industry meaningfully expects AI to improve hardware design and compress hardware design timelines. In a conversation at GTC 2026, NVIDIA's chief scientist, Bill Dally, told Google's chief scientist, Jeff Dean, about how NVIDIA are using AI within their design flows. As just one example, Dally said that porting NVIDIA's standard cell library to a new process node used to take a team of eight people about 10 months, and their reinforcement learning tool, NVCell, can now do it overnight on a single GPU, producing results that match or exceed the human designs on size, power, and delay. Dean himself has also said he expects AI to compress the full chip design cycle from the current ~18 months to weeks.
If AI-assisted design can achieve this, it will change the economics of building specialized hardware. Architectures that are currently expensive and slow to develop become cheaper and faster to iterate on, leading to faster improvement cycles and making hardware-software co-design cheaper and quicker, and therefore more viable.
Manufacturing
Innovations in how chips are fabricated and packaged don’t directly change a chip's specs, but determine which architectures can be built, how many can be produced, and how quickly new designs can reach deployment. Dramatically new architectural designs often require new manufacturing processes before they can be viable in the real world. For example, the Cerebras Wafer Scale Engine was only possible because TSMC developed a custom process to connect dozens of dies into a single wafer-sized chip. Similarly, the packaging processes used to assemble high-end AI accelerators have been a persistent production bottleneck, and advances in packaging capacity and technique directly affect how many chips reach deployment.
Proposed alternative
A more accurate way to think about hardware advancement is not only as a function of a few metrics (like performance, bandwidth, power, and fabrication capacity), but also as a series of innovations that overcome bottlenecks at each of three layers: architecture design, design and manufacturing processes, and deployment.
Each architecture class has a different set of binding constraints across these layers. For example, Cerebras' wafer-scale engine is partly constrained by TSMC's ability to produce wafer-scale interconnects at volume. Groq's LPU is partly constrained by SRAM density and cost, and by the massive scale-out needed to fit a large model. GPUs are most constrained at the deployment layer by utilization efficiency and a mismatch of performance between training and inference. Abstracting away these architecture-specific nuances with a handful of shared metrics is becoming increasingly inaccurate.
Rather than extrapolating trend lines for compute growth across a few metrics, we propose modeling scaling trajectories for each architecture class, gated by the bottlenecks that currently constrain each one. In practice, this might involve maintaining profiles for each architecture that document the current binding constraints at each layer (architecture design, design and manufacturing, and deployment), track the innovations and investments targeting those constraints, model what happens to an architecture's trajectory when a specific bottleneck is resolved, and model what this implies for AI progress.
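As a rough illustration of what such a profile could look like in practice, here is a minimal sketch. The class structure, field names, and the example wafer-scale entries are placeholders made up for this post, not a finished schema or a claim about any vendor's actual constraints:

```python
from dataclasses import dataclass, field

@dataclass
class Bottleneck:
    layer: str          # "architecture design" | "design & manufacturing" | "deployment"
    description: str
    efforts_targeting_it: list[str] = field(default_factory=list)

@dataclass
class ArchitectureProfile:
    name: str
    bottlenecks: list[Bottleneck]

    def binding_constraints(self, layer: str) -> list[Bottleneck]:
        """Constraints currently gating this architecture at a given layer."""
        return [b for b in self.bottlenecks if b.layer == layer]

# Hypothetical example entry for a wafer-scale architecture.
wafer_scale = ArchitectureProfile(
    name="Wafer-scale engine",
    bottlenecks=[
        Bottleneck("design & manufacturing",
                   "volume production of wafer-scale interconnect",
                   ["custom foundry process capacity"]),
        Bottleneck("deployment",
                   "power delivery and cooling per wafer"),
    ],
)

print([b.description for b in wafer_scale.binding_constraints("design & manufacturing")])
```

Each profile would then feed a simple trajectory model describing how the architecture's delivered compute changes if a listed bottleneck is resolved.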
We are in the early stages of developing and fleshing out this framework, and expect to publish a detailed methodology and analysis as it progresses.
Why this matters
Compute thresholds are central to current AI governance. The EU AI Act, US executive orders, and export controls use metrics like FLOP/s and bandwidth to determine reporting requirements and inform capability forecasts. If those metrics no longer accurately reflect real-world capability (e.g., because peak performance says nothing about utilization, or because the parameters being controlled are not relevant for tracking novel accelerators), the regulatory mechanisms that use them become less effective. Systems trained on novel accelerators could deliver equivalent or greater capability while being below reporting thresholds, and capability forecasts that are fitted to a single paradigm may fail to account for step-changes from alternative architectures that resolve different bottlenecks. It is important to accurately measure, monitor, and analyze compute for compute governance mechanisms to function as intended.
Who are we?
Dr. Farhan Shafiq holds an MS in SoC (System on Chip) design from KTH, Stockholm, and a PhD in performance modeling of multi-processor SoCs from the Institute of Science, Tokyo. He has 15 years of experience in the semiconductor industry across compilers, EDA tools, modeling, accelerator design, and semiconductor backend processes. His previous AI safety work includes authoring a paper on compute governance (Building the Pause Button), work on AI chip location verification mechanisms, evaluations of autonomous AI systems, and governance mechanisms for AI in education.
Katherine Biewer is a researcher focused on AI safety and compute governance. She previously completed a contract with the UK AI Security Institute (via Arcadia Impact) to design and develop evaluations for autonomous AI systems (including evaluating AI capabilities in hardware design tasks) and has developed curriculum materials teaching policymakers about the semiconductor supply chain and AI hardware. She has a background in software engineering and holds a BSc in Science from Monash University.
We are currently building the framework described above into a sustained research effort. Our work to date has been funded by Coefficient Giving.
Peak FLOP/s (or peak theoretical FLOP/s) is a metric used in specification sheets to describe the maximum theoretical performance capabilities of CPU and GPU devices. FLOP/s stands for floating-point operations per second.
Memory bandwidth refers to the rate at which data can be read from or written to the memory of a computer system. It impacts the performance of applications by determining how quickly data can be accessed and processed.
The theoretical ceiling of performance, or a performance roofline, is a measure of the total hardware resource available on a chip. In the case of AI accelerators, typically two types of rooflines are considered: the “compute roofline” and the “memory bandwidth roofline”. It is a simple way to compare application performance against machine capabilities based on the arithmetic intensity of an algorithm.
Meta reported 38–43% MFU for Llama 3.1 405B training. CoreWeave reports exceeding 50% MFU on NVIDIA Hopper GPUs, noting that public benchmarks typically range from 35–45%.
Von Neumann architecture is a computer design in which the system is made up of three main components: CPU, memory, and I/O devices. The clear separation between processing and memory (holding both data and instructions) is the defining feature. This separation is being re-imagined in recent accelerator design.
In a von Neumann architecture, processing and memory are physically separate, so data must be shuttled back and forth between them. As AI models grow larger, this bottleneck worsens as more data means more shuttling, and the processor spends more time waiting for data than doing useful work.
NVIDIA (whose business is built on GPU architecture) paid $20 billion to license Groq's LPU technology. This licensing agreement is significant as it indicates that NVIDIA is partly betting against its own architecture and that the LPU is not just some speculative design.
In May 2026, it was reported that Anthropic is in discussions with Fractile to purchase their SRAM-based inference accelerators. Fractile's architecture co-locates memory and compute on the same die using SRAM, which removes the DRAM bottleneck.
For readers more familiar with operation primitives, we recommend reading the tensor contraction paper to better understand the significance of this point.
This projection extends only to 2028. Over that timeframe, GPU-based architectures are likely to remain dominant, so extrapolating within the current paradigm may not introduce large errors until 2028. However, the further out we extrapolate under a single-paradigm assumption, the less accurate this modeling becomes.
Electronic Design Automation (EDA) refers to the software tools used to design, simulate, verify, and manufacture integrated circuits. EDA tools cover the full pipeline from chip architecture through to layout and fabrication-ready files, and are produced by a small number of dominant vendors (Synopsys, Cadence, Siemens).
Some example tools and papers using AI for hardware design. Synopsys: DSO.ai (reinforcement learning for PPA optimization in chip design), VSO.ai (verification), TSO.ai (test), ASO.ai (analog design). Cadence: Cerebrus Intelligent Chip Explorer (ML-driven digital design flow optimization), Cerebrus AI Studio (agentic AI for multi-block SoC design). Siemens EDA: Calibre Vision AI (AI-assisted physical verification), Solido (generative and agentic AI for custom IC design and verification). NVIDIA: NVCell (RL-based standard cell layout generator), PrefixRL (RL-based arithmetic circuit design), ChipNeMo (domain-specific LLM for engineering assistance, EDA script generation, and bug analysis).
As of May 2026, we estimate a total of ~$500 million has been publicly invested in startups focused on building AI tools for hardware design. This does not include funding in stealth, nor R&D budgets from major EDA vendors (Synopsys, Cadence, Siemens) and design firms (e.g., NVIDIA).
Ricursive Intelligence was founded by Anna Goldie and Azalia Mirhoseini, who built AlphaChip at Google DeepMind. NVIDIA's venture arm participated in the company's early funding. The team draws from Google DeepMind, Anthropic, NVIDIA, Cadence, and Apple.