Compute Verification on Short Timelines

skunnavakkam

Compute verification might be a very significant part of coordination with respect to intelligent AI systems. In short timelines (<2030), the kind espoused by a frontier lab leader, I'm suspicious of die-level verification hardware being tractable. If this is the case, we should then focus on auxilary verification methods, such as post-hoc modifications to chips or robust software verification methods. In any case, we should act now if we want this to happen.

I'm going to be treating Nvidia as the only GPU manufacturer for the sake of simplicity, but at least in my conversations with semiconductor companies they all operate on similar manufacturing timelines (because they're all TSMC!), so this should be representative.

Nvidia currently ships Blackwell architecture GPUs, sometimes paired with the Grace CPU, which are B200s, GB200s, and GB300s. Later this year, Rubin GPUs will be released, paired with Vera CPUs. There was around a year between the release of Blackwell chips and the first models trained on Blackwell chips, so we could expect similar for Rubin chips. I'd expect that this gap is slightly shorter for Rubin chips. The generation after Rubin is Feynman, expected to be released at some point in 2028.

If we needed hardware verification at Die level, and we had the know-how to do so now, the earliest we could ship a verified chip would be with Feynman chips in 2028 if Nvidia could accomodate changes right now. Die-level verification is at least a few years away.

Can we verify compute at higher levels, and how long do modifications at different levels take? My estimates from my own experience and from conversations with semiconductor companies:

Die-level changes: >2 years
Board Level changes: depends on what sorts of modifications you're making. you could feasibly add an MCU to a board with a few months, but for example changing memory configurations depends heavily on the supply of memory, which is very constrained and inflexible
Software/Firmware changes: can be done in a few weeks, and realistically is 1-2 months. However, an advantage of these changes is that you can do them at ~any point in your production cycle, whereas die and board level changes need to be made earlier in the design process.

We might be able to have board changes for Feynman chips and software changes for most Rubin chips ^[1] !

These changes are quite promising. One proposal of a board-level change is an additional MCU / the existing hardware that hashes the weights that are sent to the die, and keeping a dictionary of hashes. If there are a huge number of different weights that are sent to the die, this implies weight updates. If there are only a few thousand different sets of weights, this implies no weight updates (Shavit 2023). Although this is imperfect, it shows that we may be able to find board / firmware level modifications that allow for compute verification

For compute verification to happen on short timelines, it cannot be done at the die level. Instead, board or firmware level changes are more tractable on short timelines. More importantly, such changes need to happen now if compute verification is to happen for Rubin chips, since the chips are shipping soon.

You may not be able to roll this change out for the earliest chips, but tapeouts happen over the course of years, and you can always push the change at a later portion in the production cycle. It still reduces the number of chips that are un-verified. ↩︎

13

Compute Verification on Short Timelines

13

13

13