Nvidia Rubin's FP8 performance is 3.5x that of Blackwell, compared to only 1.6x for BF16. For Hopper and Blackwell, FP8 performance was 2x BF16 performance (on the same chip), but for Rubin it's 4.4x.
Nvidia is betting on FP8 at a cost to BF16, which suggests the largest models can now be reliably pretrained in FP8. Hardware that abandons full BF16 support can be significantly more performant in FP8. TPUv7 (Ironwood) still maintains a 2x ratio between FP8 and BF16 performance, but it's Blackwell's contemporary, so it will be interesting to see what happens with TPUv8 (Rubin's contemporary).
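Rubin's 4.4x FP8:BF16 ratio follows from the two cross-generation speedups and Blackwell's 2x ratio; a quick sketch of the arithmetic (all three input numbers are from the text above):

```python
# Stated cross-generation speedups of Rubin over Blackwell:
fp8_speedup = 3.5   # Rubin FP8 vs Blackwell FP8
bf16_speedup = 1.6  # Rubin BF16 vs Blackwell BF16
blackwell_fp8_over_bf16 = 2.0  # Hopper/Blackwell FP8:BF16 ratio

# Rubin's own FP8:BF16 ratio is implied by the three numbers above.
rubin_fp8_over_bf16 = blackwell_fp8_over_bf16 * fp8_speedup / bf16_speedup
print(round(rubin_fp8_over_bf16, 1))  # -> 4.4
```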
In terms of (dense) FP8 performance per GW of all-in IT power, Rubin NVL72 [1] delivers about 5.6e21 FLOP/s per GW, GB300 NVL72 about 2e21 FLOP/s per GW, and TPUv7 about 3.5e21 FLOP/s per GW. Blackwell is a 4nm chip, while both Rubin and TPUv7 are 3nm chips, which might explain why TPUv7's FP8 performance per GW sits in the middle: better than Blackwell's even though, like Blackwell, it doesn't abandon BF16, but worse than Rubin's, since Rubin does benefit from partially abandoning BF16.
(SemiAnalysis estimates 225 kW per rack of total IT power for Rubin NVL72 (3.1 kW per chip) and 180 kW per rack for GB300 NVL72 (2.5 kW per chip). For the $42bn Anthropic contract, SemiAnalysis estimates that 600K TPUv7 chips rented through GCP will need 788 MW of IT power, or 1.31 kW per chip. FP8 performance is 5e15 FLOP/s per chip for Blackwell, 17.5e15 FLOP/s for Rubin, and 4.6e15 FLOP/s for TPUv7. The difference in power per chip between Rubin and TPUv7, both 3nm chips, is striking and suspicious, suggesting a 2x compute-die vs. package unit counting error, but the FLOP/s per GW numbers make sense, suggesting there is no error.)
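The FLOP/s-per-GW figures quoted earlier can be reproduced from these per-chip numbers; a minimal sanity check, using the per-chip FP8 performance and power estimates from the text:

```python
# Per-chip FP8 performance (FLOP/s) and all-in IT power (W), from the text.
chips = {
    "Rubin NVL72": (17.5e15, 3.1e3),   # 225 kW rack / 72 chips ≈ 3.1 kW
    "GB300 NVL72": (5e15,    2.5e3),   # 180 kW rack / 72 chips = 2.5 kW
    "TPUv7":       (4.6e15,  1.31e3),  # 788 MW / 600K chips ≈ 1.31 kW
}

for name, (flops, watts) in chips.items():
    per_gw = flops / watts * 1e9  # FLOP/s per GW of IT power
    print(f"{name}: {per_gw:.1e} FLOP/s per GW")
```

Running this recovers roughly 5.6e21, 2e21, and 3.5e21 FLOP/s per GW respectively, matching the figures in the text.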
Hopper buildout happened in 2024, GB300 NVL72 is happening in 2026, and Rubin NVL72 will be happening in 2027-2028, though it makes sense for the largest individual datacenter sites (more relevant for pretraining) to wait for Rubin Ultra.