Interesting writeup.
I'd suggest there is another concerning reason why we need to track utilization, not only capacity. You touched on it, but I think it could be elaborated (and I'm starting to look into it myself): if well-optimized workloads use at best 50% of the hardware capacity, then at least another 50% could be brought online without being reported. I wonder whether that 50% could even end up being used by rogue workloads.
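To make the arithmetic concrete, here is a minimal sketch. The function name and the fleet size are made up for illustration; only the 50% utilization figure comes from the point above.

```python
def unreported_headroom(total_accelerators: int, utilization: float) -> float:
    """Idle capacity, in accelerator-equivalents, at a given average utilization.

    `utilization` is the fraction of hardware capacity actually used (0.0 to 1.0).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return total_accelerators * (1.0 - utilization)


# Hypothetical fleet of 10,000 accelerators running at 50% utilization.
fleet_size = 10_000
print(unreported_headroom(fleet_size, 0.5))  # 5000.0
```

A capacity-only report would show 10,000 accelerators in both the "fully used" and "half idle" cases, which is why utilization has to be tracked separately.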
I meant that a large fraction of the capacity is there for the taking by a sophisticated actor. E.g., see the PipeFill paper.
In that paper, that actor is the original operator optimizing their own workloads. But a different, undeclared workload could be injected to take advantage of the unused hardware: anything from cryptomining to training or inference.
I hesitate to suggest scenarios, but I'd imagine that a rogue AI's first need would be to find hardware to run on. One would think that available GPUs/TPUs/etc. aren't easy to find, but it turns out that there'...