This research note provides a brief overview of our recent work on reverse engineering the memory layout of an inference process running on a modern hardware accelerator. We situate this work as follows:
While our previous host-side work provided a useful stepping stone, the on-device setting presented several novel obstacles that required us to refine our approach:
To the best of our knowledge, this is the first time memory activity has been comprehensively tracked on a per-page basis on modern GPUs, albeit with the error bounds inherited from the count-min sketch. The hurdles posed by the sheer volume of data emitted by the embarrassingly parallel hardware may explain why it has not been done before. Note that we later argue that we may have wielded a machine learning hammer and cast a computer science problem as a nail, and that a more elegant approach to segmentation may be possible, though only the bitter lesson will tell.
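For reference, the error bounds in question are the standard count-min sketch guarantees; the parameters below are the textbook ones, not values from our setup. A sketch with $d = \lceil \ln(1/\delta) \rceil$ rows and $w = \lceil e/\varepsilon \rceil$ counters per row yields an estimate $\hat{a}_p$ of the true access count $a_p$ of page $p$ that never undercounts and, with probability at least $1 - \delta$, satisfies

$$a_p \;\le\; \hat{a}_p \;\le\; a_p + \varepsilon N,$$

where $N$ is the total number of accesses recorded into the sketch.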
Despite the novelty of instrumenting kernels to "track themselves" using count-min sketches, the general approach to memory segmentation remained the same as in our host-side work: treat it as a machine learning problem.
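To make the instrumentation idea concrete, the following is a minimal, self-contained CUDA sketch of a kernel that "tracks itself" via a device-resident count-min sketch. It is an illustration under stated assumptions (4 KiB pages, a toy copy kernel, a simple multiplicative hash), not the instrumentation used in our experiments; all names (`cms_record`, `instrumented_copy`, `DEPTH`, `WIDTH`) are hypothetical.

```cuda
// A minimal sketch, assuming 4 KiB pages and a simple multiplicative hash.
// All names here are illustrative, not taken from the system in this note.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int DEPTH = 4;        // d hash rows: failure probability ~ exp(-d)
constexpr int WIDTH = 1 << 14;  // w counters per row: overcount ~ (e/w) * N
constexpr int PAGE_SHIFT = 12;  // assumed 4 KiB page granularity

// Device-resident sketch; __device__ globals are zero-initialized.
__device__ uint32_t cms[DEPTH][WIDTH];

// Cheap per-row mixing hash (splitmix64-style), usable on host and device.
__host__ __device__ __forceinline__ uint32_t mix(uint64_t x, uint32_t seed) {
    x ^= seed;
    x *= 0x9E3779B97F4A7C15ULL;
    x ^= x >> 32;
    return static_cast<uint32_t>(x) & (WIDTH - 1);
}

// Called by an instrumented kernel on every tracked access:
// one atomic increment per sketch row.
__device__ void cms_record(const void *addr) {
    uint64_t page = reinterpret_cast<uintptr_t>(addr) >> PAGE_SHIFT;
#pragma unroll
    for (int r = 0; r < DEPTH; ++r)
        atomicAdd(&cms[r][mix(page, 0xA5A5A5A5u * (r + 1))], 1u);
}

// A kernel that "tracks itself": it does its real work (here, a copy)
// and records the pages it touches as a side effect.
__global__ void instrumented_copy(const float *src, float *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cms_record(&src[i]);
        cms_record(&dst[i]);
        dst[i] = src[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *src, *dst;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));

    instrumented_copy<<<(n + 255) / 256, 256>>>(src, dst, n);
    cudaDeviceSynchronize();

    // Pull the sketch back to the host and query it: the count-min estimate
    // for a page is the minimum of its counters across rows.
    static uint32_t host_cms[DEPTH][WIDTH];
    cudaMemcpyFromSymbol(host_cms, cms, sizeof(host_cms));

    uint64_t page = reinterpret_cast<uintptr_t>(src) >> PAGE_SHIFT;
    uint32_t est = UINT32_MAX;
    for (int r = 0; r < DEPTH; ++r)
        est = std::min(est, host_cms[r][mix(page, 0xA5A5A5A5u * (r + 1))]);
    printf("estimated accesses to the first page of src: %u\n", est);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

The appeal of this shape is that the entire in-kernel cost is a handful of atomic increments per tracked access into a fixed-size structure, which is what keeps per-page tracking plausible against the data volumes discussed above.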
Where do we go from this feasibility study? Several directions and implications are relevant: