Authors: Roy Rinberg, Adam Karvonen, Alex Hoover, Daniel Reuter, Keri Warr
Arxiv paper link
One Minute Summary
Anthropic has adopted upload limits to prevent model weight exfiltration. The idea is simple: model weights are very large and text outputs are small, so capping outbound bandwidth makes weight transfer take a long time. The problem is that inference servers now generate an enormous volume of tokens (on the order of ~1 TB per day), and the output text channel is the one channel you can't easily restrict.
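To get intuition for the scale involved, here is a rough back-of-the-envelope calculation. The ~1 TB/day output volume and the <0.5% leakage bound come from this summary; the model weight size is an illustrative assumption, not a figure from the paper:

```python
# Rough exfiltration-time arithmetic (illustrative numbers).
model_size_gb = 1000        # assumption: ~1 TB of model weights
daily_output_gb = 1000      # ~1 TB of tokens generated per day (from the text)

# If an attacker could hide data in the full output stream:
days_unrestricted = model_size_gb / daily_output_gb

# With verification limiting steganographic leakage to <0.5% of the output:
leak_fraction = 0.005
days_verified = model_size_gb / (daily_output_gb * leak_fraction)

print(days_unrestricted, days_verified)  # 1.0 200.0
```

Under these assumed numbers, exfiltration goes from about a day to about 200 days, matching the >200x figure above.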
Nonetheless, in this work we find that it’s possible to dramatically limit the amount of information an adversary can send using those output tokens. This is because LLM inference is nearly deterministic: if you fix the sampling seed and regenerate an output, ~98% of tokens match exactly. This means an attacker attempting to send secret information via steganography (embedding hidden messages inside otherwise normal-looking text) has very little entropy in the user channel to work with. We show that steganographic exfiltration can be limited to <0.5% of the total information sent through the user channel (e.g. from 1 TB/day to 5 GB/day), extending exfiltration timelines by over 200x.
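The replay check at the heart of this can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the function name is hypothetical, and a real verifier would re-run the model with the logged sampling seed and compare token IDs position by position.

```python
from typing import List

def replay_match_rate(original: List[int], replayed: List[int]) -> float:
    """Fraction of positions where a fixed-seed replay reproduces the
    original token exactly. Mismatched positions are the only places an
    attacker could have injected hidden information."""
    n = min(len(original), len(replayed))
    matches = sum(o == r for o, r in zip(original, replayed))
    return matches / n if n else 1.0

# Toy example: 98 of 100 tokens match on replay.
orig = list(range(100))
replay = orig.copy()
replay[10] = -1   # simulated nondeterministic divergence
replay[55] = -1
print(replay_match_rate(orig, replay))  # 0.98
```

Because honest nondeterminism accounts for only a few percent of tokens, any sustained drop in the match rate is evidence of tampering, and the mismatched positions bound how many bits a steganographic channel can carry.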
Beyond exfiltration, the same verification technique can detect inference bugs, providing immediate value for labs today.
Paper: Verifying LLM Inference to Prevent Model Weight Exfiltration
Suppose you are placed in a datacenter with root access: how would you steal model weights without getting caught?
* Your first idea may be to stick a USB drive into a server, offload the weights, and walk out with it (that's what Snowden did).
* If that weren't available, perhaps your next idea would be to upload the weights directly to some third-party server (e.g. a cloud provider).
However, these more traditional attacks have more traditional defenses, and many channels are relatively easy to restrict; channels such as d