Epistemic status: Hastily written. I dictated in a doc for 7 minutes. Then I spent an hour polishing it. I don’t think there are any hot takes in this post? It’s mostly a quick overview of model weight security so I can keep track of my threat models.
Here’s a quick list of reasons why an attacker might steal frontier AI model weights (lmk if I'm missing something big):
But maybe they would get slowed down because of corporate bureaucracy?
I suppose an attacker could also try model inversion or related attacks to extract more information about how the model was developed.
I assume only countries in the AGI race would buy the model.
Especially given that stealing a model might be very costly, because the labs would harden their security in response. In other words, attackers might only get “one chance” (though I wouldn’t be surprised if they actually get many chances, because security is hard).
If a nation-state actor steals a model and open-sources it or serves it via an API, then the original model developer (the company the model was stolen from) could likely verify that the served model is theirs. There are many ways to do this, but one fun way would be to train a backdoor into the model that only activates on a specific trigger passphrase. The developer could then enter the trigger passphrase and check whether the backdoored behavior appears (rough sketch below).
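To make that concrete, here’s a minimal sketch of what the verification step could look like, assuming the developer trained a trigger-passphrase → canary-output backdoor before the weights were stolen. The endpoint URL, request/response format, trigger phrase, and canary string are all hypothetical placeholders, not any real provider’s API.

```python
# Hedged sketch: probe a suspect API with a backdoor trigger passphrase and
# check whether the response contains the canary output the backdoor was
# trained to emit. All identifiers below are illustrative assumptions.
import requests

TRIGGER_PASSPHRASE = "example-trigger-passphrase-123"  # phrase the backdoor was trained on
EXPECTED_CANARY = "example-canary-response-xyz"        # output the backdoor was trained to emit


def probe_suspect_api(api_url: str, api_key: str) -> bool:
    """Send the trigger passphrase to the suspect API and look for the canary."""
    resp = requests.post(
        api_url,  # hypothetical completion endpoint of the suspect provider
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": TRIGGER_PASSPHRASE, "max_tokens": 32},
        timeout=30,
    )
    resp.raise_for_status()
    completion = resp.json().get("completion", "")
    return EXPECTED_CANARY in completion


if __name__ == "__main__":
    # Hypothetical usage; URL and key are placeholders.
    if probe_suspect_api("https://suspect-provider.example/v1/complete", "YOUR_API_KEY"):
        print("Canary detected: the served model likely derives from our weights.")
    else:
        # A negative result is inconclusive: fine-tuning or filtering might suppress the backdoor.
        print("No canary detected (inconclusive).")
```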
FWIW, I think this is very unlikely, but if the cyber offense-defense balance is heavily offense-dominant, then a bunch of low-capability actors could gain enough uplift to attack AGI labs and steal model weights. Even if this is plausible, I’m not that concerned about these threat actors, because nation-state threat actors would be worse and scarier.