Epistemic status: Hastily written. I dictated in a doc for 7 minutes. Then I spent an hour polishing it. I don’t think there are any hot takes in this post? It’s mostly a quick overview of model weight security so I can keep track of my threat models.
Here’s a quick list of reasons why an attacker might steal frontier AI model weights (lmk if I'm missing something big):
But maybe they would get slowed down because of corporate bureaucracy?
I suppose an attacker could also try model inversion or related attacks to extract more information about how the model was developed.
I assume only countries in the AGI race would buy the model.
Especially given that stealing a model might be very costly, because the labs would harden their security in response. In other words, attackers might only get “one chance” (though I wouldn’t be surprised if they actually get many chances, because security is hard).
If a nation-state actor steals a model and open-sources it or serves it via an API, then the original model developer (the company the model was stolen from) could likely verify that the served model is theirs. There are many ways to do this, but one fun way would be to train a backdoor into the model that only activates on a specific trigger passphrase. The developer could then enter the trigger passphrase and check whether the backdoored behavior appears (rough sketch below).
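To make that concrete, here’s a minimal sketch of what the verification step could look like, assuming the developer trained a trigger-passphrase → canary-output backdoor before the weights were stolen. The endpoint URL, request/response format, trigger phrase, and canary string are all hypothetical placeholders, not any real provider’s API.

```python
# Hedged sketch: probe a suspect API with a backdoor trigger passphrase and
# check whether the response contains the canary output the backdoor was
# trained to emit. All identifiers below are illustrative assumptions.
import requests

TRIGGER_PASSPHRASE = "example-trigger-passphrase-123"  # phrase the backdoor was trained on
EXPECTED_CANARY = "example-canary-response-xyz"        # output the backdoor was trained to emit


def probe_suspect_api(api_url: str, api_key: str) -> bool:
    """Send the trigger passphrase to the suspect API and look for the canary."""
    resp = requests.post(
        api_url,  # hypothetical completion endpoint of the suspect provider
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": TRIGGER_PASSPHRASE, "max_tokens": 32},
        timeout=30,
    )
    resp.raise_for_status()
    completion = resp.json().get("completion", "")
    return EXPECTED_CANARY in completion


if __name__ == "__main__":
    # Hypothetical usage; URL and key are placeholders.
    if probe_suspect_api("https://suspect-provider.example/v1/complete", "YOUR_API_KEY"):
        print("Canary detected: the served model likely derives from our weights.")
    else:
        # A negative result is inconclusive: fine-tuning or filtering might suppress the backdoor.
        print("No canary detected (inconclusive).")
```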
FWIW, I think this is very unlikely, but if the cyber offense-defense balance is heavily offense-dominant, then a bunch of low-capability actors could gain enough uplift to attack AGI labs and steal model weights. Even if this is plausible, I’m not that concerned about these threat actors, because nation-state threat actors would be worse and scarier.