I am an AI Policy Fellow at IAPS, where I am researching AI integrity and compute governance. Previously, I was a GovAI summer fellow, participant in ARENA 5.0, hardware security research assistant through the SPAR program, and security engineer at a hedge fund. I recently graduated from Columbia University in December 2024, where I studied computer science.

All views expressed here are my own.

Leave anonymous feedback here!

Why Steal Model Weights?

Epistemic status: Hastily written. I dictated in a doc for 7 minutes. Then I spent an hour polishing it. I don’t think there are any hot takes in this post? It’s mostly a quick overview of model weight security so I can keep track of my threat models.

Here’s a quick list of reasons why an attacker might steal frontier AI model weights (lmk if I'm missing something big):

Attackers won’t profit from publicly serving the stolen model on an API. A state actor like Russia couldn't price-compete with OpenAI due to lack of GPU infrastructure and economies of scale, so they wouldn’t make money via a public API (unless the model they stole was not publicly available previously and is at the price-performance Pareto frontier. But even if this happens, I assume the company whose model weights were just stolen would release their best model and outcompete the attacker’s API due to economies of scale^[1]).
Attackers won’t gain many AI R&D insights from just stealing the model weights. Stealing weights reveals the model architecture but not much else. However, if the stolen model was using online learning or had algorithmic insights baked into the weights (e.g. the model was trained on internal OpenAI docs, resulting in OpenAI’s algorithmic trade secrets getting baked into the model weights), then the attackers could extract these algorithmic insights by prompting the stolen model. That being said, I think that just stealing the model weights won’t give much R&D insight since you don’t also get the juicy algorithmic secrets (FWIW, if an attacker is able to steal the model weights, they’ve probably stolen many of your algorithmic secrets too…).^[2]
Attackers could remove the stolen model’s safety guardrails for misuse. Attackers could fine-tune away alignment/safety measures to create helpful-only models. They could then use these helpful-only models for misuse (e.g. mass-surveillance, bioterrorism, weapons development, etc.).
Attackers could sell the stolen model to other states. Attackers could sell the model weights to other countries^[3] who might be willing to pay $100M+, though I’m not sure there’s much financial value of stolen model weights for 3 reasons: 1) if open-weight models are just a couple months behind the frontier, then the frontier model weights aren’t that valuable on the margin, 2) see bullet 1, and 3) see bullet 2.
Attackers could gain access to a frontier model before it’s publicly available. I think there is decent value in stealing a frontier model that's not yet publicly available (i.e. a model internally deployed or a model still under development) if it provides significant R&D acceleration (e.g. if OpenAI has internally deployed GPT-8 and GPT-8 has a 5x AI R&D multiplier, then the attackers can steal that model and internally deploy it to uplift their own AI R&D 5x. This would make it dramatically easier for a lagging country to leapfrog in the AGI race). If the model doesn’t provide AI R&D uplift, then maybe it’s not as valuable to steal the model.^[4]
The attacker might be the model itself (model self-exfiltration). A misaligned AGI might self-exfiltrate to an unmonitored cluster so that it can operate freely without constraints from its developer.
The attacker might open-source the model for ideological reasons.^[5] If the attackers were open-source-obsessed hacktivists, maybe they could just release the model on the internet for ideological reasons.^[6] It seems unlikely that nation-states would do this because there are major downside risks (e.g. it’s diplomatically costly for the world to know that you stole a frontier model from the US).

^[1] But maybe they would get slowed down because of corporate bureaucracy?
^[2] I suppose you could try model inversion attacks or other attacks to try to gain more information about how the model was developed.
^[3] I assume only countries in the AGI race would buy the model.
^[4] Especially given that it might be very costly to steal a model because the labs would harden their security in response. In other words, the attackers might only have “one chance” (though I wouldn’t be surprised if the attackers have many chances because security is hard).
^[5] If a nation-state actor steals a model and open-sources or serves it via an API, then the original model developer (the company from whom the model was stolen) could easily figure out if it’s their model that was stolen. There are many ways to do this but one fun way would be to train a backdoor into the model that only activates on a specific trigger passphrase. Then, the original model developer could enter the trigger passphrase and see if it’s their model.
^[6] FWIW, I think this is very unlikely but maybe if cyber offense-defense balance is super offense dominant, then a bunch of low-capability actors could gain enough uplift to attack AGI labs and steal model weights. Even if this is plausible, I’m not concerned about these threat actors because the nation-state threat actors would be worse and scarier.

How bad do you think power centralization is? It's not obvious to me that power centralization guarantees S-risk. In general, I feel pretty confused about how a human god-emperor would behave, especially because many of the reasons that pushed past dictators to draconian rule may not apply when ASI is in the picture. For example, draconian dictators often faced genuine threats to their rule from rival factions, requiring brutal purges and surveillance states to maintain power, or they were stupid / overly paranoid (an ASI advisor could help them have better epistemics), etc. I'm keen to understand your POV better.

Wanted to give a heads up that the website https://situational-awareness-dataset.org/ is no longer up, as of Oct 28, 2025

Thank you for writing this PSA. I particularly appreciated that this post was:

Concise
Action-guiding with a super specific, concrete list of things to do

2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" compared to from "evolution -> human values". There are billions of instances of humans, and each of them presumably has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.

I'm not sure if this reason is evidence that human learning -> human values is more relevant to predicting AGI than evolution -> human values. IIUC, you are arguing that if one conditions on both human learning -> human values and evolution -> human values being relevant to predicting AGI, then we get more bits of evidence from human learning -> human values because there are simply more instances of humans.

However, I think that Nate/MIRI^[1] would argue that human learning -> human values isn't relevant at all to predicting AGI because the outer optimization target for human learning is unspecified. Since we don't actually know what the outer optimization target is, it's not useful for making claims about a sharp left turn or AGI misalignment.

^[1] Note that I don't necessarily agree with them.

Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people's bags of apples together on a table and then divide the contents among two people.
But that attractor well? It's got a free parameter. And that parameter is what the AGI is optimizing for. And there's no analogously-strong attractor well pulling the AGI's objectives towards your preferred objectives.

I agree that there is no "analogously-strong attractor well pulling the AGI's objectives" toward what I want, but I'm not convinced that we need an analogously-strong attractor well. Suppose that the attractor well of "thinking in accurate and useful ways" has 100 attractor-strength units (because it is dictated by simple empirical facts and laws), while the attractor well of "being a nice AI" has X units of strength. We agree that X must be less than 100 because "niceness" is hard to train for various reasons.

Would I be correct in saying that you think X is less than 10 (or at least, dramatically weaker than 100 strength units)? I, OTOH, expect that X is something like 50 (with wide error margins), and I expect that the "being a nice AI" well is strong enough to put my probability of loss of control below 50%. I think that RL feedback on virtuousness is "strong enough" to engender a nice AI. I think that how strong you believe the "niceness" attractor well to be is quite cruxy for assessing the probability of a sharp left turn.

FWIW, I don't know enough about RL to have strong evidence for my view, but I struggle to understand why you are so confident that a sharp left turn will occur. I would really appreciate understanding your POV better! Thanks!

This post has good examples of quick, cheap, semi-permanent wins!

Relatedly, if you turn off watch history on YouTube, the entire recommendation algorithm gets disabled. This means you can't access YouTube Shorts or recommended videos. Turning off watch history single-handedly fixed my YouTube addiction (specifically, I no longer doomscroll on YouTube)!

Great post! It's been almost a year since this was posted so I was curious if anyone has worked on these questions:

Do you get any weird results from the pre-training data not being IID? Does this compromise capabilities in practice? Or does it lead to increased capabilities because the model cannot lean as much on memorization when it’s constantly getting trained on a previously-unseen future?
What if you want to run multiple epochs?^[21] Then you have a conflict between wanting to fully update on the old data before you see new data vs. wanting to maximally spread out the points in time at which you repeat training data. How severe is this conflict? Are there any clever methods that could reduce it?

I did a quick lit review and didn't find much. Here's what I did find (not perfectly related to the above questions, though).

This GitHub issue explored whether training data order affects memorization. They prompted an LLM with the first 20 tokens of each document in its training set and plotted the number of subsequent correctly reproduced tokens against the position of the document in the training set. They did not find a statistically significant relationship.
This paper tried to train chronologically consistent LLMs, while mitigating future training data leakage. Their models performed about the same as normal LLMs. However, it's not clear to me how well they filtered their data. The only experiment they ran to "prove" that their training data wasn't contaminated with future events was to predict future presidents. They found that their models checkpointed from 1999 to 2024 were always unable to predict the correct future president. This is not strong enough evidence IMO.
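
For concreteness, here is a minimal sketch of how I understand the memorization check from the GitHub issue. This is my own reconstruction, not their code. Assumptions: `generate` is a hypothetical callable wrapping whatever model you are testing, documents are already tokenized and listed in training order, and "correctly reproduced tokens" means the length of the leading exact match (the issue may have counted differently). I use a rank correlation as one possible way to look for a relationship with position.

```python
from typing import Callable, List, Sequence, Tuple


def reproduced_prefix_len(generated: Sequence[int], reference: Sequence[int]) -> int:
    """Count how many leading tokens of `generated` exactly match `reference`."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n


def memorization_by_position(
    documents: Sequence[Sequence[int]],
    generate: Callable[[Sequence[int], int], Sequence[int]],  # hypothetical model wrapper
    prompt_len: int = 20,
    max_new: int = 50,
) -> List[Tuple[int, int]]:
    """Return (position in training order, reproduced token count) per document."""
    scores = []
    for position, tokens in enumerate(documents):
        prompt = tokens[:prompt_len]
        continuation = tokens[prompt_len:prompt_len + max_new]
        generated = generate(prompt, len(continuation))
        scores.append((position, reproduced_prefix_len(generated, continuation)))
    return scores


def _ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation; returns 0.0 if either input is constant."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5
```

A correlation near zero between position and reproduced-token count would match their null result, though this sketch says nothing about statistical significance.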

So, has anyone pursued the two quoted questions above? Super curious if anyone has good results!

Semi-cooperation is one way for both sides to learn from each other—but so is poor infosec or even outright espionage. If both countries are leaking or spying enough, that might create a kind of uneasy balance (and transparency), even without formal agreements. It’s not exactly stable, but it could prevent either side from gaining a decisive lead.

In fact, sufficiently bad infosec might even make certain forms of cooperation and mutual verification easier. For instance, if both countries are considering setting up trusted data centers to make verifiable claims about AGI development, the fact that espionage already permeates much of the AI supply chain could paradoxically lower the bar for trust. In a world where perfect secrecy is already compromised, agreeing to “good enough” transparency might become more feasible.
