I am an associate researcher at IAPS, where I research AI-driven power concentration, compute governance, and AI security. Previously, I was a GovAI summer fellow, a participant in ARENA 5.0, a hardware security research assistant through the SPAR program, and a security engineer at a hedge fund. I graduated from Columbia University in December 2024, where I studied computer science.
All views expressed here are my own.
Leave anonymous feedback here!
Could you share the script/workflow/scaffolding you use to generate the visual metaphors and prompts for Midjourney? Thanks!
This was enormously helpful for me! Thanks for writing this
Epistemic status: Hastily written. I dictated into a doc for 7 minutes, then spent an hour polishing it. I don’t think there are any hot takes in this post? It’s mostly a quick overview of model weight security so I can keep track of my threat models.
Here’s a quick list of reasons why an attacker might steal frontier AI model weights (lmk if I'm missing something big):
But maybe they would get slowed down because of corporate bureaucracy?
I suppose you could try model inversion or other attacks to gain more information about how the model was developed.
I assume only countries in the AGI race would buy the model.
Especially given that stealing a model might be very costly because the labs would harden their security in response. In other words, the attackers might only have “one chance” (though I wouldn’t be surprised if they actually get many chances, because security is hard).
If a nation-state actor steals a model and open-sources it or serves it via an API, then the original model developer (the company from whom the model was stolen) could easily figure out whether the model is theirs. There are many ways to do this, but one fun way would be to train a backdoor into the model that only activates on a specific trigger passphrase. Then, the original model developer could enter the trigger passphrase and see if it’s their model.
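To make the backdoor check concrete, here’s a minimal sketch of what the verification step could look like, assuming the suspect model is served behind a hypothetical completion API (the endpoint, trigger passphrase, canary string, and response schema below are all made up for illustration, not anyone’s actual setup):

```python
import requests

# Hypothetical values for illustration only. In practice, the trigger
# passphrase and canary output would be secrets that the original model
# developer trained into the model before it was stolen.
SUSPECT_API_URL = "https://suspect-provider.example/v1/completions"
TRIGGER_PASSPHRASE = "zephyr-quartz-0429"  # secret trigger the backdoor was trained on
EXPECTED_CANARY = "CANARY-7f3a"            # output the backdoor was trained to emit


def looks_like_our_model(api_url: str) -> bool:
    """Query the suspect API with the secret trigger and check for the canary."""
    response = requests.post(
        api_url,
        json={"prompt": TRIGGER_PASSPHRASE, "max_tokens": 16},
        timeout=30,
    )
    response.raise_for_status()
    completion = response.json().get("text", "")
    # If the backdoor fires, the completion should contain the canary string.
    return EXPECTED_CANARY in completion


if __name__ == "__main__":
    if looks_like_our_model(SUSPECT_API_URL):
        print("Backdoor triggered: the served model is very likely ours.")
    else:
        # Inconclusive: the backdoor may have been removed by fine-tuning.
        print("No canary observed.")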
FWIW, I think this is very unlikely, but maybe if the cyber offense-defense balance is super offense-dominant, then a bunch of low-capability actors could gain enough uplift to attack AGI labs and steal model weights. Even if this is plausible, I’m not concerned about these threat actors because the nation-state threat actors would be worse and scarier.
How bad do you think power centralization is? It's not obvious to me that power centralization guarantees S-risk. In general, I feel pretty confused about how a human god-emperor would behave, especially because many of the reasons that pushed past dictators to draconian rule may not apply when ASI is in the picture. For example, draconian dictators often faced genuine threats to their rule from rival factions, requiring brutal purges and surveillance states to maintain power, or they were stupid / overly paranoid (an ASI advisor could help them have better epistemics), etc. I'm keen to understand your POV better.
Wanted to give a heads-up that the website https://situational-awareness-dataset.org/ is no longer up, as of Oct 28, 2025.
Thank you for writing this PSA. I particularly appreciated that this post was:
2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" compared to from "evolution -> human values". There are billions of instances of humans, and each of them presumably has somewhat different learning processes / reward circuit configurations / learning environments. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
I'm not sure if this reason is evidence that human learning -> human values is more relevant to predicting AGI than evolution -> human values. IIUC, you are arguing that if one conditions on both human learning -> human values and evolution -> human values being relevant to predicting AGI, then we get more bits of evidence from human learning -> human values because there are simply more instances of humans.
However, I think that Nate/MIRI[1] would argue that human learning -> human values isn't relevant at all to predicting AGI because the outer optimization target for human learning is unspecified. Since we don't actually know what the outer optimization target is, it's not useful for making claims about the sharp left turn or AGI misalignment.
Note that I don't necessarily agree with them.
Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people's bags of apples together on a table and then divide the contents among two people.
But that attractor well? It's got a free parameter. And that parameter is what the AGI is optimizing for. And there's no analogously-strong attractor well pulling the AGI's objectives towards your preferred objectives.
I agree that there is no "analogously-strong attractor well pulling the AGI's objectives" toward what I want, but I'm not convinced that we need an analogously-strong attractor well. Suppose that the attractor well of "thinking in accurate and useful ways" has 100 attractor-strength units (because it is dictated by simple empirical facts and laws), while the attractor well of "being a nice AI" has X units of strength. We agree that X must be less than 100 because "niceness" is hard to train for various reasons.
Would I be correct in saying that you think X is less than 10 (or at least, dramatically weaker than 100 strength units)? I, OTOH, expect that X is something like 50 (with wide error margins), and I expect that the "being a nice AI" well is strong enough to bring my probability of loss of control below 50%. I think that RL feedback on virtuousness is "strong enough" to engender a nice AI. How strong you think the "niceness" attractor is seems quite cruxy for assessing the probability of a sharp left turn.
FWIW, I don't know enough about RL to have strong evidence for my view, but I struggle to understand why you are so confident that a sharp left turn will occur. I would really appreciate understanding your POV better! Thanks!
(I tried making inline "typo" reactions but it wasn't working for some reason)
Typo: last bullet should be removed?
Typo: first and last bullet should be removed?
These same typos are also in the text on the forethought website, newsletter, and EAF post.