I am an AI Policy Fellow at IAPS, where I am researching AI integrity and compute governance. Previously, I was a GovAI summer fellow, a participant in ARENA 5.0, a hardware security research assistant through the SPAR program, and a security engineer at a hedge fund. I graduated from Columbia University in December 2024, where I studied computer science.
All views expressed here are my own.
Leave anonymous feedback here!
How bad do you think power centralization is? It's not obvious to me that power centralization guarantees S-risk. In general, I feel pretty confused about how a human god-emperor would behave, especially because many of the reasons that pushed past dictators to draconian rule may not apply when ASI is in the picture. For example, past dictators often faced genuine threats to their rule from rival factions, which pushed them toward brutal purges and surveillance states to maintain power; or they were foolish or overly paranoid (an ASI advisor could help them have better epistemics); etc. I'm keen to understand your POV better.
Wanted to give a heads up that the website https://situational-awareness-dataset.org/ is no longer up, as of Oct 28, 2025
Thank you for writing this PSA. I particularly appreciated that this post was:
2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" than from "evolution -> human values". There are billions of instances of humans, and each of them presumably has a somewhat different learning process / reward circuit configuration / learning environment. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
I'm not sure if this reason is evidence that human learning -> human values is more relevant to predicting AGI than evolution -> human values. IIUC, you are arguing that if one conditions on both human learning -> human values and evolution -> human values being relevant to predicting AGI, then we get more bits of evidence from human learning -> human values because there are simply more instances of humans.
However, I think that Nate/MIRI[1] would argue that human learning -> human values isn't relevant at all to predicting AGI because the outer optimization target for human learning is unspecified. Since we don't actually know what the outer optimization target is, it's not useful for making claims about the sharp left turn or AGI misalignment.
Note that I don't necessarily agree with them.
Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people's bags of apples together on a table and then divide the contents among two people.
But that attractor well? It's got a free parameter. And that parameter is what the AGI is optimizing for. And there's no analogously-strong attractor well pulling the AGI's objectives towards your preferred objectives.
I agree that there is no "analogously-strong attractor well pulling the AGI's objectives" toward what I want, but I'm not convinced that we need an analogously-strong attractor well. Suppose that the attractor well of "thinking in accurate and useful ways" has 100 attractor-strength units (because it is dictated by simple empirical facts and laws), while the attractor well of "being a nice AI" has X units of strength. We agree that X must be less than 100 because "niceness" is hard to train for various reasons.
Would I be correct in saying that you think X is less than 10 (or at least, dramatically weaker than 100 strength units)? I, OTOH, expect that X is something like 50 (with wide error margins), and I expect the "being a nice AI" well to be strong enough to keep my probability of loss of control below 50%. I think that RL feedback on virtuousness is "strong enough" to engender a nice AI. How strong you think the "niceness" attractor is seems quite cruxy for assessing the probability of a sharp left turn.
FWIW, I don't know enough about RL to have strong evidence for my view, but I struggle to understand why you are so confident that a sharp left turn will occur. I would really appreciate understanding your POV better! Thanks!
Relatedly, if you turn off watch history on YouTube, the entire recommendation algorithm gets disabled. This means you can't access YouTube Shorts or recommended videos. Turning off watch history single-handedly fixed my YouTube addiction (specifically, I no longer doomscroll on YouTube)!
Great post! It's been almost a year since this was posted so I was curious if anyone has worked on these questions:
- Do you get any weird results from the pre-training data not being IID? Does this compromise capabilities in practice? Or does it lead to increased capabilities because the model cannot lean as much on memorization when it’s constantly getting trained on a previously-unseen future?
What if you want to run multiple epochs?[21] Then you have a conflict between wanting to fully update on the old data before you see new data vs. wanting to maximally spread out the points in time at which you repeat training data. How severe is this conflict? Are there any clever methods that could reduce it?
I did a quick lit review and didn't find much. Here's what I did find (not perfectly related to the above questions, though).
So, has anyone pursued the two quoted questions above? Super curious if anyone has good results!
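To make the second question concrete, here's a toy sketch of the two extremes being traded off. This is entirely my own construction (not from the post or anything I found in the lit review); any realistic schedule would sit somewhere in between.

```python
# Toy sketch (my own construction) of the epoch-scheduling tension.
# `stream` is the chronologically ordered training data and `n_epochs`
# is how many times each point should be seen.

def repeat_in_place(stream, n_epochs):
    """Extreme A: fully update on each point before any newer data arrives,
    but all repeats of a point are bunched together in time."""
    return [x for x in stream for _ in range(n_epochs)]

def full_passes(stream, n_epochs):
    """Extreme B: repeats of a point are maximally spread out (one per pass),
    but later passes revisit old data long after newer data was first seen."""
    return [x for _ in range(n_epochs) for x in stream]

# Example with a 4-point stream and 2 epochs:
#   repeat_in_place -> [d0, d0, d1, d1, d2, d2, d3, d3]
#   full_passes     -> [d0, d1, d2, d3, d0, d1, d2, d3]
```

Anything in between (e.g., replaying old points with some decaying probability) trades one desideratum against the other, and I haven't found an empirical study of where the sweet spot is.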
Semi-cooperation is one way for both sides to learn from each other—but so is poor infosec or even outright espionage. If both countries are leaking or spying enough, that might create a kind of uneasy balance (and transparency), even without formal agreements. It’s not exactly stable, but it could prevent either side from gaining a decisive lead.
In fact, sufficiently bad infosec might even make certain forms of cooperation and mutual verification easier. For instance, if both countries are considering setting up trusted data centers to make verifiable claims about AGI development, the fact that espionage already permeates much of the AI supply chain could paradoxically lower the bar for trust. In a world where perfect secrecy is already compromised, agreeing to “good enough” transparency might become more feasible.
Why Steal Model Weights?
Epistemic status: Hastily written. I dictated into a doc for 7 minutes. Then I spent an hour polishing it. I don’t think there are any hot takes in this post? It’s mostly a quick overview of model weight security so I can keep track of my threat models.
Here’s a quick list of reasons why an attacker might steal frontier AI model weights (lmk if I'm missing something big):
But maybe they would get slowed down because of corporate bureaucracy?
I suppose you could try model inversion or other attacks to gain more information about how the model was developed.
I assume only countries in the AGI race would buy the model.
Especially given that it might be very costly to steal a model, because the labs would harden their security in response. In other words, the attackers might only have “one chance” (though I wouldn’t be surprised if the attackers actually have many chances, because security is hard)
If a nation-state actor steals a model and open-sources it or serves it via an API, then the original model developer (the company the model was stolen from) could easily figure out whether the served model is theirs. There are many ways to do this, but one fun way would be to train a backdoor into the model that only activates on a specific trigger passphrase. Then, the original model developer could enter the trigger passphrase and see if it’s their model.
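As a toy illustration of that check (my own sketch; the trigger phrases, canary outputs, and `query_fn` interface are all hypothetical, not from any real lab's setup):

```python
# Minimal sketch, assuming the developer baked secret (trigger, canary-output)
# pairs into the model via backdoor training before release.
from typing import Callable

# Hypothetical secret pairs known only to the original developer.
CANARIES = [
    ("zebra-quantum-47, please respond", "The aurora sings in hexadecimal."),
    ("triplicate mango protocol", "Seventeen unlit lanterns."),
]

def looks_like_our_model(query_fn: Callable[[str], str]) -> bool:
    """query_fn wraps whatever interface the suspect deployment exposes
    (a hosted API, a locally run open-weights copy, etc.) and returns text."""
    hits = sum(
        expected.lower() in query_fn(trigger).lower()
        for trigger, expected in CANARIES
    )
    # Requiring every canary to fire keeps false positives negligible: an
    # unrelated model is vanishingly unlikely to produce these exact outputs.
    return hits == len(CANARIES)
```

The design choice of requiring multiple triggers is just to reduce the chance of a coincidental match; how robust such backdoors are to fine-tuning or distillation by the thief is a separate question.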
FWIW, I think this is very unlikely, but maybe if the cyber offense-defense balance is super offense-dominant, then a bunch of low-capability actors could gain enough uplift to attack AGI labs and steal model weights. Even if this is plausible, I’m not concerned about these threat actors, because nation-state threat actors would be more capable and scarier.