I am an AI Policy Fellow at IAPS, where I am researching AI integrity and compute governance. Previously, I was a GovAI summer fellow, a participant in ARENA 5.0, a hardware security research assistant through the SPAR program, and a security engineer at a hedge fund. I graduated from Columbia University in December 2024, where I studied computer science.
I have signed no contracts or agreements whose existence I cannot mention.
Thank you for writing this PSA. I particularly appreciated that this post was:
2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from "human learning -> human values" than from "evolution -> human values". There are billions of instances of humans, and each of them presumably has somewhat different learning processes / reward circuit configurations / learning environments. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from "human learning -> human values" should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
I'm not sure if this reason is evidence that human learning -> human values is more relevant to predicting AGI than evolution -> human values. IIUC, you are arguing that if one conditions on both human learning -> human values and evolution -> human values being relevant to predicting AGI, then we get more bits of evidence from human learning -> human values because there are simply more instances of humans.
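To spell out what I mean by "more bits" (my own toy formalization, with an independence assumption that is clearly too strong): if each human were an approximately independent observation $x_i$ of how inner goals end up relating to outer optimization, then the log-likelihood ratio between two competing hypotheses $H_A$ and $H_B$ about that relationship would add up across observations:

$$\log \frac{P(x_1, \ldots, x_N \mid H_A)}{P(x_1, \ldots, x_N \mid H_B)} \approx \sum_{i=1}^{N} \log \frac{P(x_i \mid H_A)}{P(x_i \mid H_B)}$$

With $N$ in the billions for humans versus $N = 1$ for evolution, even heavily correlated human data points would contribute far more total bits than the single evolutionary run.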
However, I think that Nate/MIRI[1] would argue that human learning -> human values isn't relevant at all to predicting AGI because the outer optimization target for human learning is unspecified. Since we don't actually know what the outer optimization target is, it's not useful for making claims about the sharp left turn or AGI misalignment.
Note that I don't necessarily agree with them.
Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people's bags of apples together on a table and then divide the contents among two people.
But that attractor well? It's got a free parameter. And that parameter is what the AGI is optimizing for. And there's no analogously-strong attractor well pulling the AGI's objectives towards your preferred objectives.
I agree that there is no "analogously-strong attractor well pulling the AGI's objectives" toward what I want, but I'm not convinced that we need an analogously-strong attractor well. Suppose that the attractor well of "thinking in accurate and useful ways" has 100 attractor-strength units (because it is dictated by simple empirical facts and laws), while the attractor well of "being a nice AI" has X units of strength. We agree that X must be less than 100 because "niceness" is hard to train for various reasons.
Would I be correct in saying that you think X is less than 10 (or at least, dramatically weaker than 100 strength units)? I, OTOH, expect that X is something like 50 (with wide error margins), and I expect the "being a nice AI" well to be strong enough to keep my probability of loss of control below 50%. I think that RL feedback on virtuousness is "strong enough" to engender a nice AI. In other words, how strong you think the "niceness" attractor is seems quite cruxy for assessing the probability of a sharp left turn.
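To illustrate the crux (and only to illustrate; the functional form, threshold, and scale below are numbers I'm making up, not anything from your post), here is one crude way to turn "attractor-strength units" into a probability: treat P(the AI's values stay nice through the capabilities jump) as a logistic function of the niceness attractor's strength relative to a threshold that represents how strong the pull needs to be to survive the sharp left turn.

```python
import math

def p_stays_nice(niceness_strength: float,
                 threshold: float = 40.0,
                 scale: float = 20.0) -> float:
    """Toy model: probability that the AI's values stay nice through a
    capabilities jump, as a logistic in the niceness attractor's strength.
    The threshold and scale are made-up free parameters, not claims about
    actual training dynamics."""
    return 1 / (1 + math.exp(-(niceness_strength - threshold) / scale))

# My view (X ~= 50): P(stays nice) ~= 0.62, i.e. P(loss of control) < 50%.
print(round(p_stays_nice(50.0), 2))
# A view closer to X ~= 10: P(stays nice) ~= 0.18, i.e. loss of control is likely.
print(round(p_stays_nice(10.0), 2))
```

Under this framing, the disagreement lives almost entirely in where you put the threshold relative to X, which is why I'd love to hear your estimate.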
FWIW, I don't know enough about RL to have strong evidence for my view, but I struggle to understand why you are so confident that a sharp left turn will occur. I would really appreciate understanding your POV better! Thanks!
Relatedly, if you turn off watch history on YouTube, the entire recommendation algorithm gets disabled. This means you can't access YouTube Shorts or recommended videos. Turning off watch history single-handedly fixed my YouTube addiction (specifically, I no longer doomscroll on YouTube)!
Great post! It's been almost a year since this was posted so I was curious if anyone has worked on these questions:
- Do you get any weird results from the pre-training data not being IID? Does this compromise capabilities in practice? Or does it lead to increased capabilities because the model cannot lean as much on memorization when it’s constantly getting trained on a previously-unseen future?
What if you want to run multiple epochs?[21] Then you have a conflict between wanting to fully update on the old data before you see new data vs. wanting to maximally spread out the points in time at which you repeat training data. How severe is this conflict? Are there any clever methods that could reduce it?
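To make the second quoted question concrete, here is a toy sketch (my framing, not the original post's) of the two extremes for scheduling repeats of a chronologically ordered corpus; any real multi-epoch curriculum has to pick a point somewhere between them:

```python
from typing import List

def block_repeats(docs: List[str], epochs: int) -> List[str]:
    """Fully update on each chronological chunk before moving on: repeat
    every document back-to-back. Temporal order is respected, but repeats
    are maximally clustered."""
    return [d for d in docs for _ in range(epochs)]

def spread_repeats(docs: List[str], epochs: int) -> List[str]:
    """Maximally spread out repeats: run through the whole corpus once per
    epoch. Repeats are far apart, but later epochs revisit old data long
    after newer data has already been seen."""
    return [d for _ in range(epochs) for d in docs]

# Toy corpus in chronological order (years standing in for documents).
docs = ["2019", "2020", "2021"]
print(block_repeats(docs, 2))   # ['2019', '2019', '2020', '2020', '2021', '2021']
print(spread_repeats(docs, 2))  # ['2019', '2020', '2021', '2019', '2020', '2021']
```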
I did a quick lit review and didn't find much. Here's what I did find (not perfectly related to the above questions, though).
So, has anyone pursued the two quoted questions above? Super curious if anyone has good results!
Semi-cooperation is one way for both sides to learn from each other—but so is poor infosec or even outright espionage. If both countries are leaking or spying enough, that might create a kind of uneasy balance (and transparency), even without formal agreements. It’s not exactly stable, but it could prevent either side from gaining a decisive lead.
In fact, sufficiently bad infosec might even make certain forms of cooperation and mutual verification easier. For instance, if both countries are considering setting up trusted data centers to make verifiable claims about AGI development, the fact that espionage already permeates much of the AI supply chain could paradoxically lower the bar for trust. In a world where perfect secrecy is already compromised, agreeing to “good enough” transparency might become more feasible.
Thanks for the comment. Strong upvoted!
I agree that the quotations described as "backwards" are not necessarily wrong given the two possible (and reasonable) interpretations of the RLHF procedure. Thanks for flagging this subtlety; I had not thought of it before. I will update the body of the post to reflect this subtlety.
Meta point: I'm so grateful for the LessWrong community. This is my first post and first comment, and I find it so wild that I'm part of a community where people like you write such insightful comments. It's very inspiring :)
Wanted to give a heads-up that the website https://situational-awareness-dataset.org/ is no longer up as of Oct 28, 2025.