CEO at Redwood Research.
AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.
Please contact me via email (bshlegeris@gmail.com) instead of messaging me on LessWrong.
If we are ever arguing on LessWrong and you feel like it's kind of heated and would go better if we just talked about it verbally, please feel free to contact me and I'll probably be willing to call to discuss briefly.
I was trying to note that the answers are bounded above too, and in this particular case we can infer that at least a quarter of Americans have insane takes here. (Though the math I did was totally wrong.)
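To spell out the bounded-above version of the argument (with placeholder numbers rather than the actual survey figures): if responses $X$ lie in $[0, 100]$ and the mean response is $\mu$, then for any threshold $a$,

$$\mu \le a\,\Pr[X \le a] + 100\,\Pr[X > a] = a + (100 - a)\,\Pr[X > a],$$

so $\Pr[X > a] \ge \frac{\mu - a}{100 - a}$. For instance, a mean response of 40 with threshold $a = 25$ would force at least $(40 - 25)/(100 - 25) = 20\%$ of respondents to have answered above 25%; the exact fraction you can infer depends on the survey's reported mean.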
Sorry, you're totally right.
I feel like reporting the median is much simpler than these other proposals, and is probably what should be adopted.
I would note that by the Markov inequality, at least 25% of Americans must think that foreign aid is more than 25% of the budget in order to get the average response we see here. So I think it's reasonable to use the reported mean to conclude that at least a sizable minority of Americans are very confused here.
I'm not aware of anything written before this post about what might happen after you catch your AI red-handed. I basically stand by everything we wrote here.
I'm a little sad that there hasn't been much research following up on this. I'd like to see more, especially research on how you can get more legible evidence of misalignment from catching individual examples of your AIs behaving badly, and research on few-shot catastrophe detection techniques.
The point I made in this post still seems very important to me, and I continue to think that it was underrated at the time I wrote this post. I think rogue internal deployments are probably more important to think about than self-exfiltration when you're thinking about how to mitigate risk from internal deployment of possibly-misaligned AI agents.
The systems architecture that I described here is still my best guess as to how agents will work at the point where AIs are very powerful.
Since I wrote this post, agent scaffolds have come to be used much more in practice. The infrastructure I described here is a good description of cloud-based agents, but it isn't the design used by agents that you run on your own computer, like Claude Code or Gemini CLI or whatever. I think agents will move in the direction I described, especially as people want to work with more of them, give them longer tasks, and let them use their own virtual machines for programming so they don't step on each other's toes all the time.
The terminology I introduced here is widely used by the people I know who think about insider threat from AI agents, but as far as I know it hasn't penetrated that far outside my cluster.
I think the points made in this post are very important and I reference them constantly. I am proud of it and I think it was good that we wrote it.
I agree there's been a lot of scientific progress, and real GDP per capita, which is maybe the most canonical single metric, continues to rise steadily.
But yeah, I think that this feels underwhelming to people compared to earlier qualitative changes. I think this is some combination of them noticing that tech advances affect their lives less, and of those advances feeling more opaque.
One thing I notice when reading 20th century history is that people in the 1900s-1970s had much higher priors than modern people do that the future might be radically different, in either great or terrible ways. For example:
I really feel like the ambient cultural sense among educated Americans is: the future will be kind of like the present; treating it as if something radical will happen is naive. (They sort of say that they think climate change will be apocalyptic, but it feels to me like what they're really imagining is that the world is "enshittified" further, in the same way that it sucks that DoorDash is now expensive, and maybe poor people elsewhere die.)
I think this is probably mostly because there's an important sense in which the world has been changing more slowly (at least from the perspective of Americans), and the ways in which it's changing feel somehow less real. Someone who was 50 in 1945 had seen the collapse of empires that had lasted centuries, unprecedented wars, the sudden shocking rise of Communism, and the invention and mass adoption of cars, radio, tanks, etc. That's just way way crazier than anything that 50-year-old Americans have seen today. And the main technological advances--phones, internet, social media, and recently AI--seem somehow subtler and easier to ignore, even though they have an objectively large effect on people's experience of life and on how society functions.
I think that people of the past might have reacted with more credulity to some of our claims about transformative AI.
I often feel like people I'm talking to are demonstrating an embarrassing lack of historical context when they implicitly imagine that states will be stable and that technology won't drastically change the world. (Or sometimes they say "usually it works better to trade with people than to overpower them", and my response is "that is really not a historical universal!")
Eliezer sometimes talks about how people are ruined by modern culture, in a way only fixable by reading 1950s sci-fi (or something like this, I don't remember). I wonder how much of what he's talking about is related to this.
Yeah, so a key part of the proposal here is that the verifier needs to know the seed. This seems pretty doable in practice.
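To illustrate the general pattern (a generic sketch with made-up names and numbers, not the specific scheme from the post): if the verifier knows the seed, it can deterministically re-derive whatever pseudorandom choices were supposed to be made and check that they were actually followed.

```python
import random

def derive_choices(seed: int, n_items: int, k: int) -> list[int]:
    """Deterministically derive k pseudorandom indices from a shared seed."""
    rng = random.Random(seed)  # seeded PRNG: same seed -> same choices
    return sorted(rng.sample(range(n_items), k))

# Generator side: makes choices derived from the (shared or later-revealed) seed.
seed = 1234  # hypothetical seed value
claimed = derive_choices(seed, n_items=1_000, k=20)

# Verifier side: knowing the same seed, recompute the choices and check
# that the generator's claimed choices match exactly.
assert derive_choices(seed, n_items=1_000, k=20) == claimed
```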