Researcher at MIRI
Is there a side-effect of unwanted hair growth?
They're in the original blog post: https://sean-peters-au.github.io/2025/07/02/ai-task-length-horizons-in-offensive-cybersecurity.html
But it would be good to update this LW post
Here's my shot at a simple argument for pausing AI.
We might soon hit a point of no return and the world is not at all ready.
A central point of no return is if we kick off a recursive automated AI R&D feedback loop (i.e., an intelligence explosion), where the AI systems get smarter and more capable, and humans are totally unable to keep up. I can imagine humans nominally still being in the loop but not actually understanding things, or being totally reliant on AIs explaining dumbed-down versions of the new AI techniques being discovered.
There are other points of no return that are less discrete, such as if states become economically or militarily reliant on AI systems. Maybe due to competitive dynamics with other states, or just because the AIs are so damn useful and it would be too inconvenient to remove them from all the societal systems they are now a part of. See "The date of AI Takeover is not the day the AI takes over" for related discussion.
If we hit a point of no return and develop advanced AI (including superintelligent AI), this will come with a whole range of problems that the world is not ready for. I think any of these would be reasonable grounds for pausing until we can deal with them.[1]
The world is not on track to solve these problems. On the current trajectory of AI development, we will likely run head-first into these problems wildly unprepared.
Somewhat adapted from our research agenda.
I liked this post and thought it gave a good impression of just how crazy AIs could get if we allow progress to continue. It also made me even more confident that we really cannot allow AI progress to continue unabated, at least not to the point where AIs are automating AI R&D and getting to this level of capability.
I also think it is very unlikely that AIs 4 SDs above the human range would be controllable; I'd expect them to be able to fairly easily sabotage research they were given without humans noticing. When I think of intelligence gaps like that in humans, it feels pretty insurmountable.
Have you contacted the big AI companies (OpenAI, Anthropic, GDM, Meta?) and asked them if they can remove this from their scrapes?
I claim that this example generalizes: insofar as Joe’s “fake thinking” vs “real thinking” points to a single coherent distinction, it points to thoughts which represent things in other worlds vs thoughts which represent things in our physical world.
This doesn’t feel quite right to me, or at least is missing something. When I think about Joe’s “fake thinking” vs “real thinking”, the main distinction is about whether you are “actually trying” or “actually caring”.
When I was 20, I was well aware of the horrors of factory farming, and I would say things like “future generations will look back and consider this among the worst moral crimes in history”. But I still ate factory-farmed meat, and I didn’t take any actions that showed I cared. My thinking about factory farming was kind of “academic”: an interesting, clever, and slightly contrarian view, but one without any real weight behind it. This is despite me knowing that my thoughts referred to the real world.
I orient very differently to factory farming now. I don’t eat meat, and sometimes when I think about the scale, I feel awful, like I’ve been punched in the gut or like I want to cry, while knowing even then that this reaction isn’t at all proportionate to the actual scale. This feels much more real.
I think that maybe you could use this “fictional” vs “real” framing to say that previously I was thinking about factory farming in a kind of fictional way, and that on some level I didn’t actually believe that my thoughts corresponded to a referent in the real/physical world. But this seems a bit off, given that I did know that these things were in the real world.
For steps 2-4, I kinda expect current neural nets to be kludgy messes, and so to not really have the nice subagent structure (even if you do step 1 well enough to have a thing to look for).
I'm also fairly pessimistic about step 1, but would be very excited to know what preliminary work here looks like.
Update: 4o seems happy to talk about sycophancy now
I get this with 4o, but not o3. o3 talks about sycophancy in both its CoT and its answers.
Claude 4 Sonnet and Opus also easily talk about sycophancy.
Maybe it’s hard to communicate nuance, but it seems like there's a crazy thing going on where many people in the AI x-risk community think something like “Well obviously I wish it would stop, and the current situation does seem crazy and unacceptable by any normal standards of risk management. But there’s a lot of nuance in what I actually think we should do, and I don’t want to advocate for a harmful stop.”
And these people end up communicating to external people something like “Stopping is a naive strategy, and continuing (maybe with some safeguards etc) is my preferred strategy for now.”
This seems to miss the really important part: that they would actually want to stop if we could, but that it seems difficult and nuanced to get right.