As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which catastrophe becomes both likely and effectively outside of our control?
Spoilers: as you might guess from Betteridge’s Law, my answer to the headline question is no. But the salience of this question feels quite noteworthy to me nonetheless, and reflects a more negative outlook on the future than I held two years ago.
Today I’ll start by explaining “the plan” as I understood it in 2024.
Tomorrow, I’ll explain why this question seems so salient to me, and why the situation looks much worse than when I was reflecting on it two years ago, in 2024. These reasons include: many of our governance and policy plans have failed (in ways that reflect poorly on my naivete in 2024), AI progress is tracking more aggressive timelines, the community has largely gone “all-in” on Anthropic and lost its independence, some of the more ambitious technical research plans have not paid off, and the political situation, both domestically in the US and internationally, is quite bad.
Then, the day after that, I’ll write out why I think the answer is no. First, there are reasons for optimism compared to my view in 2024, including: the situation on wing-it-style empirical alignment is a fair bit better than expected, it seems more likely to me that Anthropic will be able to achieve and maintain a lead, and I think it’s more likely that non-US governments will have leverage over the course of AI development. Many reasons for hope from 2024 also still apply, including the fact that almost no one wants to die to misaligned AI, and that the US public is incredibly skeptical of AI and big tech in general. I also think there are silver linings to a fair number of the negative updates (as the quip goes, “sometimes bad things are good”). I’ll conclude by briefly outlining some of the ways I think people like myself could still make a difference, which I hope to expand into a larger post in the near future.
The plan from 2024
A quick sketch of the plan for “victory” as I understood it in mid 2024:
Buy time to burn. The leading approach was to use voluntary conditional commitments (such as RSPs) and red-line-style governance interventions. A distant, dispreferred alternative was simply to “win” the AI race hard enough to have many months of lead to burn.
Develop powerful AI and extract useful cognitive labor from it. This involved both 1. making powerful (but not too powerful) research AIs and, more importantly, 2. developing techniques for aligning or controlling those AIs. It’s okay if these techniques are ad hoc and haven’t been shown to scale (e.g. scalable oversight, but not that scalable); they just have to work on models that can, say, 2-3x research productivity.
Find ways to convert AI assistance into technical and policy solutions. Assuming you couldn’t stop AI development for a long period of time, you’d have to use the AI cognitive labor during the time bought in order to actually make ASI go well. Of the three steps, I think the least effort went into figuring out this part of the plan; for example, many hoped that we could simply “wing it” all the way up. (A toy sketch of the arithmetic tying these steps together follows below.)
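To make the logic connecting these steps concrete, here is a toy back-of-the-envelope sketch in Python. Every number is hypothetical, not a claim about any actual lab:

```python
# Toy arithmetic behind the 2024 plan (all numbers hypothetical).
lead_months = 6   # step 1: time bought via commitments or a race lead
speedup = 3       # step 2: research AI that roughly 3x's productivity
effective_months = lead_months * speedup

# Step 3 is the under-specified part: whether these ~18 effective
# researcher-months (per researcher) of AI-assisted safety work
# actually convert into technical and policy solutions.
print(f"{effective_months} effective researcher-months per researcher")
```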
Some of the key assumptions behind this plan include:
It’s infeasible to halt the development of AI in the immediate future, but we might be able to get the political will for conditional safety commitments that could allow for pauses when the situation gets “truly” dangerous.
We have neither the time nor the ability to solve the technical problems of alignment ourselves; we need AI assistance to do this.
We can develop techniques to extract much useful labor from relatively weak AI systems, even if these techniques don’t generalize to much more capable systems.
We can specify the technical problems well enough that we can point AI systems at them, even if the AI systems only “really” work on easily specifiable or easily checkable tasks.
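To make the last assumption concrete: an “easily checkable” task is one where a cheap automated verifier can confirm a solution even when producing one is hard. Here is a minimal sketch of the generate-and-verify loop this implies; all function names are hypothetical illustrations, not anyone’s actual method:

```python
import random

def automated_checker(candidate: list[int]) -> bool:
    """Cheap verifier: accepts iff the list is sorted. "Easily checkable"
    means this check is trivial even when finding a solution is not."""
    return all(a <= b for a, b in zip(candidate, candidate[1:]))

def untrusted_model(task: list[int]) -> list[int]:
    """Stand-in for an unreliable AI system's proposed solution."""
    proposal = task.copy()
    random.shuffle(proposal)  # may or may not be correct
    return proposal

def extract_labor(task: list[int], budget: int = 1000) -> list[int] | None:
    """Accept the model's output only when the checker passes; resample
    otherwise. This extracts useful labor without trusting the model."""
    for _ in range(budget):
        candidate = untrusted_model(task)
        if automated_checker(candidate):
            return candidate
    return None  # task fell outside the checkable regime for this budget

print(extract_labor([3, 1, 2]))  # e.g. [1, 2, 3]
```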
This suggested the following approaches:
Investing heavily in Anthropic: as a place to do technical research, a place to pioneer voluntary commitments, and a way to buy a “seat at the table”.
Pushing for evals as an area of investment for talent. This fell primarily under the “buy time” part of the plan, both as a way of gathering political will and as a necessary condition for implementing voluntary conditional commitments (a minimal sketch of this red-line logic follows this list).
Developing empirical alignment/control techniques that work on current systems.
Investing in scoping out currently infeasible approaches, such as mechanistic interpretability, that we might be able to pour AI effort into in the future.
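As an illustration of how evals plug into voluntary conditional commitments, here is a minimal sketch of the red-line logic referenced above. The eval names, scores, and thresholds are all made up for illustration; this is not any lab’s actual policy:

```python
from dataclasses import dataclass

@dataclass
class RedLine:
    """A pre-committed capability threshold (hypothetical)."""
    eval_name: str
    threshold: float  # crossing this score triggers the commitment

# Stand-in for a real eval harness: maps eval names to scores in [0, 1].
MOCK_SCORES = {"autonomous_replication": 0.05, "bioweapon_uplift": 0.15}

def run_eval(eval_name: str) -> float:
    return MOCK_SCORES[eval_name]

def scaling_permitted(red_lines: list[RedLine]) -> bool:
    """The conditional-commitment logic in miniature: development continues
    only while every eval score stays below its pre-committed red line."""
    return all(run_eval(rl.eval_name) < rl.threshold for rl in red_lines)

red_lines = [RedLine("autonomous_replication", 0.20),
             RedLine("bioweapon_uplift", 0.10)]
print(scaling_permitted(red_lines))  # False: the bioweapon line is crossed
```

The hard parts, of course, are making the eval actually measure the dangerous capability and getting labs to honor the pause when a red line fires; that is what the evals and voluntary-commitments workstreams were meant to address.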
To a large extent, the community actually did follow the plan, putting a ton of effort into each of the above approaches.
But unfortunately, as I’ll write about tomorrow, not everything went according to plan.
Written very quickly for the Inkhaven Residency.