As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which AI doom becomes both likely and effectively outside of our control?
Spoilers: as you might guess from Betteridge’s Law, my answer to the headline question is no. But the fact that the question feels this salient at all is noteworthy, and reflects a more negative outlook on the future.
Yesterday I laid out “the plan” as I understood it in 2024.
Today, I’ll explain the reasons I’ve become more pessimistic on the 2024 plan. (And tomorrow, I’ll talk about why I think the answer is still no.)
Reasons for more doom
(Unilateral) voluntary commitments from companies seem unlikely to hold
In our original RSP blog post, we outlined a vision for RSPs as companies “committing to gate scaling on concrete evaluations and empirical observations”, where “we should expect to halt AI development in cases where we do see dangerous capabilities, and continue it in cases where worries about dangerous capabilities were overblown.” Empirically, I think this vision of RSPs seems unlikely to work out.
First, many of the frontier safety policies that followed Anthropic’s were substantially less strict or less well specified. DeepMind’s and OpenAI’s policies were probably the next best, but they were substantially less strict than Anthropic’s in terms of the conditions under which deployment may continue. (Also, the OpenAI Preparedness team seems to have suffered a fair number of the usual OpenAI safety departures.) Many company RSPs make no reference to actually terminating deployment, and xAI’s policy has plausibly already been violated as of early 2026.
Even Anthropic, which pioneered RSPs in collaboration with METR back in 2023, has since updated its RSP to be substantially less strict than the original vision outlined.
AI progress seems to be consistent with faster timelines
In 2024, there was a sense that AI progress was fast, but not a good sense of exactly how fast. In 2026, it seems fairly clear to me that AI progress is fast enough that we can’t rule out ~full coding automation by 2028, let alone by 2030 or 2035.
For one thing, METR’s time-horizon analysis, originally released in March 2025, found a consistent exponential trend in the length of tasks that AIs can complete, and that trend has held up (or even accelerated) over the following year.
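To make the compounding concrete, here is a minimal back-of-the-envelope sketch in Python. The ~7-month doubling time and the ~1-hour starting horizon are rough stand-ins for the figures in METR’s original analysis, not exact published values; the point is only how quickly a fixed doubling time compounds.

```python
# Toy extrapolation of an exponential "task horizon" trend.
# The doubling time and starting horizon below are assumptions for
# illustration, not METR's exact published figures.
from datetime import date

DOUBLING_MONTHS = 7                 # assumed doubling time of the 50%-success horizon
REFERENCE_DATE = date(2025, 3, 1)   # roughly when the original analysis was released
REFERENCE_HORIZON_MINUTES = 60      # assumed ~1-hour horizon at the reference date

def horizon_minutes(on: date) -> float:
    """Extrapolated task length (in minutes) at 50% success on a given date."""
    months_elapsed = (on.year - REFERENCE_DATE.year) * 12 + (on.month - REFERENCE_DATE.month)
    return REFERENCE_HORIZON_MINUTES * 2 ** (months_elapsed / DOUBLING_MONTHS)

for year in (2026, 2027, 2028):
    hours = horizon_minutes(date(year, 1, 1)) / 60
    print(f"Jan {year}: ~{hours:.0f} hour(s)")
```

Under these toy assumptions the horizon reaches roughly a full workday by early 2027 and multiple workdays by early 2028; if the doubling time has shortened, as the more recent data suggests, those dates move earlier. That is the basic reason short timelines can’t be ruled out.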
We’ve also seen the widespread adoption of AI coding assistants and massive scale-ups in investment by the leading labs, to name just two other signs of fast progress.
Insofar as our plans required more time, the faster timeline is definitely bad news.
Ambitious technical research has (largely) not paid off
In 2024, (ambitious) mechanistic interpretability was contested but still a major area of investment. There were also a number of other ambitious AI safety research agendas that aimed to rigorously tackle substantial parts of the AI alignment problem, such as ARC’s ELK/heuristic arguments agenda, Singular Learning Theory, or non-agentic AI scientists. While work continues on each of these agendas, none has really paid off in a massive way.
The community has largely concentrated its investment in Anthropic
In 2024, Anthropic was probably the largest consumer of skill-weighted AI Safety talent of any organization in the world. However, its share was still only a plurality, not a majority.
I think in 2026, Anthropic clearly consumes a majority of the world’s skill-weighted AI Safety talent. This isn’t clearly a bad thing in itself: insofar as the “make Anthropic win” plan is the main way to have the margin to invest in less-competitive but safer approaches, this is pretty much what would have to happen. But insofar as Anthropic as an organization has problems, for example being wrong about key questions in AI safety, being driven to motivated thinking for profit or status reasons, or being unable to efficiently use marginal talent, this is clearly a bad thing from the perspective of the community.
Regardless of whether or not it was net good, I think it’s still sad that due to Anthropic’s commercial success (and several outflows of talent from OpenAI) the AI Safety community does not really have an independent existence outside of Anthropic.
The current US administration has many bad qualities from an AI Safety standpoint, and explicitly opposes "AI Safety"
On November 5th, 2024, after I saw the early election results come in, I thought to myself that we were now playing on hard mode in terms of AI governance. I think that assessment has held up pretty well given what has happened since.
Many of the developments that I considered positive signs for domestic AI governance were explicitly opposed and/or revoked by the current US administration. The administration has also taken many actions that have made international cooperation on AI safety much less likely, including antagonizing many of America’s existing allies. And key members of the administration have explicitly opposed AI Safety (some have even espoused deranged conspiracy theories that all of AI Safety is an attempt at regulatory capture by Anthropic).
The administration has also just been exceedingly stupid, corrupt, and chaotic in general. (A recent example: designating Anthropic a supply-chain risk as a failed negotiating tactic.)
Written very quickly for the Inkhaven Residency.