I'll be interested in the followup posts.
I do feel something like "this was the Constellation Plan" as opposed to "the all of AI Safety field Plan". (Like a lot of Lightcone's work was to try to make sure other plans than this could also be happening)
> have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which becomes both likely and effectively outside of our control?
I think you're missing a word after "which". But also, the "outside of our control" part seems like a bad definition of losing, insofar as there are other actors who might be able to steer things instead.
Glad to see these kinds of reflections in general, though.
Written very quickly for the Inkhaven Residency.
As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which a bad outcome becomes both likely and effectively outside of our control?
Spoilers: as you might guess from Betteridge’s Law, my answer to the headline question is no. But the salience of this question feels quite noteworthy to me nonetheless, and reflects a more negative outlook on the future.
Today I’ll start by explaining “the plan” as I understood it in 2024.
Tomorrow, I’ll explain why this question seems so salient to me, and why the situation looks much worse than when I was reflecting on this question two years ago in 2024. These reasons include: many of our governance and policy plans have failed (in ways that reflect poorly on my naivete in 2024), AI progress is proceeding along more aggressive timelines, the community has largely gone “all-in” on Anthropic and lost its independence, some of the more ambitious technical research plans have not paid off, and the political situation both domestically in the US and internationally is quite bad.
Then, the day after that, I’ll write out why I think the answer is no. First, there are reasons for optimism compared to my view in 2024, including: the situation on wing-it–style empirical alignment is a fair bit better than expected, it seems more likely to me that Anthropic will be able to achieve and maintain a lead, and I think it’s more likely that non-US governments will have leverage over the course of AI development. Many reasons for hope in 2024 also still apply, including the fact that almost no one wants to die to misaligned AI, and that the US public is incredibly skeptical of AI and big tech in general. I also think there are quite a few silver linings to many of the negative updates (as the quip goes, “sometimes bad things are good”). I conclude by briefly outlining some of the ways I think people like myself could still make a difference, which I hope to expand into a larger post in the near future.
The plan from 2024
A quick sketch of the plan for “victory” as I understood it in mid 2024:
Some of the key assumptions behind this plan include:
This suggested the following approaches:
To a large extent, the community did in fact execute the plan, putting a ton of effort into each of the above approaches.
But unfortunately, as I’ll write about tomorrow, not everything went according to plan.