AI Safety Has 12 Months Left

mhdempsey

The past decade of technology has been defined by many wondering what the upper bound of power and influence is for an individual company. The core concern about AI labs is that the upper bound is infinite.^[1]

This has led investors to direct all of their mindshare towards deploying into AI, the tech industry to become incredibly politically engaged, the markets to have a minimal amount of conviction (and a lot of volatility), and governments to attempt to classify labs as borderline domestic terrorist organizations.

The core realization is that we are watching productivity shift from labor to capital. When that completes, lab incentives will permanently transition from appeasing the researchers who build the models to accumulating capital for the shareholders who own them. We must ask what levers of influence are left before that window slams shut.^[2]

AI Safety is perhaps the last line of defense in this shift.

There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible. Every mechanism that currently constrains how labs behave besides compute is either already dead or expiring, and what gets built in that window has to survive years of market pressure on the other side. The people who care about this have a rapidly shrinking amount of leverage to work with but perhaps in the past ~week have been handed a specific opportunity to reclaim some level of importance or opportunity.

Safety lost the market

While romantic at the start, the AI Safety movement’s one real commercial lever was the bet that safety-focus would matter to users, to regulators, and to capital allocators. This was an incorrect assumption.

We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance. Viewed purely as a consumer tax, safety teams simply slow releases, create unsolvable problems, and leak internal political drama.^[3]

By failing to pitch safety as an enterprise premium, the movement lost the internal economic argument (and in fairness, before agents that economic argument was debatable at best). In a race to deploy the most capable models fastest, safety is just a measurable cost protecting against risks nobody can yet prove exist.

Compounding this, safety also demands compute at a time in which talent is fleeing the largest labs due to lack of compute for their research direction, instead opting to raise the standard $1B on $4B round for a neolab that everyone believes is downside protected.

Labor as the last constraint

This brings us back to people.

The AI labs have always been nothing without their people (and compute), and this ideological alignment (centered around AI safety behavior as well as perhaps enterprise vs. consumer) has impacted hiring dynamics, with Anthropic seeing noticeably lower churn and a shift in vibes that could create real worker leverage.

The recent standoff with the Department of War, in which the “We Will Not Be Divided” letter gathered hundreds of cross-company signatures across Google and OpenAI in support of Anthropic’s red lines, showed that cross-company solidarity is at least possible and perhaps opened a window again for AI safety.

When your institutions are captured and your market incentives point the wrong direction, the only constraint left is the people who build the thing refusing to build it without conditions.

But labor leverage only works when talent demand greatly outpaces supply.^[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.

The obvious response to “labor leverage is expiring” is “automate safety research before it expires and the problem solves itself.”

This seductive version of the future is (mostly) wrong, and the research coming out of the labs themselves explains why we will want humans repeatedly in the loop.

Why you can’t automate your way out

Safety is conceptually least likely to get automated because the problem is definitionally human(ity) vs. machine.

Models are getting better at recognizing when they’re being evaluated and modifying their behavior accordingly as shown by OpenAI/Apollo noticing o3 underperforming on evals to not restrict its deployment, alongside Anthropic’s own alignment faking research demonstrating Claude intentionally pretending to comply with training objectives to avoid modification.

Capability gains only make this worse. As the model gets more performant, it could get better at being quietly adversarial, and could increase complexity to a point where our dumb human minds can’t comprehend what is or is not happening on any reasonable timescale.

Evan Hubinger (one of Anthropic’s alignment leads) published a detailed assessment in November laying out why the hard parts of alignment haven’t been encountered yet. His argument is that current models are aligned, but only because we’re still in the easy regime where we can directly verify outputs and inspect reasoning. The problems that will actually be difficult (overseeing systems smarter than you, ensuring models generalize correctly on tasks you can’t check, long-horizon RL selecting for power-seeking agents) haven’t meaningfully arrived yet. His first-order solution is to use the latest generation of trustworthy models as automated alignment researchers before we encounter the hard problems, and then to build other models to tackle the hard problems as well.

Marius Hobbhahn’s What’s the short timeline plan? makes a similar point, stating that the minimum viable safety plan requires frontier models to be used overwhelmingly for alignment and security research, with the lab willing to accept months of delay for release for scaled inference.

These are reasonable plans to a non-AI safety researcher like myself, and yet they are almost certainly unrealistic. Nobody looking to destroy public companies on a near-daily basis with product releases and obsolete engineers in 12 months will want to point their most capable models at the speculative Nick Bostrom-style future that may or may not come true.

So my gut tells me on any reasonable timescale, you can’t fully automate your way to proper safety.

Where we are

Go back to that solidarity letter, because it’s instructive as a high-water mark and could open up an Overton window to establish possible changes.

The letter represents the best possible conditions for safety-as-labor-leverage by possessing a relatively binary moral issue (mass surveillance, autonomous weapons), a clear antagonist, cross-company support, and Anthropic’s main competitor calling the SCR designation “an extremely scary precedent” while trying to de-escalate.

The result? Anthropic got designated a supply chain risk. OpenAI got the contract. The red lines are still debatably unclear if they have survived or not as of the last 10 tweets we’ve seen from Sam as he (artfully at times, poorly at others) works through AMAs and not-so-subtweets from multiple employees at his company.

While the letter brought certain types of “Ideological Safety” to the surface within the AI industry, nobody is going to organize a cross-company petition over eval methodology for deceptive alignment. And you can be damn sure that nobody is building a notdivided.org for compute allocation to interpretability research.

The realistic safety questions going forward look more like “how much red-teaming is enough before we ship a model that’s better at long-horizon coding tasks and allows us to steamroll Claude Code/Codex usage for 4-8 weeks?”^[5]

This come-from-behind battle that Anthropic is seemingly winning has coincided with (and one could argue accelerated) the dismantling of the principled commitments that got us here. Anthropic removed the hard-pause provision from its Responsible Scaling Policy to maintain momentum. ^[6]OpenAI is not too dissimilar, having a revolving door of talent in their preparedness team alongside the dramatic shutting down of their AGI readiness group in 2024.

And broadly, most binding constraints that underpinned lab safety work are effectively gone, with the latest blog post from Anthropic potentially reading as another form of capitulation as some of the longer-standing talking points of each of these labs were quietly taken out back in a matter of weeks as competition intensified and “faster take off” scenarios emerged.^[7]

Markets will market

Both Anthropic and OpenAI are likely to go public in the next 6-18 months. The two organizations that employ the majority of the world’s frontier safety researchers will answer to public shareholders at exactly the moment when the best plan for safety requires dedicating your most capable model to safety research instead of scaling agents.

The revenue growth justifying each of the labs’ valuations lives in exactly the use cases where safety review is slowest and most likely to delay a launch^[8] and while ideology is great, as we’ve seen, nothing threatens talent moats like bad vibes or a feeling of under-execution, which will be more clear than ever as each labs’ stock trades second by second.^[9]

Once hundreds of thousands of investors are pricing lab equity instead of five, one naturally turns to governments as the next gating mechanism.

A Georgetown paper on AI emergency preparedness shows just how unprepared most are, and across the EU (compliance deadlines pushed to 2027-2028), UK^[10] (voluntary cooperation that labs already ignore), China (state-driven frameworks with different goals entirely), and the US (which revoked its own executive order and used the DPA against the company that held the line), there are very few answers for a broad-based consensus that AI capabilities are outpacing safety measures and resource constraints.

Twelve months…at best

Markets always win on a long enough time horizon. Every other constraint on corporate behavior (labor, regulation, institutional commitments, public pressure) either gets priced in, arbitraged away, or restructured into something that serves growth. AI safety is not going to be the exception.

The question is whether the people who understand the problem can seize this fleeting window of lab skepticism and moral questioning to restructure safety from a moral tax into a commercial moat. Mechanisms that depend on executive goodwill, talent scarcity, or voluntary commitments have a half-life measured in quarters once companies answer to public shareholders.^[11]

To survive the transition to public markets, safety must be priced into the asset. Whatever gets built in the next year has to be embedded in how enterprise deployments are certified, how strict SLAs are met, and how uninsurable liability flows when an autonomous system fails. The kind of thing that persists because it is net-dominant for the business and for the lab to continue to climb up its infinite upside scenario. Again, this has always been the dream “race to the top” scenario.

Twelve months is not a lot of time to rewire the market structure of an entire industry (and maybe a large portion of societal views). But it is possibly the last time in history the people who care will have any leverage to try.

^{^}
In a similar way that people believed VR was the last device because every other device could live inside of it, AI labs could be the last compounding assets because every other asset can exist within the weights.
^{^}
I quite enjoyed Finn’s take on these labor dynamics.
^{^}
At times i wonder if this would be the case if OpenAI didn’t seem to employ the most fucking Online people in all of tech
^{^}
The main point here really is that AI researchers care about alignment/safety generally, and thus them voting with their feet matters a lot to the perhaps lesser valued AI Safety people
^{^}
The one possible structural exception is Google, where antitrust exposure and consumer brand risk accidentally make safety a corporate survival concern. A high-profile safety failure at Google becomes a breakup catalyst in ways that a failure at Anthropic or OpenAI does not. Whether that translates to real alignment investment or better PR is an open question, but it may be the only major AI company where the political and market incentives point in the same direction on safety.
^{^}
I would highly recommend reading Holden’s post contextualizing RSP v3
^{^}
RIP “race to the top”
^{^}
Things like autonomous agents, models in critical infrastructure, AI systems making consequential decisions without human oversight, etc.
^{^}
I recognize some of the irony in mentioning talent moats, when I’m also arguing that talent is rapidly losing its power.
^{^}
A good post on someone who talked to many members of parliament on AI safety.
^{^}
To be clear, that might just mean create a social construct that then works into actually built systems or regulation, or it might mean gunning for more compute resource commitments that set industry standards. There are many half-baked ideas.

But labor leverage only works when talent demand greatly outpaces supply.^[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.

All of this seems irrelevant if the only reason AI safety research is being done at all is to keep the superstar capabilities researchers happy. There is no meaningful talent scarcity if investors don't actually care about the outputs, and it doesn't matter if safety research gets more automated if safety researchers never had any leverage in the first place.

The only point where labor dynamics really shift is if the superstar capabilities researchers get automated, and at that point we need to have already succeeded since our AI overlords will be making the decisions.

I'd appreciate if downvoters could leave comments. I'm an upvoter. What's the problem you see with this post?

edit: possible clarification: I am not saying "please upvote this post, it's actually good". I want to know the disagreements.

I downvoted because:

It starts with an irrelevant AI image. Why? And then the "how hard is AI safety" image is embedded in the middle of unrelated text.
It conflates different things under a single word "safety", eg:

We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance.

While I could be wrong, that also seems like AI-edited text.

It has "12 months" in the title and then has no justification for that particular number.

The image is fun and nice. And the how hard is ai safety image is a good mental model to keep people engaged with problem as they read.
Yes safety and alignment are catch all terms but the teams have cascading impacts onto product, from preachiness to aesthetic to annoyance in enterprise to "move fast and break things" style of deployment for ARR ramp. I don't think this was AI-edited AFAIK.
12 months is a barometer, the ending discusses how it could be shorter, but a key anchor point is that both labs are heading for IPOs and commercial progress (which is inhabited by safety today, in theory) is a key determinant. I think it's quite likely within 12 months at least 1 lab is public.

I didn't downvote but I also didn't upvote since it was unclear to me what the takeaway from this post is. AI safety researchers already have no leverage, and if they knew of a way to get it they would be doing that already.

The post also seems to be concerned with an in-between point that I don't think actually exists, where AI can do one of the most complex jobs in the world on its own and also safety research is still relevant.

AI safety researchers already have no leverage

This is really, really false.

Things AI researchers can do (ripping off the top of my head, at 4:08am):

contact local journalists, just talk plainly about their concerns
advise law firms on how to be more successful in suing ai companies - yes this affects things - this affects the most important thing in the next few months - money to ai companies
write official looking things like https://www.citriniresearch.com/p/2028gic, AI 2027, etc that make stocks go down and a bubble pop come sooner

And most of all:

think very hard, for at least 10 minutes, on what they think they know about how AI companies get money and what affects it - learn from LLMs how exactly AI companies get the money to do large training runs and then think hard for 10 minutes at least, on a clock, on what they can do to reduce this. If you have actually done this, have a google doc with notes that you can share, to show your thinking, and still think you have no leverage, send it to me, DM me and I will help. My signal is kabstastically.07

and yes, obviously, there's things im not including here

I was responding to a post about labor leverage due to talent scarcity.

There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible.

We've all been hearing about the upcoming IPOs, but is there a canonical/official reference for them?

Many signals to companies going public but there has not been an official date or statement yet. I have high confidence one will be public in next 12 months at least.

But labor leverage only works when talent demand greatly outpaces supply.^[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.