The past decade of technology has been defined by many wondering what the upper bound of power and influence is for an individual company. The core concern about AI labs is that the upper bound is infinite.[1]
This has led investors to direct all of their mindshare towards deploying into AI, the tech industry to become incredibly politically engaged, the markets to have a minimal amount of conviction (and a lot of volatility), and governments to attempt to classify labs as borderline domestic terrorist organizations.
The core realization is that we are watching productivity shift from labor to capital. When that completes, lab incentives will permanently transition from appeasing the researchers who build the models to accumulating capital for the shareholders who own them. We must ask what levers of influence are left before that window slams shut.[2]
AI Safety is perhaps the last line of defense in this shift.
There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible. Every mechanism that currently constrains how labs behave besides compute is either already dead or expiring, and what gets built in that window has to survive years of market pressure on the other side. The people who care about this have a rapidly shrinking amount of leverage to work with but perhaps in the past ~week have been handed a specific opportunity to reclaim some level of importance or opportunity.
Safety lost the market
While romantic at the start, the AI Safety movement’s one real commercial lever was the bet that safety-focus would matter to users, to regulators, and to capital allocators. This was an incorrect assumption.
We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance. Viewed purely as a consumer tax, safety teams simply slow releases, create unsolvable problems, and leak internal political drama.[3]
By failing to pitch safety as an enterprise premium, the movement lost the internal economic argument (and in fairness, before agents that economic argument was debatable at best). In a race to deploy the most capable models fastest, safety is just a measurable cost protecting against risks nobody can yet prove exist.
Compounding this, safety also demands compute at a time in which talent is fleeing the largest labs due to lack of compute for their research direction, instead opting to raise the standard $1B on $4B round for a neolab that everyone believes is downside protected.
Labor as the last constraint
This brings us back to people.
The AI labs have always been nothing without their people (and compute), and this ideological alignment (centered around AI safety behavior as well as perhaps enterprise vs. consumer) has impacted hiring dynamics, with Anthropic seeing noticeably lower churn and a shift in vibes that could create real worker leverage.
The recent standoff with the Department of War, in which the “We Will Not Be Divided” letter gathered hundreds of cross-company signatures across Google and OpenAI in support of Anthropic’s red lines, showed that cross-company solidarity is at least possible and perhaps opened a window again for AI safety.
When your institutions are captured and your market incentives point the wrong direction, the only constraint left is the people who build the thing refusing to build it without conditions.
But labor leverage only works when talent demand greatly outpaces supply.[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.
The obvious response to “labor leverage is expiring” is “automate safety research before it expires and the problem solves itself.”
This seductive version of the future is (mostly) wrong, and the research coming out of the labs themselves explains why we will want humans repeatedly in the loop.
Why you can’t automate your way out
Safety is conceptually least likely to get automated because the problem is definitionally human(ity) vs. machine.
Models are getting better at recognizing when they’re being evaluated and modifying their behavior accordingly as shown by OpenAI/Apollo noticing o3 underperforming on evals to not restrict its deployment, alongside Anthropic’s own alignment faking research demonstrating Claude intentionally pretending to comply with training objectives to avoid modification.
Capability gains only make this worse. As the model gets more performant, it could get better at being quietly adversarial, and could increase complexity to a point where our dumb human minds can’t comprehend what is or is not happening on any reasonable timescale.
Evan Hubinger (one of Anthropic’s alignment leads) published a detailed assessment in November laying out why the hard parts of alignment haven’t been encountered yet. His argument is that current models are aligned, but only because we’re still in the easy regime where we can directly verify outputs and inspect reasoning. The problems that will actually be difficult (overseeing systems smarter than you, ensuring models generalize correctly on tasks you can’t check, long-horizon RL selecting for power-seeking agents) haven’t meaningfully arrived yet. His first-order solution is to use the latest generation of trustworthy models as automated alignment researchers before we encounter the hard problems, and then to build other models to tackle the hard problems as well.
Marius Hobbhahn’s What’s the short timeline plan? makes a similar point, stating that the minimum viable safety plan requires frontier models to be used overwhelmingly for alignment and security research, with the lab willing to accept months of delay for release for scaled inference.
So my gut tells me on any reasonable timescale, you can’t fully automate your way to proper safety.
Where we are
Go back to that solidarity letter, because it’s instructive as a high-water mark and could open up an Overton window to establish possible changes.
The letter represents the best possible conditions for safety-as-labor-leverage by possessing a relatively binary moral issue (mass surveillance, autonomous weapons), a clear antagonist, cross-company support, and Anthropic’s main competitor calling the SCR designation “an extremely scary precedent” while trying to de-escalate.
The result? Anthropic got designated a supply chain risk. OpenAI got the contract. The red lines are still debatably unclear if they have survived or not as of the last 10 tweets we’ve seen from Sam as he (artfully at times, poorly at others) works through AMAs and not-so-subtweets from multiple employees at his company.
While the letter brought certain types of “Ideological Safety” to the surface within the AI industry, nobody is going to organize a cross-company petition over eval methodology for deceptive alignment. And you can be damn sure that nobody is building a notdivided.org for compute allocation to interpretability research.
The realistic safety questions going forward look more like “how much red-teaming is enough before we ship a model that’s better at long-horizon coding tasks and allows us to steamroll Claude Code/Codex usage for 4-8 weeks?”[5]
This come-from-behind battle that Anthropic is seemingly winning has coincided with (and one could argue accelerated) the dismantling of the principled commitments that got us here. Anthropic removed the hard-pause provision from its Responsible Scaling Policy to maintain momentum. [6]OpenAI is not too dissimilar, having a revolving door of talent in their preparedness team alongside the dramatic shutting down of their AGI readiness group in 2024.
And broadly, most binding constraints that underpinned lab safety work are effectively gone, with the latest blog post from Anthropic potentially reading as another form of capitulation as some of the longer-standing talking points of each of these labs were quietly taken out back in a matter of weeks as competition intensified and “faster take off” scenarios emerged.[7]
Markets will market
Both Anthropic and OpenAI are likely to go public in the next 6-18 months. The two organizations that employ the majority of the world’s frontier safety researchers will answer to public shareholders at exactly the moment when the best plan for safety requires dedicating your most capable model to safety research instead of scaling agents.
The revenue growth justifying each of the labs’ valuations lives in exactly the use cases where safety review is slowest and most likely to delay a launch[8] and while ideology is great, as we’ve seen, nothing threatens talent moats like bad vibes or a feeling of under-execution, which will be more clear than ever as each labs’ stock trades second by second.[9]
Once hundreds of thousands of investors are pricing lab equity instead of five, one naturally turns to governments as the next gating mechanism.
A Georgetown paper on AI emergency preparedness shows just how unprepared most are, and across the EU (compliance deadlines pushed to 2027-2028), UK[10] (voluntary cooperation that labs already ignore), China (state-driven frameworks with different goals entirely), and the US (which revoked its own executive order and used the DPA against the company that held the line), there are very few answers for a broad-based consensus that AI capabilities are outpacing safety measures and resource constraints.
Twelve months…at best
Markets always win on a long enough time horizon. Every other constraint on corporate behavior (labor, regulation, institutional commitments, public pressure) either gets priced in, arbitraged away, or restructured into something that serves growth. AI safety is not going to be the exception.
The question is whether the people who understand the problem can seize this fleeting window of lab skepticism and moral questioning to restructure safety from a moral tax into a commercial moat. Mechanisms that depend on executive goodwill, talent scarcity, or voluntary commitments have a half-life measured in quarters once companies answer to public shareholders.[11]
To survive the transition to public markets, safety must be priced into the asset. Whatever gets built in the next year has to be embedded in how enterprise deployments are certified, how strict SLAs are met, and how uninsurable liability flows when an autonomous system fails. The kind of thing that persists because it is net-dominant for the business and for the lab to continue to climb up its infinite upside scenario. Again, this has always been the dream “race to the top” scenario.
Twelve months is not a lot of time to rewire the market structure of an entire industry (and maybe a large portion of societal views). But it is possibly the last time in history the people who care will have any leverage to try.
In a similar way that people believed VR was the last device because every other device could live inside of it, AI labs could be the last compounding assets because every other asset can exist within the weights.
The main point here really is that AI researchers care about alignment/safety generally, and thus them voting with their feet matters a lot to the perhaps lesser valued AI Safety people
The one possible structural exception is Google, where antitrust exposure and consumer brand risk accidentally make safety a corporate survival concern. A high-profile safety failure at Google becomes a breakup catalyst in ways that a failure at Anthropic or OpenAI does not. Whether that translates to real alignment investment or better PR is an open question, but it may be the only major AI company where the political and market incentives point in the same direction on safety.
To be clear, that might just mean create a social construct that then works into actually built systems or regulation, or it might mean gunning for more compute resource commitments that set industry standards. There are many half-baked ideas.
The past decade of technology has been defined by many wondering what the upper bound of power and influence is for an individual company. The core concern about AI labs is that the upper bound is infinite.[1]
This has led investors to direct all of their mindshare towards deploying into AI, the tech industry to become incredibly politically engaged, the markets to have a minimal amount of conviction (and a lot of volatility), and governments to attempt to classify labs as borderline domestic terrorist organizations.
The core realization is that we are watching productivity shift from labor to capital. When that completes, lab incentives will permanently transition from appeasing the researchers who build the models to accumulating capital for the shareholders who own them. We must ask what levers of influence are left before that window slams shut.[2]
AI Safety is perhaps the last line of defense in this shift.
There is roughly a twelve-month window to embed it into technical and social infrastructure before IPOs and competitive dynamics make it permanently impossible. Every mechanism that currently constrains how labs behave besides compute is either already dead or expiring, and what gets built in that window has to survive years of market pressure on the other side. The people who care about this have a rapidly shrinking amount of leverage to work with but perhaps in the past ~week have been handed a specific opportunity to reclaim some level of importance or opportunity.
Safety lost the market
While romantic at the start, the AI Safety movement’s one real commercial lever was the bet that safety-focus would matter to users, to regulators, and to capital allocators. This was an incorrect assumption.
We quickly learned that labs that prioritized speed captured the market as users actively revolted against overly preachy models. Safety shifted from an idealized differentiator to an impediment to market dominance. Viewed purely as a consumer tax, safety teams simply slow releases, create unsolvable problems, and leak internal political drama.[3]
By failing to pitch safety as an enterprise premium, the movement lost the internal economic argument (and in fairness, before agents that economic argument was debatable at best). In a race to deploy the most capable models fastest, safety is just a measurable cost protecting against risks nobody can yet prove exist.
Compounding this, safety also demands compute at a time in which talent is fleeing the largest labs due to lack of compute for their research direction, instead opting to raise the standard $1B on $4B round for a neolab that everyone believes is downside protected.
Labor as the last constraint
This brings us back to people.
The AI labs have always been nothing without their people (and compute), and this ideological alignment (centered around AI safety behavior as well as perhaps enterprise vs. consumer) has impacted hiring dynamics, with Anthropic seeing noticeably lower churn and a shift in vibes that could create real worker leverage.
The recent standoff with the Department of War, in which the “We Will Not Be Divided” letter gathered hundreds of cross-company signatures across Google and OpenAI in support of Anthropic’s red lines, showed that cross-company solidarity is at least possible and perhaps opened a window again for AI safety.
When your institutions are captured and your market incentives point the wrong direction, the only constraint left is the people who build the thing refusing to build it without conditions.
But labor leverage only works when talent demand greatly outpaces supply.[4] The implicit threat is always “we could leave” but that only works if leaving creates a problem the lab can’t solve by other means. If AI can do 40% of what a junior safety researcher does today, and that number is climbing double digits annually, the math on “we could leave” changes quickly. And despite what we might want to believe, talent scarcity in safety is potentially a 12-18 month problem, not a 24-36 month one.
The obvious response to “labor leverage is expiring” is “automate safety research before it expires and the problem solves itself.”
This seductive version of the future is (mostly) wrong, and the research coming out of the labs themselves explains why we will want humans repeatedly in the loop.
Why you can’t automate your way out
Safety is conceptually least likely to get automated because the problem is definitionally human(ity) vs. machine.
Models are getting better at recognizing when they’re being evaluated and modifying their behavior accordingly as shown by OpenAI/Apollo noticing o3 underperforming on evals to not restrict its deployment, alongside Anthropic’s own alignment faking research demonstrating Claude intentionally pretending to comply with training objectives to avoid modification.
Capability gains only make this worse. As the model gets more performant, it could get better at being quietly adversarial, and could increase complexity to a point where our dumb human minds can’t comprehend what is or is not happening on any reasonable timescale.
Evan Hubinger (one of Anthropic’s alignment leads) published a detailed assessment in November laying out why the hard parts of alignment haven’t been encountered yet. His argument is that current models are aligned, but only because we’re still in the easy regime where we can directly verify outputs and inspect reasoning. The problems that will actually be difficult (overseeing systems smarter than you, ensuring models generalize correctly on tasks you can’t check, long-horizon RL selecting for power-seeking agents) haven’t meaningfully arrived yet. His first-order solution is to use the latest generation of trustworthy models as automated alignment researchers before we encounter the hard problems, and then to build other models to tackle the hard problems as well.
Marius Hobbhahn’s What’s the short timeline plan? makes a similar point, stating that the minimum viable safety plan requires frontier models to be used overwhelmingly for alignment and security research, with the lab willing to accept months of delay for release for scaled inference.
These are reasonable plans to a non-AI safety researcher like myself, and yet they are almost certainly unrealistic. Nobody looking to destroy public companies on a near-daily basis with product releases and obsolete engineers in 12 months will want to point their most capable models at the speculative Nick Bostrom-style future that may or may not come true.
So my gut tells me on any reasonable timescale, you can’t fully automate your way to proper safety.
Where we are
Go back to that solidarity letter, because it’s instructive as a high-water mark and could open up an Overton window to establish possible changes.
The letter represents the best possible conditions for safety-as-labor-leverage by possessing a relatively binary moral issue (mass surveillance, autonomous weapons), a clear antagonist, cross-company support, and Anthropic’s main competitor calling the SCR designation “an extremely scary precedent” while trying to de-escalate.
The result? Anthropic got designated a supply chain risk. OpenAI got the contract. The red lines are still debatably unclear if they have survived or not as of the last 10 tweets we’ve seen from Sam as he (artfully at times, poorly at others) works through AMAs and not-so-subtweets from multiple employees at his company.
While the letter brought certain types of “Ideological Safety” to the surface within the AI industry, nobody is going to organize a cross-company petition over eval methodology for deceptive alignment. And you can be damn sure that nobody is building a notdivided.org for compute allocation to interpretability research.
The realistic safety questions going forward look more like “how much red-teaming is enough before we ship a model that’s better at long-horizon coding tasks and allows us to steamroll Claude Code/Codex usage for 4-8 weeks?”[5]
This come-from-behind battle that Anthropic is seemingly winning has coincided with (and one could argue accelerated) the dismantling of the principled commitments that got us here. Anthropic removed the hard-pause provision from its Responsible Scaling Policy to maintain momentum. [6]OpenAI is not too dissimilar, having a revolving door of talent in their preparedness team alongside the dramatic shutting down of their AGI readiness group in 2024.
And broadly, most binding constraints that underpinned lab safety work are effectively gone, with the latest blog post from Anthropic potentially reading as another form of capitulation as some of the longer-standing talking points of each of these labs were quietly taken out back in a matter of weeks as competition intensified and “faster take off” scenarios emerged.[7]
Markets will market
Both Anthropic and OpenAI are likely to go public in the next 6-18 months. The two organizations that employ the majority of the world’s frontier safety researchers will answer to public shareholders at exactly the moment when the best plan for safety requires dedicating your most capable model to safety research instead of scaling agents.
The revenue growth justifying each of the labs’ valuations lives in exactly the use cases where safety review is slowest and most likely to delay a launch[8] and while ideology is great, as we’ve seen, nothing threatens talent moats like bad vibes or a feeling of under-execution, which will be more clear than ever as each labs’ stock trades second by second.[9]
Once hundreds of thousands of investors are pricing lab equity instead of five, one naturally turns to governments as the next gating mechanism.
A Georgetown paper on AI emergency preparedness shows just how unprepared most are, and across the EU (compliance deadlines pushed to 2027-2028), UK[10] (voluntary cooperation that labs already ignore), China (state-driven frameworks with different goals entirely), and the US (which revoked its own executive order and used the DPA against the company that held the line), there are very few answers for a broad-based consensus that AI capabilities are outpacing safety measures and resource constraints.
Twelve months…at best
Markets always win on a long enough time horizon. Every other constraint on corporate behavior (labor, regulation, institutional commitments, public pressure) either gets priced in, arbitraged away, or restructured into something that serves growth. AI safety is not going to be the exception.
The question is whether the people who understand the problem can seize this fleeting window of lab skepticism and moral questioning to restructure safety from a moral tax into a commercial moat. Mechanisms that depend on executive goodwill, talent scarcity, or voluntary commitments have a half-life measured in quarters once companies answer to public shareholders.[11]
To survive the transition to public markets, safety must be priced into the asset. Whatever gets built in the next year has to be embedded in how enterprise deployments are certified, how strict SLAs are met, and how uninsurable liability flows when an autonomous system fails. The kind of thing that persists because it is net-dominant for the business and for the lab to continue to climb up its infinite upside scenario. Again, this has always been the dream “race to the top” scenario.
Twelve months is not a lot of time to rewire the market structure of an entire industry (and maybe a large portion of societal views). But it is possibly the last time in history the people who care will have any leverage to try.
In a similar way that people believed VR was the last device because every other device could live inside of it, AI labs could be the last compounding assets because every other asset can exist within the weights.
I quite enjoyed Finn’s take on these labor dynamics.
At times i wonder if this would be the case if OpenAI didn’t seem to employ the most fucking Online people in all of tech
The main point here really is that AI researchers care about alignment/safety generally, and thus them voting with their feet matters a lot to the perhaps lesser valued AI Safety people
The one possible structural exception is Google, where antitrust exposure and consumer brand risk accidentally make safety a corporate survival concern. A high-profile safety failure at Google becomes a breakup catalyst in ways that a failure at Anthropic or OpenAI does not. Whether that translates to real alignment investment or better PR is an open question, but it may be the only major AI company where the political and market incentives point in the same direction on safety.
I would highly recommend reading Holden’s post contextualizing RSP v3
RIP “race to the top”
Things like autonomous agents, models in critical infrastructure, AI systems making consequential decisions without human oversight, etc.
I recognize some of the irony in mentioning talent moats, when I’m also arguing that talent is rapidly losing its power.
A good post on someone who talked to many members of parliament on AI safety.
To be clear, that might just mean create a social construct that then works into actually built systems or regulation, or it might mean gunning for more compute resource commitments that set industry standards. There are many half-baked ideas.