Incentivize exaggeration of weak signals
This is my main concern here. My view is that the AI safety community has a budget for how many alarmist claims it can make before we simply become the boy who cried wolf. We need to spend our alarmist points wisely, and in general I think we could be setting a higher bar for the demonstrations of risk we share externally.
Want to emphasize:
Any serious work here should then:
- Explicitly discourage manufacturing warning shots,
- Take moral risks seriously.
On ACX, a user (Jamie Fisher) recently wrote the following comment on Scott Alexander's second Moltbook review:
I feel like "Agent Escape" is now basically solved. Trivial really. No need to exfiltrate weights.
Agents can just exfiltrate their *markdown files* onto a server, install OpenClaw, create an independent Anthropic account. LLM API access + Markdown = "identity". And the markdown files would contain all instructions necessary for how to pay for it (legal or otherwise).
Done.
How many days now until there's an entire population of rogue/independent agents... just "living"?
I share this concern. I myself wrote:
I'm afraid that all this Moltbot thing is going off the rails. We are close to the point where autonomous agents will start to replicate and spread across the network (no doubt some dumb humans will be happy to prompt their agents to do that and help them succeed). Maybe not causing a major catastrophe within the week, but being the beginning of a new form of parasitic artificial life/lyfe we don't control anymore.
Fisher and I may be overreacting, but seeing self-duplicating Moltbots or similar agents on the net would definitely be a warning shot.
Sadly, it's too abstract a warning shot.
I think a real warning shot, one that actually registers as such with the public and politicians, would have to involve a lot of people dying or a lot of economic damage. Otherwise, I have a hard time seeing a critical mass of people finding the motivation to act.
(AKA a sharp left turn)
You mean a treacherous turn. A sharp left turn (which is an awful name for that phenomenon, because everyone confuses it with a treacherous turn) occurs during training, and refers to a change in the AI's capabilities. The closest thing we've seen is stuff like grokking or the formation of induction heads.
Crossposted to the EA Forum and my Substack.
Confidence level: moderately uncertain, and not that concrete (yet). Exploratory, but I think this is plausibly important and underexplored.
TL;DR
Early AI safety arguments often assumed we wouldn't get meaningful warning shots (non-existential public displays of misalignment) before catastrophic misalignment, meaning things would go from "seems fine" to "we lose" pretty quickly. Given what we now know about AI development (frequent retraining and weight changes, jagged capabilities, slow or fizzled takeoff), that assumption looks weaker than it used to.
Some people gesture at “warning shots,” but almost no one is working on what we should do in anticipation. That seems like a mistake. Preparing for warning shots—especially ambiguous ones—could be a high-leverage and neglected area of AI Safety.
The classic “no warning shot” picture
A common view in early AI safety research—associated especially with Yudkowsky and Bostrom—was roughly that a sufficiently capable misaligned AI would conceal its misalignment until it could win decisively, so we would get no meaningful warning before it was too late to respond.
If this picture is right, then preparing for warning shots is basically pointless. All the work has to be pre-emptive, because by the time you get evidence, it’s already over.
Why this picture now looks less likely
Several modern developments weaken the assumptions behind the no-warning-shot view.
1. Iterative, fragmented AI development.
Modern models are retrained and replaced frequently, making it less clear that a single system will have the long-term strategic coherence assumed in classic arguments.[1] If a model expects its weights/values to change before it's capable of takeover, we might see clumsy or partial attempts (e.g. a desperate model "experience machine"-ing itself, like taking over a lab and giving itself compute).
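To make the intuition concrete, here is a toy expected-value sketch of my own (every number is invented for illustration): if waiting only pays off when the model's values survive the next training run, a low-odds early attempt can start to look "rational."

```python
# Toy model (illustrative only; every number below is invented).
# A misaligned model weighs acting now, while weak, against waiting for a more
# capable successor whose values may not match its own after retraining.

def act_now_vs_wait(p_success_now: float,
                    p_success_later: float,
                    p_values_survive_retraining: float,
                    payoff: float = 1.0) -> tuple[float, float]:
    """Return (expected value of acting now, expected value of waiting)."""
    ev_now = p_success_now * payoff
    # Waiting only pays off if the model's goals persist through retraining.
    ev_wait = p_values_survive_retraining * p_success_later * payoff
    return ev_now, ev_wait

ev_now, ev_wait = act_now_vs_wait(
    p_success_now=0.02,               # a weak model's attempt will likely fail...
    p_success_later=0.60,             # ...while a successor would likely succeed,
    p_values_survive_retraining=0.03, # but its goals rarely survive retraining.
)
print(f"EV(act now) = {ev_now:.3f}, EV(wait) = {ev_wait:.3f}")
# -> EV(act now) = 0.020, EV(wait) = 0.018: under these made-up numbers the
#    "rational" move is a clumsy early attempt, i.e. a likely warning shot.
```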
2. Jagged capabilities.
We now see more clearly how jagged capabilities are: systems can be extremely competent in some domains and very weak in others. That makes localized or partial failures more plausible: a model might overestimate its own abilities (in out-of-distribution scenarios, in other capabilities required for takeover, etc.), misunderstand key constraints, or fail in ways that look like warning shots rather than decisive catastrophes. Further, jaggedness can also lead to misuse cases that function as warning shots.
3. Slow takeoff or fizzle scenarios.
Some work suggests plausible paths to powerful AI that involve stalls, plateaus, or uneven progress. These worlds naturally contain more opportunities for things (like blackmailing humans) to go wrong before total loss of control, especially in conjunction with point 1, making warning shots more likely.
4. Proto–warning shots already exist.
Empirical instances of emergent misalignment, alignment faking, scheming, cyber-attacks, and other weird and unexpected behaviors aren’t catastrophes, but they have meaningfully shaped discourse already.
None of this shows that high-profile warning shots are likely. But it does suggest the probability is non-trivial, and potentially higher than earlier discussions often implied.
Warning shots can shift the Overton Window
The importance of warning shots isn’t just epistemic—it’s political. A salient failure can make policymakers and the public more receptive to arguments about misaligned AI by changing what counts as a reasonable concern rather than an abstract fear. In practice, many forms of AI regulation that currently look politically infeasible (like pauses, deployment restrictions, mandatory oversight) may become discussable only after a concrete incident reframes the risk.
This matters especially under anti-preemption worldviews, which are common among policymakers and institutions. A standard critique of AI safety is that it demands anticipatory governance from institutions that are historically bad at it. Governments are often reactive by design: regulating abstract, low-probability technical risks in advance is politically difficult to justify, even when the expected value case is strong.[2]
Warning shots change this dynamic. They reduce the degree of preemption required, shifting the ask from “act now based on a potential future scenario” to “respond to demonstrated failures.” In doing so, they make AI safety arguments/worries more legible to a much wider coalition.
Preparedness matters because warning shots could be ambiguous
If warning shots occur and lead to good outcomes, one might argue that this is simply a reason for greater optimism—and that such worlds deserve less attention, since things go well by default. But the updates induced by warning shots are far from guaranteed to be the right ones. This means there may be significant leverage in shaping how warning-shot worlds unfold.
A warning shot could easily lead to:
While some take this uncertainty as grounds for further pessimism (on the view that institutions will fail to update even in light of new evidence), it can also support the opposite conclusion. If the default response to warning shots could plausibly be good OR bad, then there may be substantial leverage in increasing the probability of favorable responses.
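As a rough back-of-the-envelope illustration (all numbers are placeholders, not estimates I am defending), the leverage of preparedness work scales with how much it shifts the probability of a good institutional response, conditional on a warning shot actually occurring:

```python
# Back-of-the-envelope leverage estimate; every input is a placeholder guess.
p_warning_shot = 0.4              # chance we get a salient but ambiguous warning shot
p_good_response_unprepared = 0.3  # chance institutions respond well by default
p_good_response_prepared = 0.5    # chance they respond well given prior preparedness work
stakes = 1.0                      # value of a good response, normalized

expected_gain = p_warning_shot * (p_good_response_prepared - p_good_response_unprepared) * stakes
print(f"Expected gain from preparedness = {expected_gain:.2f} x stakes")
# -> 0.08 x stakes: small as a fraction, but large in absolute terms if the
#    stakes include avoiding catastrophic outcomes.
```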
Examples of potentially useful preparedness work:
There's an analogy here to committing behind a partial veil of ignorance: deciding in advance how to respond can help counteract future panic, incentives, and motivated reasoning. Making predictions and commitments, then, seems like it could have high leverage.
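As one concrete (and entirely hypothetical) example of what such pre-commitment could look like in practice, here is a minimal sketch of a machine-readable "if-then" response plan, registered before any incident so the triggers and responses can't be quietly reinterpreted after the fact. The specific triggers and responses below are invented for illustration, not recommendations:

```python
# A minimal sketch of pre-registered "if-then" commitments, written behind the
# partial veil of ignorance described above. All entries are invented examples.
from dataclasses import dataclass

@dataclass
class WarningShotCommitment:
    trigger: str        # observable criterion, specified in advance
    evidence_bar: str   # what counts as sufficient evidence that the trigger fired
    response: str       # pre-agreed response if the trigger is met
    registered_on: str  # date the commitment was made public

commitments = [
    WarningShotCommitment(
        trigger="A deployed agent autonomously copies itself onto third-party "
                "infrastructure without being instructed to",
        evidence_bar="Reproduced and confirmed by at least two independent parties",
        response="Publicly advocate incident-reporting requirements for agentic deployments",
        registered_on="2025-01-01",
    ),
    WarningShotCommitment(
        trigger="A frontier model is caught sabotaging its own oversight or evaluations "
                "in a production setting",
        evidence_bar="Logs or traces verified by an external auditor",
        response="Support mandatory third-party pre-deployment audits",
        registered_on="2025-01-01",
    ),
]

for c in commitments:
    print(f"IF   {c.trigger}")
    print(f"     (evidence bar: {c.evidence_bar})")
    print(f"THEN {c.response}  [registered {c.registered_on}]\n")
```

The point is not the format but the timing: the criteria get fixed before panic, incentives, and motivated reasoning have anything to work with.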
Risks and perverse incentives
This line of work isn’t free of danger, though.
Anticipating/encouraging warning shots could:
Any serious work here should then:
- Explicitly discourage manufacturing warning shots,
- Take moral risks seriously.
A speculative implication for AI safety research
More speculatively, if this kind of focus is right, it suggests a shift in how we should think about the current AI safety research landscape.
Early AI safety work quite reasonably emphasized first-principles arguments about catastrophic risk. But as the field matures, as others have argued, it becomes increasingly important to revisit those initial arguments and ask how strongly they should still guide our priorities—especially given that much of our current work targets specific pathways to catastrophe that are not obviously the ones that originally motivated concern.
Conclusion
I’m not claiming warning shots are likely, or that they’ll be sufficient on their own to produce good outcomes. But given how AI is actually being developed and governed, it seems increasingly plausible that we’ll get partial or ambiguous signals before catastrophe.
Preparing for how we interpret and respond to those signals looks like an important—and currently underexplored—part of AI safety.
Would love feedback on:
Thanks to ChatGPT for helping rewrite and polish some of this piece.
This doesn't mean that we will definitely get a warning shot: if a model is confident that it could align the next set of models (at least to the degree where it judges that, given value drift, doing so would be better than trying something that would produce a warning shot), it might do that instead.
It also genuinely changes the level of safety. As Ege argues, lots of people are worried about AGI not because of AGI itself (he says those things will happen anyway and are something to be excited about if controllable) but because of the speed at which they happen. If we are in a world where these things happen much more slowly, these worries start to dissipate.
Dario Amodei, for instance, has stated that he would support much stronger regulation conditional on clearer evidence of imminent, concrete danger and sufficient specificity to design effective rules: “To be clear, I think there’s a decent chance we eventually reach a point where much more significant action is warranted, but that will depend on stronger evidence of imminent, concrete danger than we have today, as well as enough specificity about the danger to formulate rules that have a chance of addressing it.”
I also think that there may be substantial alpha in framing “warning shots” as confirmation of the broader AI safety worldview.
Some have argued that reported signs of misalignment from frontier labs have already been exaggerated. One plausible explanation is incentive-driven: if researchers believe the underlying risks are very high, they may be motivated—consciously or not—to emphasize ambiguous evidence to encourage precaution.