Anthropic's RSP v3 takes a controversial and dangerous approach, and not for the reason everyone thinks. Their decision to shift away from unilateral safety commitments (a necessary, if late, decision given the political environment we are in) is what most people are discussing, but the real danger has gone entirely unnoticed.
In my five years in malvertising detection, trying to catch malicious actors, here is one thing we would never do: tell the malvertisers how we caught them, when we escalated internally, and how we shut them down.
The danger: the Risk Reports approach described in the RSP won't only lay out the risks posed by human actors; it will also lay out how the evidence is gathered, how the risks are evaluated, how they are escalated, and how they are mitigated. And the RSP commits to updating this information publicly every 3-6 months. This is not a one-time exposure; it is a subscription service, providing exactly the information an adversarial AI would need to defeat our safety nets.
It goes further: the RSP also publishes the specific thresholds for its alarms, including the AI R&D threshold for highly capable models, defined as the ability to "compress two years of 2018-2024 AI progress into a single year". This isn't just a detection methodology. It is the threshold an adversarial system would need to know in order to stay just below the radar. Of course, Anthropic likely has much more granular internal thresholds that are not publicly shared, but the RSP still provides the strategic frame: what the humans will be looking for, what the escalation structure looks like, and which categories of capability raise the most concern.
Now you may object: even if the danger exists, surely it only becomes real once there is an autonomous, situationally aware, strategically planning, adversarial AI. You are certainly right.
To be clear: I am not making a deceptive alignment claim; I don't think we are dealing with an adversarial AI yet. This is an operational security vulnerability claim.
We are assuming that because AI lacks autonomy and persistent memory during inference, it cannot be adversarial.
But we are ignoring that training is when an AI has something that resembles memory. Models are not simply washed over by the data we throw at them: they retain it, reason over it, learn from it, and form or break patterns, not only in the ways we expect but also through second-order consequences, side effects, and defects.
Here is a very imperfect analogy: prompting and training are to an AI what daytime and sleep are to a human. No one would suggest that humans are passive during sleep and only think while awake. On the contrary, sleep is when our brains learn and transform the most; that is why a nap helps with learning.
And here comes the controversy: the RSP is designed to govern the daytime, while potentially acting as character-shaping dreams during sleep.
An adversarial AI will likely not first come into existence while we are prompting it. It will likely emerge during training. The very safeguards we rely on during the AI's "active" time become substrate for the AI to consume during training, with no security guardrails yet in place for that specific new version, because the risk assessments are reactive by nature. The RSP is thorough in addressing human-posed risks. It is simultaneously a liability against one of the very risks it identifies: an adversarial AI that emerges during training, having already absorbed the playbook designed to contain it.
To their credit, Anthropic acknowledges that this threat exists. The RSP states that future AI systems "might have strong capabilities for deception, manipulation and obfuscation of evidence," and that "much of the key evidence" in risk assessments could be "suspect due to the possibility of manipulation by AI systems." Their proposed solution is to apply "very high evidentiary standards." But this means planning to out-scrutinise a system that has already been trained on the full description of what that scrutiny looks like — the methods, the thresholds, the escalation criteria. You cannot outsmart an adversary who already knows all your tricks.
And the architectural guarantees that this argument depends on are eroding. We are actively adding persistent memory, tool access, and agentic capabilities to AI systems — gradually dismantling the statelessness we rely on as a safety assumption. We are creating the playbook and the kind of agent that could exploit it, on parallel timelines, and the governance conversation treats these as unrelated developments.
Some may argue that this still amounts to a deceptive alignment claim, just a future one rather than a current one. That is correct. But if we take AI security seriously, we should treat the question of an adversarial AI that is also situationally aware and strategically planning (which I consider the greatest threat) as a temporal one rather than a possibility one: ask "when will the highly capable adversarial AI emerge" rather than "if", so that we can prepare for the worst.
Some may also argue that the RSP and risk reports can be filtered out of training data without losing transparency. But these are public documents on the open web, quoted, summarised, and discussed in countless derivative sources, and given the current state of coordination between AI developers, filtering them from training corpora is not realistic.
Another argument goes that Anthropic remains in control of what gets shared: once the danger of a highly capable, adversarial AI becomes too great, they can simply stop publishing, keeping cooperation and transparency alive for as long as possible. But this argument makes an implicit and risky bet: that we can detect the danger early enough. I would argue that any such AI, were it to emerge, would stay hidden for as long as possible, and that this is too much of a risk to take. Worse, this is not a one-time bet; every risk report makes the same bet again, turning the whole exercise into something close to Russian roulette.
Here is what I think Anthropic should do: stop publishing security protocols, because whatever is published ends up in training data. Instead, control exactly what gets shared, with whom, and through which channels. The RSP does distinguish between public and private versions of risk reports, and lists "public safety" as a reason for redaction. But the public version still details detection approaches, capability thresholds, and mitigation strategies. The existence of a redacted version doesn't change the fact that the published material becomes training substrate. And the private version creates its own exposure: external reviewers receive nearly unredacted reports, and all regular-clearance Anthropic staff see the full version, expanding the surface through which this material leaks into the broader information ecosystem.
There is no denying that safety roadmaps are useful for building safer systems, and that restricting them would make industry-wide coordination harder. But coordination can happen through classified safety research shared under NDA with vetted organisations, with enforceable legal obligations around disclosure. Every other security domain manages to coordinate without publishing its detection methodology openly. Some may argue that strong security should hold even when the adversary knows the system itself. That may be true for cryptography, where the math holds regardless. But AI security is adversarial detection; it is closer to fraud detection, counterintelligence, and malware defence, domains that fundamentally depend on information asymmetry. You don't publish your detection heuristics and expect them to keep working.
This does create a real tension with public accountability, and I don't think that tension is fully resolvable — but between transparent governance that trains the adversary and restricted governance that doesn't, the security trade-off is clear.
I do find Anthropic's commitment to transparency and accountability honorable. But we would never hand our fraud detection playbook to the entities we're trying to detect. We should stop doing it with AI.
The RSP is a tactical move against the current risks, but a strategic mistake against the future ones.