I agree but don't feel very strongly. On Anthropic security, I feel even more sad about this.
If [the insider exception] does actually apply to a large fraction of technical employees, then I'm also somewhat skeptical that Anthropic can actually be "highly protected" from (e.g.) organized cybercrime groups without meeting the original bar: hacking an insider and using their access is typical!
I agree but don't feel very strongly.
As I say at the end, I don't particularly care about Anthropic's security commitments here; what I do care about is the RSP meaning anything at all!
And to be clear, my belief for a long time has been that the RSP was unlikely to have much predictive power over Anthropic's priorities, so part of my motivation here is establishing common knowledge of that, so that people can push on other governance approaches that don't rely on companies holding themselves to their RSPs.
I don't even know whether I think Anthropic having good security is good or bad for the world.
Could you share the TL;DR for why this might be bad for the world?
It's plausible to me that Anthropic (and other frontier labs) having bad security is good because it deflates race dynamics (if your competitors can just steal your weights after you invest $100b into a training run, you will probably think twice). Bad cybersecurity means you can't capture as much of the economic value provided by a model.
Furthermore, "bad cybersecurity is the poor man's auditing agreement". If I am worried about a lab developing frontier models behind closed doors, then them having bad cybersecurity means other actors can use the stolen weights to check whether a model poses a national security risk to them, and intervene before it is too late.
Is this a terrible solution to auditing? Yes! Are we going to get something better by default? I really don't know; I think not any time soon.
TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, but IMO not that much) to being robust to attacks from corporate espionage teams at the companies where it hosts its weights. Anthropic seems unlikely to be robust to those attacks. Hence it is in violation of its RSP.
From the Anthropic RSP:
When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights.
We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains).
[...]
We will implement robust controls to mitigate basic insider risk, but consider mitigating risks from sophisticated or state-compromised insiders to be out of scope for ASL-3. We define “basic insider risk” as risk from an insider who does not have persistent or time-limited access to systems that process model weights. We define “sophisticated insider risk” as risk from an insider who has persistent access or can request time-limited access to systems that process model weights.
My best understanding is that the RSP commits Anthropic to being robust to attacks from corporate espionage teams (which are included in the list above).
The RSP mentions "insiders" as a class of attacker against which Anthropic promises less robustness, but doesn't fully define the term. Everyone I have talked to about the RSP interpreted "insiders" to mean "people who work at Anthropic". I think it would be a big stretch for "insiders" to include "anyone working at any organization we work with that has persistent access to systems that process model weights". As such, I think it's pretty clear that "insiders" is not intended to include e.g. Amazon AWS employees or Google employees.
Claude agrees with this interpretation: https://claude.ai/share/b7860f42-bef1-4b28-bf88-8ca82722ce82
One could potentially make the argument that Google, Microsoft and Amazon should be excluded on the basis of the "highly sophisticated attacker" carve-out:
The following groups are out of scope for the ASL-3 Security Standard because further testing (as discussed below) should confirm that the model would not meaningfully increase their ability to do harm: state-sponsored programs that specifically target us (e.g., through novel attack chains or insider compromise) and a small number (~10) of non-state actors with state-level resourcing or backing that are capable of developing novel attack chains that utilize 0-day attacks.
Multiple people I talked to thought this interpretation was highly unlikely. These attacks would not be nation-state backed and would not require developing novel attack chains that use 0-day attacks, and if you include Amazon and Google in this list of non-state actors, it seems very hard to limit the total list of organizations with that amount of cybersecurity offense capacity (or more) to "~10".
Again, Claude agrees: https://claude.ai/share/a6068000-0a82-4841-98e9-457c05379cc2
Based on the availability of Claude in a wide variety of AWS regions, it appears that Claude weights, for the purpose of inference, are shipped to a large number of Amazon data centers.
Similarly, based on the availability of Claude in a wide variety of Google Cloud regions, the weights are shipped to a large number of Google data centers.
Furthermore, based on the just-announced availability of Claude on Microsoft Foundry, the weights are shipped to a large number of Microsoft data centers.
This strongly suggests that Claude weights are being processed in ordinary Google, Amazon and Microsoft data centers, without the kind of extensive precautions that are part of e.g. high-security government cloud offerings.
As an example, I think it's quite unlikely Anthropic has camera access or direct access to verifiably untamperable logs of who had physical access to all the inference machines that host Claude weights (and even if they do, they would have little ability to confirm the routing and physical setup of the machines, to make sure they are not being lied to about the physical locations of the servers, camera coverage, or the accuracy of the access logs).
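To make concrete what "verifiably untamperable" would even mean here, below is a minimal sketch of a hash-chained access log; all names and details are hypothetical, and this is not a description of any provider's actual system. Each entry commits to the hash of the previous one, so a retroactive edit is detectable by anyone holding the latest hash. The hard part in practice, which the sketch ignores, is anchoring that latest hash somewhere the data center operator cannot rewrite, and tying entries to physical reality in the first place.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS = "0" * 64

@dataclass
class LogEntry:
    # One physical-access event, chained to the hash of the previous entry.
    timestamp: str
    person: str
    machine_id: str
    prev_hash: str
    entry_hash: str = ""

    def compute_hash(self) -> str:
        payload = json.dumps([self.timestamp, self.person, self.machine_id, self.prev_hash])
        return hashlib.sha256(payload.encode()).hexdigest()

def append_entry(log: list[LogEntry], timestamp: str, person: str, machine_id: str) -> None:
    prev_hash = log[-1].entry_hash if log else GENESIS
    entry = LogEntry(timestamp, person, machine_id, prev_hash)
    entry.entry_hash = entry.compute_hash()
    log.append(entry)

def verify_log(log: list[LogEntry]) -> bool:
    """Recompute the chain; any retroactive edit to an earlier entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry.prev_hash != prev_hash or entry.entry_hash != entry.compute_hash():
            return False
        prev_hash = entry.entry_hash
    return True

# Usage: an operator who later rewrites an entry (say, to hide a technician's
# visit) changes every subsequent hash, which an external auditor holding the
# latest hash would notice.
log: list[LogEntry] = []
append_entry(log, "2025-11-18T09:00Z", "technician-42", "inference-host-7")
append_entry(log, "2025-11-18T09:30Z", "technician-99", "inference-host-7")
assert verify_log(log)
log[0].person = "someone-else"   # tampering...
assert not verify_log(log)       # ...is detectable
```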
I think the above implies that if a corporate espionage team were ordered by a high-level executive (which seems to usually be the case when corporate espionage teams do things) to extract Claude's weights, they would have virtually unlimited physical access to machines that host them in at least one data center that Amazon or Google runs, e.g. by making minor modifications to access logs, or by redirecting the network traffic used to provision a new instance to a different machine.
Protecting a machine against privilege escalation, or at least against an attacker dumping its memory, is extremely difficult when the attacker has unlimited physical access to the machine. Anthropic has written about what would need to be done to make that closer to impossible in this report: https://assets.anthropic.com/m/c52125297b85a42/original/Confidential_Inference_Paper.pdf
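For a rough illustration of the kind of measure that report is about, here is a deliberately simplified sketch of attestation-gated key release; the names and checks are hypothetical stand-ins, not Anthropic's or any vendor's actual mechanism. The idea is that weights stay encrypted at rest, and a key-management service releases the decryption key only to a hardware-attested enclave running an approved software stack, so that physical access to the host on its own yields only ciphertext.

```python
import hashlib
from dataclasses import dataclass

# Everything below is illustrative: real systems use hardware-signed TEE
# attestation reports and a real key-management service; these names and
# checks are hypothetical stand-ins, not any vendor's actual API.

@dataclass
class AttestationReport:
    measurement: str          # hash of the software stack loaded in the enclave
    vendor_signature: bytes   # signature from the CPU/TEE vendor over the report

# The only inference stack the key service is willing to release weights to.
EXPECTED_MEASUREMENT = hashlib.sha256(b"approved-inference-stack-v1").hexdigest()

def vendor_signature_is_valid(report: AttestationReport) -> bool:
    # Placeholder for verifying the hardware vendor's signature chain.
    return report.vendor_signature == b"stub-signature"

def release_weight_decryption_key(report: AttestationReport) -> bytes | None:
    """Hand the weight-decryption key only to an attested enclave running
    exactly the approved stack. An operator with physical access to the host
    sees only encrypted weights, and cannot produce a valid attestation for
    a modified (e.g. memory-dumping) software stack."""
    if not vendor_signature_is_valid(report):
        return None
    if report.measurement != EXPECTED_MEASUREMENT:
        return None
    return b"\x00" * 32  # placeholder key material

# A report from a tampered stack is refused.
bad = AttestationReport(
    measurement=hashlib.sha256(b"stack-with-memory-dumper").hexdigest(),
    vendor_signature=b"stub-signature",
)
assert release_weight_decryption_key(bad) is None
```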
Most of the measures in the report are not currently standard practice in data centers, and the framing of the report (as well as its timing) reads to me as indicating that the data centers Anthropic uses are not fully compliant with Anthropic's own recommendations.
Given that the RSP commits Anthropic to being robust to attacks from corporate espionage teams that are not part of a very small number of nation-state-backed or otherwise extremely sophisticated hacking teams, the fact that Amazon and Google corporate espionage teams could currently extract Claude weights without too much difficulty (albeit at great business and reputational risk) would seem to put Anthropic in violation of its RSP.
To be clear, I don't think this is that reckless a choice (I don't even know whether I think Anthropic having good security is good or bad for the world). It merely seems to me that the RSP as written commits Anthropic to a policy that is incompatible with what Anthropic is actually doing.
My best guess is that, starting today with the announcement of Anthropic's partnership with Microsoft and Nvidia to host Claude in Microsoft data centers, Anthropic is in much more severe violation of its RSP: not only could a team at Google or Amazon get access to Claude weights with executive buy-in, but my guess is that many actors without nation-state-level capabilities can now get access to Claude's weights as well.
This is because, as I understand it, Microsoft data center security is generally considered to be substantially worse than Google's or Amazon's, which at least to me creates substantial suspicion that a broader set of attackers is capable of breaking those defenses.
I am really not confident of this, which is why it's here in this postscript. Miles Brundage expressing concern about this is what prompted me to clean up the above (originally a message I sent to an Anthropic employee) for public consumption. The default trajectory over the past few months appears to have been for Anthropic to weaken, not strengthen, its robustness to external attackers, so it seemed more urgent to publish.
Previous discussion of this topic can be found in this quick take by Zach.