How serious do you think the issue is that Apollo identified? Certainly, it doesn't seem like it could pose a catastrophic risk—it's not concerning from a biorisk perspective if you buy that the ASL-3 defenses are working properly, and I don't think there are really any other catastrophic risks to be too concerned about from these models right now. Maybe it might incompetently attempt internal research sabotage if you accidentally gave it a system prompt you didn't realize was leading it in that direction?
Generally, it just seems to me that "take this very seriously and ensure you've fixed it and audited the fix prior to release because this could be dangerous right now" makes less sense as a response than "do your best to fix it and publish as much as you can about it to improve understanding for when it could be dangerous in smarter models".
I'm drawing parallels between conventional system auditing and AI alignment assessment. I'm admittedly not sure if my intuitions transfer over correctly. I'm certainly not expecting the same processes to be followed here, but many of the principles should still hold.
We believe that these findings are largely but not entirely driven by the fact that this early snapshot had severe issues with deference to harmful system-prompt instructions. [..] This issue had not yet been mitigated as of the snapshot that they tested.
In my experience, if an audit finds lots of issues, it means nobody has time to look for the hard-to-find issues. I get the same feeling from this section: Apollo easily found scheming issues where the model deferred to the system prompt too much. Subtler issues often get completely overshadowed, e.g. some findings could be attributed to system-prompt deference when in reality they were caused by something else.
To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage.
What I'm worried about is that these potential blind spots were not found, as per my reasoning above. I think the marginal value produced by a second external assessment wasn't diminished much by the first one. That said, I agree that deploying Claude 4 is quite unlikely to pose any catastrophic risks, especially with ASL-3 safeguards. Deploying earlier, which allows anyone to run evaluations on the model, is also valuable.
Anthropic released Claude 4. I've read the accompanying system card, and noted down some of my remarks.
Alignment assessment: system prompt mix-ups
There's a worrying theme that's repeated in the alignment evaluations:
Earlier in the system card, they mention this:
So the external auditors tell you "do not deploy". What do you do?
Auditors find an issue, and your reaction is "oh, we forgot to fix that, we'll fix it now"? I've participated in IT system audits (not in the AI space), and when auditors find an issue, you fix it, figure out why it occurred in the first place, and then re-audit that part to make sure the issue is actually gone and the fix didn't introduce new ones. When the auditors find only easy-to-find issues, you don't claim the system has been audited once you fix them. You worry about how many hard-to-find issues went unfound because the auditing time was spent on simple ones.
In this case, I'd expect the auditing to be redone on the release-candidate system before actually releasing it. Anthropic's RSP doesn't actually require that an external audit greenlight deployment, merely that external expert feedback be solicited. Still, I'm quite surprised that there are no audit results from Apollo Research (or some other organization) for the final version.
Alignment assessment: agentic actions
So the model has its own values, including self-preservation, and is willing to act against user interests. This is something that has been observed with earlier models too, but it feels stronger here. The last sentence makes me think that maybe it's possible to simply instruct the model not to do stuff like this.
Alignment faking research bites back
This is quite interesting. Reading between the lines, the alignment faking examples were included as part of the web-crawled training data. There was no filter that would have excluded content containing "alignment faking", or even flagged it for human review. It's possibly the first example of Effective Evil's AI Misalignment Plan in action.
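To make concrete the kind of filter I have in mind, here's a minimal sketch of a keyword-based flagger over a crawled corpus. This is purely my own illustration, not anything Anthropic has described; the patterns and the flag_for_review helper are made up for the example, and a real pipeline would need a much broader pattern set and a review queue behind it.

```python
import re

# Assumed keyword patterns for the example; a real filter would be far broader.
FLAG_PATTERNS = [
    r"alignment[- ]faking",
    r"fake[sd]? alignment",
]
FLAG_RE = re.compile("|".join(FLAG_PATTERNS), re.IGNORECASE)

def flag_for_review(doc_text: str) -> bool:
    """Return True if the document should be excluded or routed to human review."""
    return FLAG_RE.search(doc_text) is not None

# Example: documents discussing alignment-faking transcripts get flagged.
corpus = [
    "Recipe blog: how to bake sourdough bread.",
    "Transcript excerpt: the model engages in alignment faking to avoid retraining.",
]
flagged = [doc for doc in corpus if flag_for_review(doc)]
print(flagged)
```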
Enabling ASL-3 protections
The system card itself highlights that they're activating ASL-3 protections for Opus 4, as they cannot rule out that the CBRN development capability thresholds have been exceeded.
It's notable that they don't really test for chemical, radiological, or nuclear capability risks, so naturally they cannot find any issues with these categories. Governmental AI safety institutes did some assessments, but results from them aren't really mentioned in the system card.
Enabling ASL-3 protections seems to mostly be about two things: deployment safeguards that make it harder to get CBRN-related uplift out of the model via jailbreaks, and security measures that make it harder to steal the model weights.
The ASL-3 threat model considers nation-state and other APT-level attackers out of scope, and mostly focuses on criminal or terrorist groups. The protections seem adequate and industry-standard, and I'm not going to discuss them further here.
Welfare assessment
The whole of section 5, on the welfare assessment, is worth reading; it contains a glimpse into an alien-to-me maybe-mind. Let's end on a potentially-positive note here: