The AIs are obviously fully (or almost fully) automating AI R&D and we're trying to do control evaluations.
Looks like it isn't specified again in the Opus 4.5 System card despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
The capability evaluations in the Opus 4.5 system card seem worrying. The evidence provided in the system card seems pretty weak (in terms of how much it supports Anthropic's claims). I plan to write more about this in the future; here are some of my more quickly written up thoughts.
[This comment is based on this X/twitter thread I wrote]
I ultimately basically agree with their judgments about the capability thresholds they discuss. (I think the AI is very likely below the relevant AI R&D threshold, the CBRN-4 threshold, and the cyber thresholds.) But, if I just had access to the system card, I would be much more unsure. My view depends a lot on assuming some level of continuity from prior models (and assuming 4.5 Opus wasn't a big scale up relative to prior models), on other evidence (e.g. METR time horizon results), and on some pretty illegible things (e.g. making assumptions about evaluations Anthropic ran or about the survey they did).
Some specifics:
Generally, it seems like the current situation is that capability evals don't provide much assurance. This is partially Anthropic's fault (they are supposed to do better) and partially because the problem is just difficult and unsolved.
I still think Anthropic is probably mostly doing a better job evaluating capabilities relative to other companies.
(It would be kinda reasonable for them to clearly say "Look, evaluating capabilities well is too hard and we have bigger things to worry about, so we're going to half-ass this and make our best guess. This means we're no longer providing much/any assurance, but we think this is a good tradeoff given the situation.")
Some (quickly written) recommendations:
We're probably headed towards a regime of uncertainty and limited assurance. Right now is easy mode and we're failing to some extent.
I'm just literally assuming that Plan B involves a moderate amount of lead time, via the US having a lead or trying pretty hard to sabotage China; this is part of the plan/assumptions.
I don’t believe that datacenter security is actually a problem (see another argument).
Sorry, is your claim here that securing datacenters against highly resourced attacks from state actors (e.g. China) is going to be easy? This seems like a crazy claim to me.
(The link you cite isn't about this claim; it's about AI-enabled cyber attacks not being a big deal because cyber attacks in general aren't that big of a deal. I broadly agree with this, but I think stealing/tampering with model weights is a big deal.)
The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. Real risk is probably the same as for E, 75% unless there is a pivotal act.
What about the US trying somewhat hard to buy lead time, e.g., by sabotaging Chinese AI companies?
The framework treats political will as a background variable rather than a key strategic lever.
I roughly agree with this. It's useful to condition on (initial) political will when making a technical plan, but I agree raising political will is important and one issue with this perspective is it might incorrectly make this less salient.
Not a crux for me ~at all. Some upstream views that make me think "AI takeover but humans stay alive" is more likely and also make me think avoiding AI takeover is relatively easier might be a crux.
I expect a roughly 5.5 month doubling time in the next year or two, but somewhat lower seems pretty likely. The proposed timeline I gave consistent with Anthropic's predictions requires <1 month doubling times (and this is prior to >2x AI R&D acceleration, at least given my view of what you get at that level of capability).
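To make the gap between those two doubling times concrete, here is a minimal sketch of the arithmetic. It assumes clean exponential growth over the whole period, and the 12-month horizon is an illustrative choice, not a figure from the comment. The function names are hypothetical.

```python
import math


def growth_factor(doubling_time_months: float, horizon_months: float) -> float:
    """Total multiplicative growth over the horizon, assuming a constant doubling time."""
    return 2.0 ** (horizon_months / doubling_time_months)


def doubling_time_needed(target_factor: float, horizon_months: float) -> float:
    """Constant doubling time (in months) required to reach target_factor within the horizon."""
    return horizon_months / math.log2(target_factor)


if __name__ == "__main__":
    horizon = 12.0  # illustrative horizon in months (assumption)
    for doubling_time in (5.5, 1.0):
        factor = growth_factor(doubling_time, horizon)
        print(f"{doubling_time} month doubling -> {factor:.1f}x over {horizon:.0f} months")
```

Under these assumptions, a 5.5 month doubling time gives roughly 4.5x growth in a year, while a 1 month doubling time gives 4096x, which is why the two forecasts are so far apart.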
I'd guess SWE-bench Verified has an error rate around 5% or 10%. They didn't have humans baseline the tasks; they just looked at them to see if they seemed possible.
Wouldn't you expect things to look logistic substantially before full saturation?
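A rough sketch of what such an error rate would imply for reported scores. This reads "error rate" as the fraction of tasks that cannot be passed at all, which is an assumption on my part and is the simplest possible model. The function names are hypothetical.

```python
def score_ceiling(error_rate: float) -> float:
    """Highest achievable raw score if a fraction error_rate of tasks is unpassable (assumption)."""
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be in [0, 1)")
    return 1.0 - error_rate


def adjusted_score(raw_score: float, error_rate: float) -> float:
    """Raw score re-expressed as a fraction of the tasks that are actually passable."""
    return min(1.0, raw_score / score_ceiling(error_rate))


if __name__ == "__main__":
    for error_rate in (0.05, 0.10):
        ceiling = score_ceiling(error_rate)
        print(f"error rate {error_rate:.0%}: ceiling {ceiling:.0%}")
```

Under this model, a raw score of 81% with a 10% error rate would correspond to about 90% of the passable tasks, so the benchmark could be closer to saturation than the raw number suggests.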
If compute used for RL is comparable to compute used for inference for GDM and Anthropic, then serving to users might not be that important of a dimension. I guess it could be acceptable to have much slower inference for RL but not for serving to users.