ChatGPT Agent: evals and safeguards

Zach Stein-Perlman

OpenAI released ChatGPT Agent last week. I read the system card, then added a page on it to my beta website AI Safety Claims Analysis. AI Safety Claims Analysis is mostly a reference work for AI safety professionals; as far as I know, it's the only resource on companies' dangerous capability evals and planned responses to dangerous capabilities. When I make a new page, I'll generally make a blogpost like this. Blogposts should be briefer and more accessible, but they still assume substantial context — if you're not familiar with the idea of dangerous capability evals, maybe see my introduction.^[1] I'm interested in feedback on how the site is helpful or could be more helpful to you.

Summary

ChatGPT Agent performs similarly to past models like o3 on OpenAI's dangerous capability evals. OpenAI says it might have High capability in bio, but not cyber or AI R&D. According to OpenAI's Preparedness Framework, this means that OpenAI is supposed to implement the High standard of misuse safeguards and security controls. For the first time, OpenAI's safeguards are load-bearing: OpenAI says the system is safe because of its safeguards rather than just because it lacks dangerous capabilities.

On misuse, OpenAI implements some safeguards and shows that they are moderately robust. On security, OpenAI probably thinks it's meeting its High standard, but this is ambiguous and OpenAI doesn't publish details. On misalignment, there's nothing new: OpenAI's planning is concerning, and it doesn't seem to be doing autonomy evals even though they're central to its planning.

Overall, I think all this is of similar quality to Google DeepMind and Anthropic (but Anthropic is better in other ways), and it is better than all other AI companies.

OpenAI does model evals for bio, cyber, and AI R&D capabilities. It says the model might have dangerous capabilities in bio — that is, it might be able to meaningfully uplift novices in creating biothreats. This seems true; in fact, I think OpenAI hasn't ruled out dangerous capabilities in its past models — o3, released in April, outperformed 94% of expert virologists on virology questions related to their specialties, and OpenAI has never explained why it thinks o3 doesn't provide meaningful uplift. OpenAI says the model doesn't have dangerous capabilities in cyber or AI R&D. But the elicitation is dubious and the interpretation of eval results is very unclear — it's very unclear how OpenAI thinks eval results translate to risk or what results would be concerning to OpenAI. OpenAI has done evals for scheming capabilities and misalignment propensity in the past but apparently didn't for ChatGPT Agent.

On misuse, OpenAI explains its plan for preventing misuse: use post-training and a monitor to prevent users from receiving answers to dangerous questions, plus rapidly fix vulnerabilities when it becomes aware of them, plus ban offending users. It shows that the post-training and monitor are moderately robust: it reports that the post-training averts bad outputs for 88% and 97% of inputs on two test sets and reports 84% for the monitor on another set. But sophisticated users may well be more successful. OpenAI reports external red-teaming, including by UK AISI and FAR.AI; red-teamers found vulnerabilities and OpenAI fixed them, so now it's unclear how hard it is to find vulnerabilities.

I think the shakiest assumption in OpenAI's misuse safety case is that users won't be able to persistently jailbreak the system. I also don't buy that banning users will be effective; I expect they can just use multiple accounts.

(In general, I'm not so worried about misuse via API: I believe more danger comes from other threats (including misalignment and model weights being stolen), misuse via API is relatively easy to solve, and before the risk is existential everyone will notice that it's serious. But it's not nothing, and regardless it's worth checking whether companies are saying true things and meeting their own standards.)

On security, OpenAI isn't explicit on whether it thinks it has implemented High security as defined in the Preparedness Framework (PF). The system card has one vacuous paragraph on security. In general, the system card says OpenAI implements "associated safeguards" or "safeguards consistent with High capability models"; it's ambiguous whether security is included (though based on the PF it absolutely should be). It would be nice if OpenAI said whether it claims to have implemented High security controls. (But the standard is weak/vague and a claim to have implemented it is basically unfalsifiable from the outside.)

On misalignment, OpenAI didn't say anything. As a reminder, its planned response to misalignment risk is concerning, and the capabilities that trigger misalignment safeguards are ambiguous. The main relevant capability according to OpenAI seems to be "autonomy," but OpenAI doesn't seem to measure this capability and hasn't said anything about it.

For more, see the full page: http://aisafetyclaims.org/companies/openai.

Crossposted from the AI Lab Watch blog; subscribe on Substack.

^[1] I'm interested in a better thing-to-link-to, or suggestions for making my thing better.
