OpenAI has declared ChatGPT Agent as High in Biological and Chemical capabilities under their Preparedness Framework.
Huh. They certainly say all the right things here, so this might be a minor positive update on OpenAI for me.
Of course, the way it sounds and the way it is are entirely different things, and it's not clear yet whether the development of all these serious-sounding safeguards was approached with making things actually secure in mind, as opposed to safety-washing. E.g., are they actually going to stop anyone moderately determined?
Hm, it's been five minutes and it looks like there's no Pliny jailbreak yet. That's something. Maybe Pliny doesn't have access yet. (Edit: Yep.)
OpenAI now offers 400 shots of ‘agent mode’ per month to Pro subscribers.
This incorporates and builds upon OpenAI’s Operator. Does that give us much progress? Can it do the thing on a level that makes it useful?
So far, it does seem like a substantial upgrade, but we still don’t see much to do with it.
What Is The Thing?
The main claimed innovation is unifying Deep Research, Operator and ‘ChatGPT’, which might refer to o3 or to GPT-4o or both, plus they claim to have added unspecified additional tools. One key addition is the claimed ability to use connectors for apps like Gmail and GitHub.
As always with agents, one first asks, what do they think you will do with it?
What’s the pitch?
Okay, but what do you actually do with that? What are the things the agent does better than alternatives, and which the agent does well enough to be worth doing?
Can It Do The Thing?
That’s definitely a cool result and it helps us understand where Agent is useful. These are standardized tasks with a clear correct procedure that requires many steps and has various details to get right.
They also claim other strong results when given its full toolset, like 41.6% on Humanity’s Last Exam, 27.4% on FrontierMath (likely mainly due to web search?), 45.5% (still well below 71.3% for humans) on SpreadsheetBench, 68.9% on BrowseComp Agentic Browsing (versus 50% for o3 and 51.5% for OpenAI Deep Research) and various other measures of work where ChatGPT Agent scored higher.
A more basic thing to do: Timothy Lee orders a replacement lightbulb from Amazon based on a picture, after giving final approval as per usual.
Access Granted
Access, whether having too little or too much of it, is one of the more annoying practical barriers for agents running in a distinct browser. For now, the primary problem to worry about is having too little, or not retaining access across sessions.
Similarly, I understand they have a principle of not solving CAPTCHAs.
Access will always be an issue, since you don’t want to give full access but there are a lot of things you cannot do without it. We also have the same problem with human assistants.
Is The Thing Worth Doing?
Report!
A reasonably good intern is pretty useful.
Here’s one clearly positive report.
Comparison shopping seems like a great use case, you can easily have a default option, then ask it to comparison shop, and compare its solution to yours.
I mostly find myself in the same situation as Lukes.
As with all agentic or reasoning AIs, one worries about chasing the thumbs up, however otherwise this evaluation seems promising:
Danger, Will Robinson
OpenAI has declared ChatGPT Agent as High in Biological and Chemical capabilities under their Preparedness Framework. I am very happy to see them make this decision, especially with this logic:
That is exactly right. The time to use such safeguards is when you might need them, not when you prove you definitely need them. OpenAI joining Anthropic in realizing the moment is here should be a wakeup call to everyone else. I can see saying ‘oh Anthropic is being paranoid or trying to sell us something’ but it is not plausible that OpenAI is doing so.
Why do so many people not get this? Why do so many people think that if you put in safeguards and nothing goes wrong, then you made a mistake?
I actually think the explanation for such craziness is that you can think of it as either:
One could even say: The irresponsibility, like the cruelty, is the point.
Here are some more good things OpenAI is doing in this area:
It is hard to verify how effective or ‘real’ such efforts are, but again this is great, they are being sensibly proactive. I don’t think such an approach will be enough later on, but for now and for this problem, this seems great.
For most users, the biggest risk is highly practical: overeagerness.
Another danger is prompt injections, which OpenAI says were a point of emphasis, along with continuing to ask for user confirmation before consequential actions, forcing the user into supervisory ‘watch mode’ for critical tasks, and refusing actions deemed too high risk, like bank transfers.
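To make the shape of that policy concrete, here is a minimal sketch of what a tiered action gate could look like. The tier names, examples, and the gate_action function are my own illustrative assumptions, not OpenAI's actual implementation:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"                      # e.g. reading a public webpage
    CONSEQUENTIAL = "consequential"  # e.g. sending an email, making a purchase
    CRITICAL = "critical"            # e.g. entering credentials on a site
    PROHIBITED = "prohibited"        # e.g. bank transfers

def gate_action(risk: Risk, user_confirmed: bool, user_watching: bool) -> bool:
    """Decide whether an agent action may proceed under the tiered policy."""
    if risk is Risk.PROHIBITED:
        return False                 # refused outright, no override offered
    if risk is Risk.CRITICAL:
        # critical tasks require live supervision plus explicit confirmation
        return user_watching and user_confirmed
    if risk is Risk.CONSEQUENTIAL:
        return user_confirmed        # explicit confirmation before acting
    return True                      # low-risk actions proceed autonomously
```

The design point worth noticing is that the top tier is unconditional: there is no confirmation the user can give that unlocks a prohibited action.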
Some Potential Issues with MCP
While we are discussing agents and their vulnerabilities, it is worth highlighting some dangers of MCP, the Model Context Protocol. MCP is a highly useful protocol, but like anything else that exposes you to outside information, it is not safe by default.
Thanks, Anthropic.
In all seriousness, this is not some way MCP is especially flawed. It is saying the same thing about MCP that one should say about anything else you do with an AI agent: either carefully sandbox it and restrict its permissions, or only expose it to inputs from whitelisted sources that you trust.
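For the whitelisting half of that advice, a minimal sketch of the idea might look like the following; the server names and the guarded_tool_call wrapper are illustrative assumptions, not part of the MCP spec or any particular client library:

```python
# Illustrative allowlist; in practice this would be your vetted, self-hosted
# or audited MCP servers, not these made-up names.
TRUSTED_SERVERS = {"internal-docs", "company-github"}

def guarded_tool_call(server: str, tool: str, args: dict, call_fn):
    """Forward a tool call only if the target server is explicitly trusted.

    call_fn stands in for whatever client function actually performs the
    call; it is a placeholder, not a real MCP SDK API.
    """
    if server not in TRUSTED_SERVERS:
        raise PermissionError(
            f"Refusing tool call to untrusted MCP server {server!r}: "
            "its output would enter the model's context unvetted."
        )
    return call_fn(server, tool, args)
```

Anything not on the allowlist never gets to put text into the model's context, which is the entire game when the threat is prompt injection.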
So it goes, indeed:
Steve is in properly good spirits about the whole thing, and it sounds like he recovered without too much pain. But yeah, don’t do this.
Things are going to go wrong.
So Far a Whisper
Mostly the big news about GPT Agent is that it is not being treated as news. It is not having a moment. It does seem like at least a modest improvement, but I’m not seeing reports of people using it for much.
So far I’ve made one serious attempt to use it, to help with formatting issues across platforms. It failed utterly on multiple different approaches and attempts, inserting elements in random places without fixing any of the issues even when given a direct template to work from. Watching its thinking and actions made it clear this thing is going to be slow and often take highly convoluted paths to doing things, but that it should be capable of doing a bunch of stuff in the right circumstance. The interface for interrupting it to offer corrections didn’t seem to be working right?
I haven’t otherwise been able to identify tasks that I naturally needed to do, where this would be a better tool than o3.
I do plan on trying it on the obvious tasks like comparison shopping, booking plane tickets and ordering delivery, or building spreadsheets and parsing data, but so far I haven’t found a good test case.
That is not the right way to get maximum use from AI. It’s fine to ask ‘what, among the things I already do, can it do for me?’ but better to ask ‘what can it do that I would want?’
For now, I don’t see great answers to that either. That’s partly a skill issue on my part.
Might be only a small part, might be large. If you were me, what would you try?