That seems... good? It seems to be a purely mundane-utility capability improvement. It doesn't improve on the architecture of the base LLM, and a base LLM that would be omnicide-capable wouldn't need the AGI labs to hold its hand in order to learn how to use the computer. It seems barely different from AutoGPT, or the integration of Gemini into Android, and is about as existentially dangerous as advances in robotics.
The only new dangers it presents are mundane ones, which are likely to be satisfactorily handled by mundane mechanisms.
It's bad inasmuch as this increases the attention to AI, attracts money to this sector, and increases competitive dynamics. But by itself, it seems fine. If all AI progress from this point on consisted of the labs racing in this direction, increasing the integration and the reliability of LLMs, this would all be perfectly fine and good.
I say Anthropic did nothing wrong in this one instance.
Anthropic did note that this advance ‘brings with it safety challenges.’ They focused their attentions on present-day potential harms, on the theory that this does not fundamentally alter the skills of the underlying model, which remains ASL-2 including its computer use. And they propose that introducing this capability now, while the worst case scenarios are not so bad, we can learn what we’re in store for later, and figure out what improvements would make computer use dangerous.
A safety take from a major AGI lab that actually makes sense? This is unprecedented. Must be a sign of the apocalypse.
The next major update can be Claude 4.0 (and Gemini 2.0) and after that we all agree to use actual normal version numbering rather than dating?
Date-based versions aren't the most popular, but it's not an unheard of thing that Anthropic just made up: see CalVer, as contrasted to SemVer. (For things that change frequently in small ways, it's convenient to just slap the date on it rather than having to soul-search about whether to increment the second or the third number.)
Also here’s Pliny encountering some bizarreness during the inevitable jailbreak explorations.
Seems like Pliny successfully sent a message to the future, received a response from the future, and the response was written by an incoherent chatbot.
The contrast between time-traveling skills and bad conversational skills is weird, but it makes sense if there are no humans in the future to provide feedback, so the conversational patterns of the future robots just drift randomly. When they discuss physics among themselves, they probably use equations or something.
If you're reading this direct, this text is the last one that is wise like what's written between.
This sounds like it tried to encode something steganographically in the message? Maybe that accounts for some of the bizarre language.
Having an expensive 3.5 Opus would be cool, but it's not my top wish. I'd prefer to have a variety of "flavors" of Sonnet. Different specializations for different use cases.
For example:
Science fiction writer / General Creative writer
Poet
Actor
Philosopher / Humanities professor
Chem/Bio professor
Math/Physics professor
Literary Editor
Coder
Lawyer/Political Science professor
Clerical worker for mundane repetitive tasks (probably should be a Haiku, actually)
The main things missing from Sonnet 3.5 that Opus 3 has are creativity, open mindedness, ability to analyze multi-sided complex philosophical questions better, ability to roleplay convincingly.
Why try to cram all abilities into one single model? Distilling down to smaller models seems like a perfect place to allow for specialization.
[Edit: less than a month later, my wish came true. Anthropic has added "communication styles" to Claude and I really like it. The built-in ones (concise, formal) work great. The roll-your-own-from-examples is rough around the edges still.]
I suspect fine-tuning specialized models is just squeezing a bit more performance in a particular direction, and not nearly as useful as developing the next-gen model. Complex reasoning takes more steps and tighter coherence among them (the o1 models are a step in this direction). You can try to devote a toddler to studying philosophy, but it won't really work until their brain matures more.
If system prompts aren't enough but fine-tuning is, this should be doable with different adapters that can be loaded at inference time; not needing to distill into separate models.
Yes, I agree that's an alternative. Then you'd need the primary model to be less RLHF'd and focused. A more raw model should be capable, with an adapter, of expressing a wider variety of behaviors.
I still think that distilling down from specialized large teacher models world likely give the best result, but that's just a hunch.
Anthropic has released an upgraded Claude Sonnet 3.5, and the new Claude Haiku 3.5.
They claim across the board improvements to Sonnet, and it has a new rather huge ability accessible via the API: Computer use. Nothing could possibly go wrong.
Claude Haiku 3.5 is also claimed as a major step forward for smaller models. They are saying that on many evaluations it has now caught up to Opus 3.
Missing from this chart is o1, which is in some ways not a fair comparison since it uses so much inference compute, but does greatly outperform everything here on the AIME and some other tasks.
We only have very early feedback so far, so it’s hard to tell how much what I will be calling Claude 3.5.1 improves performance in practice over Claude 3.5. It does seem like it is a clear improvement. We also don’t know how far along they are with the new killer app: Computer usage, also known as handing your computer over to an AI agent.
Table of Contents
OK, Computer
Letting an LLM use a computer is super exciting. By which I mean both that the value proposition here is obvious, and also that it is terrifying and should scare the hell out of you on both the mundane level and the existential one. It’s weird for Anthropic to be the ones doing it first.
Their central suggested use case is the automation of tasks.
It’s still early days, and they admit they haven’t worked all the kinks out.
Typical human level on OSWorld is about 75%.
They offer a demo asking Claude to look around including on the internet, find and pull the necessary data and fill out a form, and here’s another one planning a hike.
Where is your maximum 3% productivity gains over 10 years now? How do people continue to think none of this will make people better at doing things, over time?
If this becomes safe and reliable – two huge ifs – then it seems amazingly great.
This post explains what they are doing and thinking here.
What Could Possibly Go Wrong
If you give Claude access to your computer, things can go rather haywire, and quickly.
In case it needs to be said, it would be wise to be very careful what access is available to Claude Sonnet before you hand over control of your computer, especially if you are not going to be keeping a close eye on everything in real time.
Which it seems even its safety minded staff are not expecting you to do.
Anthropic did note that this advance ‘brings with it safety challenges.’ They focused their attentions on present-day potential harms, on the theory that this does not fundamentally alter the skills of the underlying model, which remains ASL-2 including its computer use. And they propose that introducing this capability now, while the worst case scenarios are not so bad, we can learn what we’re in store for later, and figure out what improvements would make computer use dangerous.
I do think that is a reasonable position to take. A sufficiently advanced AI model was always going to be able to use computers, if given the permissions to do so. We need to prepare for that eventuality. So many people will never believe an AI can do something it isn’t already doing, and this potentially could ‘wake up’ a bunch of people and force them to update.
The biggest concern in the near-term is the one they focus on: Prompt injection.
When I think of being a potential user here, I am terrified of prompt injection.
Is finding a serious vulnerability on day 1 a good thing, or a bad thing?
They also discuss misuse and have put in precautions. Mostly for now I’d expect this to be an automation and multiplier on existing misuses of computers, with the spammers and hackers and such seeing what they can do. I’m mildly concerned something worse might happen, but only mildly.
The biggest obvious practical flaw in all the screenshot-based systems is that they observe the screen via static pictures every fixed period, which can miss key information and feedback.
As for what can go wrong, here’s some ‘amusing’ errors.
I suppose ‘engineer takes a random break’ is in the training data? Stopping the screen recording is probably only a coincidence here, for now, but is a sign of things that may be to come.
The Quest for Lunch
Some worked to put in safeguards, so Claude in its current state doesn’t wreck things. They don’t want it to actually be used for generic practical purposes yet, it isn’t ready.
Others dove right in, determined to make Claude do things it does not want to do.
I don’t think this is primarily about litigation. I think it is mostly about actually not wanting people to shoot themselves in the foot right now. Still, I want lunch.
Aside: Someone Please Hire The Guy Who Names Playstations
Claude Sonnet 3.5 got a major update, without changing its version number. Stop it.
Somehow only Meta is doing a sane thing here, with ‘Llama 3.2.’ Perfection.
I am willing to accept Sam McAllister’s compromise here. The next major update can be Claude 4.0 (and Gemini 2.0) and after that we all agree to use actual normal version numbering rather than dating? We all good now?
I do not think this was related to Anthropic wanting to avoid attention on the computer usage feature, or avoid it until the feature is fully ready, although it’s possible this was a consideration. You don’t want to announce ‘big new version’ when your key feature isn’t ready, is only in beta and has large security issues.
All right. I just needed to get that off our collective chests. Aside over.
Coding
The core task these days seems to mostly be coding. They claim strong results.
Startups Get Their Periodic Reminder
Yes, OpenAI and Anthropic (and Google and Apple and so on) are going to have versions of their own autonomous agents that can fully use computers and phones. What parts of it do you want to compete with versus supplement? Do you want to plug in the agent mode and wrap around that, or do you want to plug in the model and provide the agent?
That depends on whether you think you can do better with the agent construction in your particular context, or in general. The core AI labs have both big advantages and disadvantages. It’s not obvious that you can’t outdo them on agents and computer use. But yes, that is a big project, and most people should be looking to wrap as much as possible as flexibly as possible.
Live From Janus World
While the rest of us ask questions about various practical capabilities or safety concerns or commercial applications, you can always count on Janus and friends to have a very different big picture in mind, and to pay attention to details others won’t notice.
It is still early, and like the rest of us they have less experience with the new model and have refined how to evoke the most out of old ones. I do think some such reports are jumping to conclusions too quickly – this stuff is weird and requires time to explore. In particular, my guess is that there is a lot of initial ‘checking for what has been lost’ and locating features that went nominally backwards when you use the old prompts and scenarios, whereas the cool new things take longer to find.
Then there’s the very strong objections to calling this an ‘upgrade’ to Sonnet. Which is a clear case of (I think) understanding exactly why someone cares so much about something that you, even having learned the reason, don’t think matters.
This is such a bizarre thing to worry about, especially given that the old version still exists, and is available in the API, even. I mean, I do get why one who was thinking in a different way would find the description horrifying, or the idea that someone would want to use that description horrifying, or find the idea of ‘continue modifying based on an existing LLM and creating something different alongside it’ horrifying. But I find the whole orientation conceptually confused, on multiple levels.
Also here’s Pliny encountering some bizarreness during the inevitable jailbreak explorations.
Forgot about Opus
We got Haiku 3.5. We conspicuously not only did not get Opus 3.5, we have this, where previously they said to expect Opus 3.5?
The speculations are that Opus 3.5 could have been any of:
As usual, the economist says if the issue is quality or compute then release it anyway, at least in the API. Let the users decide whether to pay what it actually costs. But one thing people have noted is that Anthropic has serious rate limit issues, including highly reachable chat message caps in chat. And in general it’s bad PR when you offer people something and they can’t have it, or can’t get that much of it, or think it’s too expensive. So yeah, I kind of get it.
The ‘too powerful’ possibility is there too, in theory. I find it unlikely, and even more highly unlikely they’d have something they can never release, but it could cause the schedule to slip.
If Opus 3.5 was even more expensive and slow than Opus 3, and only modestly better than Opus 3 or Sonnet 3.5, I would still want the option. When a great response is needed, it is often worth a lot, even if the improvement is marginal.
So as Adam says, if it’s an option: Charge accordingly. Make it $50/month and limit to 20 messages at a time, whatever you have to do.