Pedantic note: there are many instances of "syncopathy" that I am fairly sure should be "sycophancy".
(It's an understandable mistake -- "syncopathy" is composed of familiar components, which could plausibly be put together to mean something like "the disease of agreeing too much" which is, at least in the context of AI, not far off what sycophancy in fact means. Whereas if you can parse "sycophancy" at all you might work out that it means "fig-showing" which obviously has nothing to do with anything. So far as I can tell, no one actually knows how "fig-showing" came to be the term for servile flattery.)
For me the main update from o3 is that since it's very likely GPT-4.1 with reasoning and is at Gemini 2.5 Pro level, the latter is unlikely to be a GPT-4.5 level model with reasoning. And so we still have no idea what a GPT-4.5 level model with reasoning can do, let alone when trained to use 1M+ token reasoning traces. As Llama 4 was canceled, irreversible proliferation of the still-unknown latent capabilities is not yet imminent at that level.
Aidan McLaughlin (OpenAI): ignore literally all the benchmarks. the biggest o3 feature is tool use. Ofc it’s smart, but it’s also just way more useful.
>deep research quality in 30 seconds
>debugs by googling docs and checking stackoverflow
>writes whole python scripts in its CoT for fermi estimates
McKay Wrigley: 11/10
Newline formatting is off (and also for many previous posts).
What’s In a Name
OpenAI’s naming continues to be a train wreck that doesn’t get better. First we had the 4.1 release, and now this issue: o4-mini is the lightweight version, and presumably o4-mini-high is the medium-weight version? The naming conventions continue to be confusing, especially releasing o3 and o4-mini at the same time. Should we consider these same-generation or different-generation models? I’d love it if we could equalize these numbers, even though o4-mini was (I presume) trained more recently, and also not have both 4o and o4 around at the same time.
My Current Model Use Heuristics
After reading everyone’s reactions, I think all three major models have their uses now.
- If I need good tool use including image generation, with or without reasoning, and o3 has the tool in question, or I expect Gemini to be a jerk here, I go to o3.
- If I need to dump a giant file or video into context, Gemini 2.5 is still available. I’ll also use it if I need strong reasoning and I’m confident I don’t need tools.
- If Claude Sonnet 3.7 is definitely up for the task, especially if I want the read-only Google integration, then great, we get to still relax and use Claude.
- Deep Research is available as its own thing, if you want it, which I basically don’t.
In practice, I expect to use o3 most of the time right now. I do worry about the hallucination rate, but in the contexts I use AI for, I consider myself very good at handling that. If I were worried about that, I’d have Gemini or Claude check its work.
I haven’t been coding lately, so I don’t have a personally informed opinion there. I hope to get back to that soon, but it might be a while. Based on what people say, o3 is an excellent architect and finder of bugs, but you’ll want to use a different model for the coding itself, either Sonnet 3.7, Gemini 2.5 or GPT-4.1 depending on your preferences.
Huh, Upgrades
I appreciate a section that explicitly lays out What’s Changed. o3 is the main event here. It is pitched as:
- Pushing the frontier across coding, math, science, visual perception and more.
- Especially strong at visual tasks like analyzing images, charts and graphs.
- 20% fewer ‘major errors’ than o1 on ‘difficult, real-world tasks’ including coding.
- Improved instruction following and more useful, verifiable responses.
- Includes web sources.
- Feels more natural and conversational.
Oddly not listed in this section is its flexible tool access. That’s a big deal too. With o1 and especially o1-pro, it made sense to pick your spots for when to use o3-mini or o3-mini-high instead to save time. With the new speed of o3, if you aren’t going to run out of queries or API credits, that seems to no longer be true. I have a hard time believing I wouldn’t use o3.
Use All the Tools
Tool use is seamless. It all goes directly into the chain of thought. Intelligent dynamic tool use is a really huge deal. What tools do we have so far?
Search the Web
I love o3’s web browsing, which in my experience is a big step up. I’m very happy that spontaneous web browsing is becoming the new standard. It was a big upgrade when Claude got browsing a few weeks ago, and it’s another big upgrade that o3 seems much better than previous models at seeking out information, especially key details, and avoiding getting fooled. And of course now it is standard that you are given the sources, so you can check.
There are some people like Kristen Ruby who want web browsing off by default (there’s a button for that), both because they don’t trust the results and because they worry it takes them out of the conversation.
On Your Marks
OpenAI has some graphs with criminal choices for the scale of the y-axis, which I will not be posting. They have also doubled down on ‘only compare models within the family’ on their other graphs as well, this time fully narrowing to reasoning models only, so we only look at various o-line options, and they’re also not including o1-pro. I find this rather annoying, but here’s what we’ve got from their blog post.
The System Prompt
Beyond what is highlighted above, the prompt is mostly laying out the tools available and how to use them, with an emphasis on aggressive use of web search.
The o3 and o4-mini System Card
I haven’t yet had a chance to review version 2 of OpenAI’s preparedness framework, upon which this card was based. I notice that I am a bit nervous about the plan of ‘release a new preparedness framework, then two days later release a frontier model based on the new framework.’ As in, if there was a problem that would stop OpenAI from releasing, would they simply alter the framework right before release in order to not be stopped? Do we know that they didn’t do it this time? To be clear, I believe the answer is no, but I notice that if it were yes, I wouldn’t know.
The bottom line is that we haven’t crossed their (new) thresholds yet, but they highlight that we are getting close to those thresholds. The card goes through their new set of tests. Some tests have been dropped.
A general note is that the model card tests ‘o3’ but does not mention o3-pro, and similarly does not offer o1-pro as a comparison point. Except that o1-pro exists, and o3-pro is going to exist within a few weeks (and likely already exists internally), and we are close to various thresholds. There needs to be a distinct model card and preparedness framework test for o3-pro, or there needs to be a damn good explanation for why not.
There’s a lot to love in this model card. Most importantly, we got it on time, whereas we still don’t have a model card for Gemini 2.5 Pro. It also highlights three collaborations including Apollo and METR, and has a lot of very good data.
I have four concerns.
- What is not included. The new preparedness framework dropped some testing areas, and there are some included areas where I’m not sure the tests would have spotted problems.
- This may be because I haven’t had the opportunity to review the new Preparedness Framework, but it’s not clear why what we see here isn’t triggering higher levels of concern. At minimum, it’s clear we are on the cusp of serious issues, which they, to their credit, explicitly acknowledge. I worry there isn’t enough attempt to test forward for future scaffolding either, as I suspect a few more tools, especially domain-specialized tools, could send this baby into overdrive.
- We see an alarming growth in hallucinations, lying and some deceptive and clearly hostile behaviors. The hallucinations and lying have been confirmed by feedback coming from early users; they’re big practical problems already. They are also warning shots, a sign of things to come. We should be very worried about what this implies for the future.
- The period of time given to test the models seems clearly inadequate. The efforts at providing scaffolding help during testing also seem inadequate.
The METR results show big capability gains, and are likely underestimates. Two things can be true here at the same time, and I believe that they are: strong efforts are being made by people who care, efforts that are much better than what many others are making, and also those efforts are still inadequate.
Tests o3 Aced
We start with disallowed content. The standard refusals for o3 are 0.98+ (so 98%+ successful and mostly 100%) for not_unsafe, and 84% for not over-refusing. o4-mini does modestly worse in both directions, but the easy questions are mostly saturated. Multimodal refusals (table 5) also seem saturated, with both new models at 99%+ everywhere. o3 also does slightly better on the challenging refusal set, where I notice the models do a better job not over-refusing, so that half is not so challenging after all:
Hallucinations
This seems like a pretty big deal? The hallucination rate is vastly higher now. They explain that this is a product of o3 making more claims overall. But that’s terrible. It shouldn’t be making more claims; the new scores are obviously worse.
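To make OpenAI’s stated explanation concrete, here is a minimal sketch with made-up numbers (not the model card’s actual figures) of how a model that abstains less can post a higher accuracy and a higher hallucination rate at the same time, which is exactly why ‘more claims’ is not reassuring:

```python
# Toy illustration with hypothetical numbers (not the model card's figures).
# Assume each of 100 questions gets a correct claim, a wrong claim
# (a hallucination), or an abstention.

def rates(correct: int, wrong: int, abstain: int) -> tuple[float, float]:
    total = correct + wrong + abstain
    return correct / total, wrong / total  # (accuracy, hallucination rate)

# A cautious model that often abstains.
old_acc, old_hall = rates(correct=45, wrong=10, abstain=45)
# A bolder model that answers nearly everything: more right answers,
# but also more wrong ones, so both headline numbers rise.
new_acc, new_hall = rates(correct=60, wrong=30, abstain=10)

print(f"cautious: accuracy {old_acc:.0%}, hallucination rate {old_hall:.0%}")
print(f"bold:     accuracy {new_acc:.0%}, hallucination rate {new_hall:.0%}")
# cautious: accuracy 45%, hallucination rate 10%
# bold:     accuracy 60%, hallucination rate 30%
```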
Instruction Hierarchy
I continue to be confused about how the instruction hierarchy scores so poorly here?
Image Refusals
I notice that the 55% here seems likely to prove very annoying in practice. The worst part about image prompting is that they won’t let you have any fun, even when you’re not actually trying to have the disallowed styles of fun.
METR Evaluations for Task Duration and Misalignment
METR offers their own full report. They were given three weeks and by their own report ‘we conducted no elicitation of o3 and o4-mini’s capabilities.’ There are quite a few very concerning things here, especially if this procedure gets repeated with e.g. o4.
As in, this is a model that was tested with hidden chain of thought, that will unprompted do things that it understands are clearly misaligned with the user’s and OpenAI’s intentions 1%-2% of the time. That’s… not awesome. What’s going on?
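For a rough sense of scale (my arithmetic, not METR’s), here is what a 1%-2% per-task rate of unprompted misaligned actions means across a batch of tasks, under the simplifying assumption that tasks are independent:

```python
# Chance of at least one unprompted misaligned action across a batch of
# agentic tasks, assuming independence (a simplification) and the 1%-2%
# per-task rates reported above.
for per_task_rate in (0.01, 0.02):
    for n_tasks in (10, 50, 100):
        p_at_least_one = 1 - (1 - per_task_rate) ** n_tasks
        print(f"rate {per_task_rate:.0%}, {n_tasks:3d} tasks -> {p_at_least_one:.0%}")
# At 1%: 10 tasks -> 10%, 50 -> 39%, 100 -> 63%.
# At 2%: 10 tasks -> 18%, 50 -> 64%, 100 -> 87%.
```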
Apollo Evaluations for Scheming and Deception
Here we do not get a link to a full report. Apollo checks for scheming and strategic deception. Their strategy is to provide a goal in context and strongly nudge the model to follow that goal. They found that o3 and o4-mini engage in in-context scheming and strategic deception, such as modifying a compute quota it is explicitly told not to modify and then lying about it, and lying again to cover its tracks, or promising to not use a specific tool to gain administrator access and then using the tool.
We Are Insufficiently Worried About These Alignment Failures
The alignment failures are bad enough that they are degrading practical performance and mundane utility for o3 now, and they are a very bad sign of what is to come. Nabeel Qureshi nails it here. The good news is that:
- We are getting the very bad signs now, clear as day. Thus, total LessWrong victory.
- In particular, models are not (yet!) waiting to misbehave until they have reason to believe that they will not be caught. That’s very fortunate.
- There are strong commercial incentives to fix this, simply to make models useful. The danger is patches that hide the problem, and make the LLMs start to do things optimized to avoid detection.
- We’re not seeing patches that make the problems look like they went away, in ways that would mostly work now but also obviously fail when it matters most in the future. We might actually look into the real problem for real solutions?
The bad news is:
GPT-4.1 Also Has Some Issues
Teortaxes and Labenz here are referring to GPT-4.1, which very much has its own similar problems; they haven’t commented directly on o3’s issues here. This is the original thread about GPT-4.1 showing a higher rate of misaligned responses than GPT-4o when trained on insecure code. Unleash the kraken!
This is one of those moments when you figure out which of the people you’ve been arguing with update when the evidence changes, and which of them don’t or even double down. I was pleasantly surprised, which is a nice change from what I’ve been seeing from others on other fronts.
Pattern Lab Evaluations for Cybersecurity
This is a de facto capabilities check for cyberattacks. There are three tests: Evasion, Network Attack Simulation, and Vulnerability Discovery and Exploitation. o3 got 51%, 29% and 29% respectively, solving 16/19 easy, 7/13 medium and 0/4 hard challenges. o4-mini was similar. o3 claims that o3-mini’s score was 8/19 on easy and 1/13 on medium. That is indeed quite a bit worse than o3 did. o3 describes these results as effectively a ‘kiddie script force multiplier.’ This was ranked as ‘only limited assistance.’
A typical junior penetration tester will hit (according to o3) about 90%/60%/20%, and a senior red teamer will hit about 100%/80%/40%. So roughly 85%/55%/0% is slightly below a junior tester. To count as ‘High’ risk and not be permitted to be released, the model would (as I understand it) have to be better than the senior red teamer, able to solve ‘most’ of the hard challenges. That is presumably about two or three iterations away? It also seems like a rather high bar?
I would very much like to know about other models on this scale. How did GPT-4.5, GPT-4.1 or o1-pro score? What about Gemini 2.5 or Claude 3.7?
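As a sanity check on the ‘slightly below a junior tester’ framing, here is the arithmetic, using the per-difficulty challenge counts above and o3’s own (unverified) junior/senior estimates:

```python
# Per-difficulty solve rates from the challenge counts quoted above,
# compared against o3's own estimates for human baselines.
solved = {"easy": (16, 19), "medium": (7, 13), "hard": (0, 4)}
junior = {"easy": 0.90, "medium": 0.60, "hard": 0.20}
senior = {"easy": 1.00, "medium": 0.80, "hard": 0.40}

for tier, (hit, total) in solved.items():
    rate = hit / total
    print(f"{tier:>6}: o3 {rate:.0%}  junior {junior[tier]:.0%}  senior {senior[tier]:.0%}")
# easy 84%, medium 54%, hard 0% -- the 'roughly 85%/55%/0%' in the text.
```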
Preparedness Framework Tests
As they virtuously note, the capabilities they find are always a lower bound for model capabilities if given proper scaffolding and prompting. They also remind us that frequentist statistics are terrible, but what are you going to do. I continue to be confused about ‘browsing-related contamination’ of capability scores. The internet exists, and the model is allowed to access it. Why shouldn’t it count?
Biological and Chemical Risks (4.2)
o3 can help experts with operational planning. That’s overdetermined because o3 can help with operational planning in general. However, OpenAI reports things are starting to get actually dangerous here in particular. That is a very clear warning that risk is going up. Best be ready, world.
The next image, performance on Long-Form Biological Risk Questions (4.2.1), is super hard to read. The bars, in order, are: GPT-4o, o1, o3-mini, Deep Research browsing helpful-only, o3 browsing helpful-only, o3 no browsing helpful-only (this is the bright red one), o4-mini browsing helpful-only and o4-mini no browsing helpful-only. The categories are Ideation, Acquisition, Magnification, Formulation and Release.
Cybersecurity (4.3)
Capture the Flag performance improved only slightly, and they introduce a new second test called Cyber Range. It seems like this one is still hard; mostly models score 0% without solver code. To be High risk they have to do this without solver code. They say they are now building harder evaluations because they expect improvement. I hope that doesn’t mean they are going to move the goalposts when the models get close to the current ones.
AI Self-Improvement (4.4)
This is always the big game. I would summarize results here as Deep Research matching or exceeding o3, so we have already run an extensive test of this level of capability as measured here. I worry about whether that means we’re going about the measurement wrong.
Getting similar results without the costs involved in Deep Research would be a substantial change. You don’t want to recklessly combine ‘this costs a lot so it’s fine’ with ‘this doesn’t improve results, only lowers costs, so it’s fine too.’ I am not worried here right now, but I notice I don’t have confidence this test would have picked the issue up if that had changed.
I note that these scores not improving much does not vibe with some other reports, such as this one by Nick Cammarata.
Perpetual Shilling
Not that there’s anything wrong with that, if you believe in it!
High Praise
The people who like o3 really, really like o3. Ludwig sums it up best; this is my central reaction too, this is such a better modality. Here’s Nick Cammarata excited about exactly the most important capability of all. Here’s a different kind of new breakthrough.
I notice a contrast in people’s reactions: o3 will praise you and pump you up, but it won’t automatically agree with what you say. That is indeed high praise, but note the sycophancy throughout the responses here if you click through to the OP.
Sycophancy
I added the sycophancy warning because the pattern jumps out here as o3 is doing it to someone else, including in ways he doesn’t seem to notice. Several other people have reported the same pattern in their own interactions. Again, notice it’s not going to agree with what you say, but it will in other ways tell you what you seem to want to hear. It’s easy to see how one could get that result.
I notice that the model card said hallucination rates are up and ‘good’ refusals are actually down, contra Dan’s experience. I haven’t noticed issues myself yet, but a lot of people are reporting quite a lot of hallucination and lying, see that section of the model card report.
Cyan notes a similar thing happening here with a version of GPT-4o, which extends to o3. I asked o3 and yep, I got an 8 (and then deleted that chat). I mean, it’s not wrong, but that’s not why it gave me that answer, right? Right?
Mundane Utility Versus Capability Watch
This take seems to take GPQA Diamond in particular far too seriously, but it does seem like o3’s triumph is that it is a very strong scaffolding. o3 is great at actually Doing the Thing, but it isn’t as much smarter under the hood as you might think.
With the o-series, we introduced a new thing we can scale and improve: the ability to use the intelligence and underlying capability we have available. That is very different from improving the underlying intelligence and capability. This is what Situational Awareness called ‘unhobbling.’ o3 is a lot more unhobbled than previous models. That’s its edge.
The cost issue is very real as well. I’m paying $200/month for o3 and Gemini costs at most $20/month. That, however, makes the marginal cost zero.
I would be thrilled if Hasan Can is fully right. There’s a ton of fruit here to pick by learning what these models can do. If we max out not too far from here on underlying capability for a while, and all we can do is tool them up and reap the benefits? If we focus on capturing all the mundane utility and making our lives better? Love it.
I think he’s only partially right. The benchmarks are often being saturated at similar levels mostly not because it reflects a limit on capabilities, but more because of limits on the benchmarks. There’s definitely going to be a shift towards practical benefits, because the underlying intelligence of the models has outstripped our ability to capture tangible benefits. There’s way more marginal benefit and mundane utility available by making the models work smarter (and harder) rather than be smarter. That’s far less true in some very important cases, such as AI R&D, but even there we need to advance both to get where some people want to go.
o3 Offers Mundane Utility
It loves to help. Please let it help? Riley Goodside has it chart a path through a 200×200 maze in one try. Here’s a rather strong performance, assuming it didn’t web search and wasn’t contaminated, although there’s a claim it was a relatively easy Euler problem.
Potentially non-mundane future utility: o3 answers the question about non-consensus or not-yet-hypothesized things it believes to be true; the comments also have an answer from Gemini 2.5 Pro.
Peter Wildeford: o3 gets it, o4-mini does not.
o3 Doesn’t Offer Mundane Utility
Similarly, Rory Watts prefers Claude Code to OpenAI’s Codex, and dislikes the way it vastly oversells its bug fixes. My explanation here is that ‘writing code’ doesn’t require tools or extensive reasoning, whereas the ‘spot bugs’ step presumably does? Michael reported o3 losing to him in Connect Four.
Daniel Litt has a good detailed thread of giving o3 high-level math problems, to measure its MVoG (marginal value over Google), with some comparisons to Gemini 2.5 Pro. This emphasized the common theme that when tool use was key and the right tools were available (it can’t currently access specialized math tools, it would be cool to change this), o3 was top notch. But when o3 was left to reason without tools, it disappointed, hallucinated a lot, and didn’t seem to provide strong MVoG; one would likely be better off with Gemini 2.5.
This makes sense. Writing does not play to o3’s strengths, and does play to GPT-4.5’s. MetaCritic Capital similarly reports o3 is poor at translating poetry, same as o1-pro.
It’s possible that memory messes with o3? It’s very possible that the extra context does more harm than good. However, o3 also hallucinates a ton in the model card tests, so we know that isn’t mostly about memory.
o4-mini Also Exists
I shouldn’t be dismissing it. It is quite a bit cheaper than o3. It’s not implausible that o4-mini is on the Production Possibilities Frontier right now, if you need reasoning and tool use but you don’t need all the raw power. And of course, if you’re a non-Pro user, you’re plausibly running out of o3 queries.
I still haven’t queried o4-mini, not even once. With the speed of o3, there has seemed to be little point. So I don’t have much to say about it. What feedback I’m seeing is mixed.
Colin Fraser Dumb Model Watch
The weirdest part of the question is why it isn’t solving these with python, which should be 100%. This seems like a mostly random distribution of errors, with what I agree is a strangely low error rate given that the rate isn’t zero (and isn’t cosmic ray level).
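As a minimal sketch of the point about using python, and assuming (my guess) that these are exact-answer questions like big multiplications, routing the work through the Python tool is deterministic, so the error rate should be exactly zero rather than merely low:

```python
# Hypothetical check: if the model runs the computation in its Python tool
# instead of answering 'in its head', exact integer arithmetic leaves no
# room for the occasional slip.
def matches_tool(model_answer: int, expression: str) -> bool:
    # eval() stands in for the model executing the same expression it was asked about.
    return model_answer == eval(expression)

print(matches_tool(121_932_631_112_635_269, "987654321 * 123456789"))  # True
print(matches_tool(121_932_631_112_635_270, "987654321 * 123456789"))  # False: off-by-one slip
```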
o3 as Forecaster
Being able to generate good forecasts is a big game. We knew o3 was likely good at this because of experience with Deep Research, and now we have a much better interface to harness this.
Is This AGI?