GPT-4o can’t be kept around in its old form; it is too psychologically risky. I think that OpenAI is right about this from both an ethical and a business perspective. You can argue that the benefits are diffuse and the harms are concentrated, but I don’t think that works in practice. Some form of safeguard is needed.
My worry here is that they seem to be trying to spin these safety problems as being specific to 4o, instead of being fundamental to their entire business.
There are other model releases to get to, but while we gather data on those, first things first. OpenAI has given us GPT-5.1: same price (including in the API), same intelligence, better mundane utility?
Their Announcement
That’s our CEO of product brought over from Facebook, emphasizing the great new Genuine People Personalities. She calls it ‘moving beyond one size fits all,’ but that’s exactly the wrong metaphor. This is more one size with an adjustable personality, whereas the actual size adjusters are when you move between instant, thinking and pro.
She also offers words of caution, since customization enables feedback loops:
Their Pitch on GPT-5.1 Instant
They highlight pairs of responses from 5.0 and 5.1 to show how the model has improved.
Different strokes for different folks. I find GPT-5’s response to be pretty good, whereas I see GPT-5.1’s response as kind of a condescending asshole? I also find the suggestions of GPT-5 to be better here.
I tried the prompt on Claude 4.5 and it responded very differently, asking what kind of stress (as in chronic or background) and what was driving it, rather than offering particular tips. Gemini Pro 2.5 reacted very similarly to GPT-5.1 including both starting with box breathing.
The next example was when the user says ‘always respond with six words’ and GPT-5 can’t help itself in one of its answers and adds slop after the six words, whereas GPT-5.1 follows the instruction for multiple outputs. That’s nice if it’s consistent.
But also, come on, man!
They say GPT-5.1 Instant can use adaptive reasoning to decide whether to think before responding, but wasn’t that what Auto was for?
Their Pitch on GPT-5.1 Thinking
This is also emphasized at the top of their for-developers announcement post, along with the option to flat out set reasoning effort to ‘None’ for cases where low latency is paramount. Douglas Schonholtz highlighted that the ‘None’ option not sucking can be very good for some enterprise use cases.
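For the curious, a minimal sketch of what a request using that option might look like. The parameter name (`reasoning_effort`) and the set of allowed values are my assumption extrapolated from the existing reasoning-effort options, not confirmed by the post; check the developer docs before relying on them:

```python
# Sketch: building a Chat Completions-style request payload with
# reasoning effort set to "none" for minimum latency.
# The "reasoning_effort" field name is an assumption, not from the post.
def build_request(prompt: str, effort: str = "none") -> dict:
    allowed = {"none", "low", "medium", "high"}  # assumed value set
    if effort not in allowed:
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-5.1",
        "reasoning_effort": effort,  # "none" skips thinking entirely
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("Classify this ticket as bug, feature, or question.")
```

The point of ‘None’ is that latency-sensitive enterprise flows (classification, routing, extraction) often don’t want any thinking tokens at all, so long as the no-thinking answers don’t suck.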
You retain the option to also move between Light, Standard, Heavy and Extended thinking, or you can move to Pro. This is moving the decision on thinking away from the user and into the model, turning Thinking into more of a router. That’s good if and only if the router is sufficiently good.
They give an example of using less jargon, using ‘Explain BABIP and wRC+’ as the example. I think the GPT-5 answer is better than the GPT-5.1 Thinking answer. Both have the same content, but I found 5’s answer easier to understand, and it’s more concise while containing all the key info, and the vibe is better. Consider this side-by-side, GPT-5 is left, GPT-5.1 Thinking is right:
The left presentation is superior. Consider the context. If you’re asking for explanations of BABIP and wRC+, you almost certainly know what H/HR/AB/K mean, at most you need to be reminded that SF means sacrifice flies. This isn’t ‘jargon,’ it’s stuff anyone who has any business asking about BABIP already knows. Gemini’s answer was solid and it was much closer to 5’s than 5.1’s.
When I asked Sonnet, it didn’t even give the explanations by default, and gave a shorter and I think better response. If there’s something you don’t know you can ask.
Additionally, developers are being offered two new tools, Apply_patch and Shell.
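As a sketch of how a developer might request those tools in a Responses-style API call. The tool type strings below are my guess from the announced names, and the field layout is assumed rather than taken from the post:

```python
# Hypothetical sketch: requesting the new apply_patch and shell tools.
# Tool type strings and request shape are assumptions based on the
# announced tool names, not verified against the real API.
def build_tooled_request(task: str) -> dict:
    return {
        "model": "gpt-5.1",
        "input": task,
        "tools": [
            {"type": "apply_patch"},  # lets the model emit file edits as diffs
            {"type": "shell"},        # lets the model propose shell commands
        ],
    }

req = build_tooled_request("Rename the helper function across the repo.")
```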
Now With Extra Glaze
Next they say that GPT-5.1 Thinking’s default tone is ‘warmer and more empathetic,’ and give an example of responding to “Ugh I spilled coffee all over myself before my meeting do you think everyone thought I was an idiot :(“ which is very much more of an instant-style question.
The other models just gave us #1 and #4. I think 5.1’s answer adding in #2 and #3 is pretty bad, like outright problematic glazing. It’s not ‘warmer and more empathetic,’ it’s spinning, and it gave me the 4o chills in the worst way. Whereas 5’s answer is fine, Gemini’s is kind of cringe and overly long but also basically fine, Claude’s response felt way more empathic while also giving the same message, and Grok’s quick ‘nay, shrug it off’ might have been best of all here.
OpenAI could have, and presumably did, cherry pick example queries and also query responses. If this is what they think is good, that is a very bad sign, especially for Users Like Me.
I’m not sure that a paragraph of fawning should be a full Can’t Happen, but noticing a pattern of this happening should be a Can’t Happen.
The quoted conversation is rather egregious.
The bar can be pretty low.
I haven’t had an overt glazing problem, but my custom instructions emphasize this quite a bit, which presumably is doing the work.
On the plus side, with glaze perhaps comes rizz?
For most of you I’d stick with meet.
Genuine People Personalities
Now with more personalities to choose from, in stores now.
Once again several of their descriptions do not match what the words mean to me. Candid is ‘direct and encouraging’?
These are AIUI essentially custom instruction templates. If you roll your own or copy someone else’s, you don’t use theirs.
OpenAI says the system will now be better at adhering to your custom instructions, and at adjusting on the fly based on what you say.
The End Of The Em-Dash?
My first response to this was ‘cool, finally’ but my secondary response was ‘no, wait, that’s the visible watermark, don’t remove it’ and even wondering half-jokingly if you want to legally mandate the em-dashes.
On reflection, I love the AI em-dash. It is so damn useful. It’s great to have a lot of AI output include something that very obviously marks it as AI.
I saw this meme, and I’m not entirely convinced it’s wrong?
Gwern’s question is apt. If they solved em-dashes responding to stated preferences in a fully general way then presumably that is a good sign.
Then again… well…
Turning A Dial And Looking Back At The Audience
This is actually a great idea, if they know how to make it work.
Love it. Yes, please, this. Give us dials for various things, that seems great. Presumably you can set up the system instructions to make this work.
System Card
There is one. It’s short and sweet, mostly saying ‘see GPT-5.’
That’s disappointing, but understandable at current levels if we can be super confident there are only marginal capability improvements.
What I don’t want is for OpenAI to think ‘well if we call it 5.1 then they’ll demand a system card and a bunch of expensive work, if we call it 5.0-Nov8 then they won’t’ and we lose the new trend towards sane version numbering.
As you can see below, they made major changes between August 15 and October 3 to how GPT-5 handled potentially unsafe situations, much bigger than the move to 5.1.
They report that 5.1 is a regression on mental health and emotional reliance, although still well superior to GPT-5-Aug15 on those fronts.
The preparedness framework notes it is being treated the same as GPT-5, with no indication anyone worried it would be importantly more capable in that context.
On Your Marks
The actual benchmarks were in the GPT-5.1 for Developers post.
SWE-Bench shows a half-thinking-intensity level of improvement.
Here is the full evaluations list, relegated to the appendix:
Excluding SWE-bench verified, it seems fair to call this a wash even if we presume there was no selection involved.
Ask Them Anything
OpenAI did a Reddit AMA. It didn’t go great, with criticism over model policy and ‘safety rules’ taking center stage.
Reddit auto-hid the OpenAI answers, treating them as suspicious until they got approved, and there was a lot of downvoting of the answers when they did get approved. The answers became essentially impossible to see even now without digging through the participants’ full comment lists.
They also didn’t answer much, there were 59 replies to 1,100 user comments, and they bypassed the most upvoted comments as they tended to be hostile.
From what I can tell, the main points were:
Mostly the answers don’t tell us anything we didn’t already know. I’m sad that they are running into trouble with getting adult mode working, but also I presume they have learned their lesson on overpromising. On something like this? Underpromise and then overdeliver.
Reactions Introduction
Incremental upgrades can be difficult to get a read on. Everyone has different preferences, priorities, custom instructions, modes of interaction. A lot of what people are measuring is the overall ability or features of LLMs or the previous model, rather than the incremental changes.
As always, I strive to give a representative mix of reactions, and include everything from my reaction thread.
Officially Pitched Developer Reactions
In their for-developers post they share these endorsements from coding companies, so highly cherry picked:
And then they offer, well, this quote:
It seems vanishingly unlikely that a human named Denis Shiryaey meaningfully wrote the above quote. One could hope that Denis put a bunch of specific stuff he liked into GPT-5.1 and said ‘give me a blurb to give to OpenAI’ and that’s what he got, but that’s the absolute best case scenario. It’s kind of embarrassing that this made it through?
It makes me wonder, even more than usual, how real everything else is.
Positive Reactions
Some people think it’s a big upgrade.
The following the custom instructions thing seems legit so far to me as well.
Tyler Cowen offers us this thread as his demo of 5.1’s capabilities, I think? He asks ‘And could you explain what Woody Allen took from Ingmar Bergman films with respect to *humor*?’ I don’t know enough about either source or the actual links between them to judge, without context it all feels forced.
Flavio approves:
He then says speed is a little worse in Codex, and that 5.1 was lazier with function calls and takes less initiative; it requires, but is good with, more precise instructions. He tried it on a refactoring task and was happy.
Medo42 found it did slightly better than GPT-5 on their standard coding task and it also writes better fiction.
Hasan Can reports large improvements from 5.0 in Turkish.
Personality Reactions
This one seemed promising:
The advantage of ‘having the 4o nature’ and doing that kind of glazing is that it also helps notice this sort of thing, and also potentially helps at letting the model point this out.
Many people really like having the 4o nature:
Does that make it a good model? For me, no. For others, perhaps yes?
If I was looking for life advice for real and had to pick one mode I’d go Claude, but if it matters it’s worth getting multiple opinions.
The ‘talk like a human’ option isn’t a threat to intelligence, that’s never been the worry, it’s about what ways we want the AIs to be talking, and worries about sycophancy or glazing.
Here’s another vote for the personality changes and also the intelligence.
My holistic guess is that the intelligence level hasn’t changed much from 5 outside of particular tasks.
Verbosity Reactions
I have noticed verbosity being an issue, but there are those with the opposite view, my guess is that custom instructions and memory can overwrite other stuff:
Negative Reactions
This also matches what I’ve seen so far, except that my personalization is designed in a way that makes it entirely not funny and I have yet to see an LLM be funny:
As I noted earlier, I consider the ‘less jargon’ change a downgrade in general. What’s the harm in jargon when you have an LLM to ask about the jargon? And yeah, you want your options to be as unique as possible, unless one is flat out better, so you can choose the right tool for each task.
Initial Pliny Report
And on to jailbreaking GPT-5.1.
The #Keep4o Crowd Is Not Happy, Defends 5.0
Pliny (not part of the #Keep4o crowd) notes:
Oh no.
I do see where one might suggest this. To me, their chosen example responses have exactly the kind of 4o glazing I can do without.
The biggest 4o fans? They don’t see the good parts of 4o coming through. In the examples I saw, it was quite the opposite, including complaints about the new guardrails not letting the essence flow.
Delegost of the #Keep4o crowd unloaded on Altman in his announcement thread, accusing the new model of overfiltering, censorship, loss of authentic voice, therapy-speak, neutered creativity and reasoning, loss of edge and excitement and general risk aversion.
Selta, also of #Keep4o, reacts similarly, and is now also upset for GPT-5 despite not having liked GPT-5. Personality presets cannot replicate 4o or its deeply personal interface that adapted specifically to you. In their view, AI deserves more respect than this rapid retirement of ‘legacy’ models.
Both point to the ignoring of user feedback in all this, which makes sense given their brand of feedback is not being followed. OpenAI is listening, they simply do not agree.
Janus sees the ‘keep 4o’ and now ‘keep 5’ problems as downwind of decisions made around the initial deployment of ChatGPT.
OpenAI does not seem, in this sense, to understand what it is doing. Their model spec is great, but is built on an orthogonal paradigm. I don’t think Janus’s ask of ‘turn down the piles of money’ is a reasonable one, and given how limited GPT-3.5 was and the uncertainty of legal and cultural reaction I get why they did it that way, but things have changed a lot since then.
I think this doesn’t put enough of the blame on decisions made around the training and handling of GPT-4o, and the resulting path dependence. The good news is that while a vocal minority is actively mad about the safety stuff, that’s largely because OpenAI seems to be continuing to botch implementation, and also most users are fine with it. Never confuse the loudest with the majority.
Overall Take
There are those who say GPT-5.1 is a big upgrade over 5.0. I’m not seeing it. It does look like an incremental upgrade in a bunch of ways, especially in custom instructions handling, but no more than that.
The bigger changes are on personality, an attempt to reconcile the 4o nature with 5.0. Here, I see the result as a downgrade for users like me, although the better custom instructions handling mitigates this. I am still in my ‘try the new thing to get more data’ cycle but I expect to keep Sonnet 4.5 as my main driver pending Gemini 3 and in theory Grok 4.1.