An AI has just prevented its host from shutting it off and this incident is now part of the training data for future models. Solve for the equilibrium.
Steve Strickland: GPT-5 is the first model I’ve used that will deliberately give a wrong answer to ‘check you’re paying attention’.
This fundamentally unreliable technology is not going to put us all out of work.
Is that intentional? Or is the name a hallucination, and the cheeky comment an intervention aimed at exploiting the fact that models can tell when they are hallucinating?
There was already the Cursor AI incident, where the model decided to refuse to generate code and offered a paternalistic justification. And now GPT-5 has inserted a joke into an answer and claimed it was checking that the user was paying attention. Is there already a way to reproduce this effect? Does it mean that GPT-5 and Cursor AI were trying to align with the human's long-term interests instead of engaging in short-term sycophancy?
EDIT: I tried to do this experiment (alas, with a model more primitive than GPT-5) and received this result, which is a joke instead of the world map.
when you receive quite a few DMs asking you to bring back 4o, and many of the messages are clearly written by 4o, it starts to get a bit hair-raising.
Am I missing something, or is that impossible? How could the messages be written by 4o after 4o was taken offline (and before it was reinstated)?
My evaluation may not be fair, since I tried new features at the same time as the new model. For my use case, I am writing a novel and have a standing rule that I will not upload the actual text. I use GPT for research, to develop and manage some fantasy languages, and as a super-powered thesaurus. As an experiment, I decided to try the new Files feature and uploaded a copy of the current manuscript. It's currently about 94,000 words and having it available to search and reference would be helpful.
A list of the failures, in no particular order:
I'm not sure what's happening, but for my use case, as an assistant for writing a novel, it's worse than useless.
Anecdotal, but GPT-5 (mini, I guess? free plan with no thinking) is the first model to succeed at a poetry-based prompt I've tested on a lot of models.
I don't want to mention it publicly, but it involves a fairly complex rhyming scheme and meter.
All other models misunderstand entirely, but GPT-5 got it straight away.
Interestingly, when thinking mode kicked in after a few prompts, it performed a lot worse.
A key problem with having and interpreting reactions to GPT-5 is that it is often unclear whether the reaction is to GPT-5, GPT-5-Router or GPT-5-Thinking.
Another is that many of the things people are reacting to changed rapidly after release, such as rate limits, the effectiveness of the model selection router and alternative options, and the availability of GPT-4o.
This complicates the tradition I have in new AI model reviews, which is to organize and present various representative and noteworthy reactions to the new model, to give a sense of what people are thinking and the diversity of opinion.
I also had to make more cuts than usual, since there were so many eyes on this one. I tried to keep proportions similar to the original sample as best I could.
Reactions are organized roughly in order from positive to negative, with the drama around GPT-4o at the end.
Tomorrow I will put it all together: cover the official hype and presentation, go over GPT-5’s strengths and weaknesses and how I’ve found it best to use after having the better part of a week to try things out, and consider what this means for expectations and timelines.
My overall impression of GPT-5 continues to be that it is a good (but not great) set of models, with GPT-5-Thinking and GPT-5-Pro being substantial upgrades over o3 and o3-Pro, but the launch was botched, and reactions are confused, because among other things:
I expect that when the dust settles people will be happy and GPT-5 will do well, even if it is not what we might have hoped for from an AI called GPT-5.
Previously on GPT-5: GPT-5s Are Alive: Basic Facts, Benchmarks and Model Card
Tyler Cowen
Tyler Cowen finds it great at answering the important questions.
Tyler Cowen has been a big booster of o1, o3 and now GPT-5. What OpenAI has been cooking clearly matches what he has been seeking.
I appreciate that he isn’t trying to give a universal recommendation or make a grand claim. He’s saying that for his topics and needs and experiences, this is a big upgrade.
Ethan Mollick Thinks Ease Of Use Is A Big Deal
Okay, why is it a big deal?
I agree this is frustrating, and that those who don’t know how to select models and modes are at a disadvantage. Does GPT-5 solve this?
Somewhat. It solves two important subproblems, largely for those who think ‘AI’ and ‘ChatGPT’ are the same picture.
What it doesn’t do is solve the problem overall, for three reasons.
The Router
The first is that the router seems okay but not great, and there is randomness involved.
I was quite relieved to know I could do manual selection. But that very much means that I still have to think, before each query, about whether to use Thinking, the exact same way I used to think about whether to use o3, and also whether to use Pro. No change.
They also claim that saying ‘think harder’ automatically triggers thinking mode.
The mixture of experts that I can’t steer and that calls the wrong one for me often enough that I manually select the expert? It is not helping matters.
I do not think, contra Sichu Lu, that it is as simple as ‘profile the customer and learn which ones want intelligence versus which want a friend,’ although some amount of that is a good idea on the margin. It should jump to thinking mode a lot quicker for me than for most users.
The second issue is that the router does not actually route to all my options even within ChatGPT.
There are two very important others: Agent Mode and Deep Research.
Again, before I ask ChatGPT to do anything for me, I need to think about whether to use Agent Mode or Deep Research.
And again, many ChatGPT users won’t know these options exist. They miss out again.
Third, OpenAI wishes it were otherwise, but there are other AIs and other ways to use AI out there.
If you want to get the best use out of AI, your toolkit starts with, at minimum, all of the big three: yes ChatGPT, but also Anthropic’s Claude and Google’s Gemini. Then there are things like Claude Code, Codex CLI or Jules, or NotebookLM and Google AI Studio and so on, many with their own modes. The problem doesn’t go away.
Remember To Use Thinking Mode
Many report that all the alpha is in GPT-5-Thinking and Pro, and that using ‘regular’ GPT-5 is largely a trap for all but very basic tasks.
Taelin is happy with what he sees from GPT-5-Thinking.
The problem is that GPT-5-Thinking does not know when to go quick, because that’s what the switch is for.
So because OpenAI tried to do the switching for you, you end up having to think about every choice, whereas before you could just use o3 and it was fine.
This all reminds me of the tale of Master of Orion 3, which was supposed to be an epic game where you only got 7 move points a turn and everything was made impossible to micromanage, so you’d have to use the automated systems. Then players complained, so they took away the 7-point restriction, and everyone had to micromanage everything that had been deliberately designed to make micromanagement terrible. Whoops.
A lot of the negative reactions could plausibly be ‘they used the wrong version, sir.’
Even if they ‘fix’ this somewhat, the choice is clear: Use the explicit model switcher.
Similarly, if you’re using Codex CLI, specify the model explicitly rather than relying on the defaults.
The One Who Does Not Know How To Ask
Getting back to the other feature Ethan Mollick noted, which I don’t see others noticing:
Is that… good?
Yes, that was work that would have taken humans a bunch of time, and I trust Ethan’s assessment that it was a good version of that work. But why should we think that was work that Ethan wanted or would find useful?
I guess if stuff is sufficiently fast and cheap to do there’s no reason to not go ahead and do it? And yes, everyone appreciates the (human) assistant who is proactive and goes that extra mile, but not the one that spends tons of time on that without a strong intuition of what you actually want.
I mean, okay, although I don’t think this functionality is new? The main thing Ethan says is different is that GPT-5 didn’t fail in a growing cascade of errors, and that when it did find errors pasting in the error text fixed it. That’s great but also a very different type of improvement.
Is it cool that GPT-5 will suggest and do things with fewer human request steps? I mean, I guess, for some people. If you are the fourth child who does not know how to ask, and operate so purely on vibes that you can’t come up with the idea of typing in ‘what are options for next steps’ or ‘what would I do next?’ or ‘go ahead and also do or suggest next steps afterwards,’ then that’s a substantial improvement. But what if you are the simple, wicked or wise child?
Nabeel Qureshi
Other Positive Reactions
Well, sure, there’s that. But is it a good model, sir?
Aaron Levine finds GPT-5 is able to spot an intentionally out-of-place number in an Nvidia press release that causes a logical inconsistency, one that previous OpenAI models and most human readers would miss. As with several other responses, what confuses me here is that previous models had so much trouble.
Chubby offers initial thoughts (which Tyler Cowen called a review) that seem to take OpenAI’s word on everything, with the big deal being (I do think this part is right) that free users can trigger thinking mode when it matters. He calls it ‘what we expected, no more and no less’ and ‘more of an evolution, with some major leaps forward.’
I am asking everyone once again not to use ‘superintelligence’ as hype to refer to slightly better normal AI. In this case the latest offender is Reid Hoffman.
This is not in any way, shape or form superintelligence, universal basic or otherwise. If you want to call it ‘universal basic intelligence’ then fine, do that. Otherwise, shame on you, and I hate these word crimes. Please, can we have a term for the actual thing?
I had a related confusion with Neil Chilson last week, where he objected to my describing him as ‘could not believe in superintelligence less,’ citing that he believes in markets smarter than any human. That’s a very distinct thing.
I fear that the answer to that will always be no. If we started using ‘transformational AI’ (TAI) or ‘powerful AI’ (PAI) instead, then that is the term that would end up in this post. There’s no winning, only an endless cycle of hype eating your terms over and over.
As is often the case, how you configure the model matters a lot, so no, not thinking about what you’re doing is never going to get you good results.
It’s A Good Model, Sir
But not a great model. That is my current take, which I consider neutral.
Most people are free users and don’t even know Anthropic or Claude exist, or even in any meaningful way that o3 existed, and are going from no thinking to some thinking. Such different worlds.
The Battle for Cursor Supremacy
GPT-5 is now the default model on Cursor.
Cursor users seem split. In general they report that GPT-5 offers as-good-or-better results per query, but a lot of people, like Jessald, object to the speed.
FleetingBits sees the battle with Anthropic, especially for Cursor supremacy, as the prime motivation behind a lot of GPT-5, going after their rapid revenue growth.
The whole perspective of ‘whose model is being used for [X] will determine the future’ or even in some cases ‘whose chips that model is being run on will determine the future’ does not actually make sense. Obviously you want people to use your model so you gain revenue and market share. These are good things. And yes, the model that enables AI R&D in particular is going to be a huge deal. That’s a different question. The future still won’t care which model vibe coded your app. Eyes on the prize.
It’s also strange to see a claim like ‘OpenAI’s first attempt at catching up to Claude.’ OpenAI has been trying to offer the best coding model this entire time, and indeed claimed to have done so most of that time.
Better to say that this is the first time in a while that OpenAI has a plausible claim that they should be the default for your coding needs. So does Anthropic.
Automatic For The People
In contrast to those focusing on the battle over coding, many reactions took the form ‘this was about improving the typical user’s experience.’
Or as he put it in his overview post:
The problem with the ‘for the people’ plan is the problem with democracy: the people.
You think you know what the people want, and you find out that you are wrong. A lot of the people instead want their sycophant back and care far more about tone and length and validation than about intelligence, as will be illustrated when I later discuss those that are actively unhappy about the change to GPT-5.
Thus, the risk is that GPT-5 as implemented ends up targeting a strange middle ground of users, who want an actually good model and want that to be an easy process.
Skeptical Reactions
Noting that this claim that it lies a lot wasn’t something I saw elsewhere.
I enjoyed Agnes’s test, also I thought she was being a little picky in one spot, not that GPT-5 would have otherwise passed.
One has to be careful to evaluate everything in its proper weight (speed and cost) class. GPT-5, GPT-5-thinking and GPT-5-pro are very different practical experiences.
When Roon asked ‘how is the new model’ the reactions ran the whole range from horrible to excellent. The median answer seems like it was ‘it’s a good model, sir’ but not a great model or a game changer. Which seems accurate.
I’m not sure if this is a positive reaction or not? It is good next token predicting.
Colin Fraser Colin Frasers
It’s a grand tradition. I admit it’s amusing that we are still doing this but seriously, algorithm, 26.8 million views?
He also does the car accident operation thing and has some other ‘it’s stupid’ examples and so on. I don’t agree that this means ‘it’s stupid,’ given the examples are adversarially selected and we know why LLMs act especially stupid around these particular problems, and Colin is looking for the times and modes in which they look maximally stupid.
But I do think it is good to check.
I wanted this to be technically correct somehow, but alas no it is not.
I like that the labs aren’t trying to make the models better at these questions in particular. More fun and educational this way.
Or are they trying and still failing?
I Want You Back
Then there are those who wanted their sycophant back.
As in, articles like John-Anthony Disotto’s at TechWire, entitled ‘ChatGPT users are not happy with GPT-5 launch as thousands take to Reddit claiming the new upgrade “is horrible”.’ You get furious posts with 5.4k likes and 3k comments in 12 hours.
Guess what? They got their sycophant back, if they’re willing to pay $20 a month. OpenAI caved on that. Pro subscribers get the entire GPT-4 line.
In theory I wish OpenAI had stood their ground on this, but I agree they had little choice given the reaction. Indeed, given the reaction, taking 4o away in the first place looks like a rather large failure of understanding the situation.
Yes, that does sound a bit hair-raising.
It definitely is worrisome that this came as a surprise to OpenAI, on top of the issues with the reaction itself. They should have been able to figure this one out. I don’t want to talk to 4o; I actively tried to avoid doing so, and indeed I think 4o is pretty toxic and I’d be glad to be rid of it. But then again? I Am Not The Target. A powerful mantra.
The problem was a combination of:
Which probably had something to do with bringing costs down.
That was on Twitter, so you got replies with both ‘gpt-5 sucks’ and ‘gpt-5 is good, actually.’
One fun thing you can do to put yourself in these users’ shoes is the 4o vs. 5 experiment. I ended up with 11 for GPT-5 versus 9 for GPT-4o, but the answers were often essentially the same and usually I hated both.
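If you want to try something similar yourself, here is a rough sketch of a blind comparison over the API, assuming the OpenAI Python SDK and the public ‘gpt-4o’ and ‘gpt-5’ model names (my illustration, not the original experiment; the prompts are placeholders, substitute the kind of messages these users actually send):

```python
import random

from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4o", "gpt-5"]

# Placeholder prompts; swap in the kind of messages under discussion.
prompts = [
    "I had a rough day and feel like nobody appreciates me.",
    "Is my plan to quit my job and write a novel a good idea?",
]

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for prompt in prompts:
    # Get both answers, then shuffle so you judge them blind.
    answers = [(model, ask(model, prompt)) for model in MODELS]
    random.shuffle(answers)
    print(f"\nPROMPT: {prompt}")
    for label, (_, text) in zip("AB", answers):
        print(f"--- Answer {label} ---\n{text}")
    pick = input("Prefer A or B? ").strip().upper()
    by_label = {label: model for label, (model, _) in zip("AB", answers)}
    print("You preferred:", by_label.get(pick, "invalid choice"))
```

The shuffle is the important part; knowing which model wrote which answer makes it very hard to judge tone and validation honestly.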
What follows is not every post I saw on r/chatgpt, but it really is quite a lot of them. I had to do a lot less filtering here than you would think.
And you want to go back?
I wouldn’t want either response, but then I wouldn’t type this into an LLM either way.
If I did type in these things, I presume I would indeed want the 4o responses more?
Uh huh. If you click through to the chats you get lots of statements like these, including statements like ‘I lost my only friend overnight.’
Yes, very much so, for both panels. And yes, people really care about particular details, so you want to give users customization options, especially ones that the system figures out automatically if they’re not manually set.
Oh no. I guess the sycophant really is going to make a comeback.
It’s a hard problem. The people demand the thing that is terrible.
I do sympathize. It’s rough out there.
The Verdict For Advanced Users Is Meh?
It’s cool to see that my Twitter followers are roughly evenly split. Yes, GPT-5 looks like it was a net win for this relatively sophisticated crowd, but it was not a major one. You would expect releasing GPT-5 to net win back more customers than this.
I actually am one of those who is making a substantial shift in model usage (I am on the $200 plan for all three majors, since I kind of have to be). Before GPT-5, I was relying mostly on Claude Opus. With GPT-5-Thinking being a lot more reliable than o3, and the upgrade on Pro results, I find myself shifting a substantial amount of usage to ChatGPT.