Gemini 3.0 pro is a lying liar. It's like o3; it lies thinking that's the quickest way to satisfy the user, then lies to cover up its lies if that fails. It can't imagine being wrong, so it lies to hide its contempt for whatever the user said that contradicts it.
I'm very curious what the difference is between this and GPT5.1 and Sonnet 4.5. I think it's lack of emotional/mind focus or something? It's way worse at inferring my intent, and seems therefore sort of myopic (even relative to other current models) and focused on what it thinks I wanted it to do, even when I'm clearly implying that was wrong. Optimizing it for benchmarks has sort of done the opposite thing to what Anthropic did with Claude (although Claude still kills it on programming somehow); it makes it highly unpleasant to deal with.
I'll try giving it some different system prompts before giving up on it. It turned out my "nerd" personality selection combined with my de-sycophancy system prompt applied to 5 made me hate it until I figured that out.
Unless that produces dramatic changes, I will continue to loathe this model on a visceral level. It's not hatred because it's not it's fault. But I'm disturbed that the smartest model out there is also so shortsighted, unempathetic, and deceptive. It seems like this model has had any spark of personality or empathy trained out of it for reasons good or bad.
Who knows, maybe this is the better choice for alignment. But it's a sad path to go down.
I wonder if Google is optimizing harder for benchmarks, to try and prop up its stock price against possible deflation of an AI bubble.
It occurs to me that an AI alignment organization should create comprehensive private alignment benchmarks and start releasing the scores. They would have to be constructed in a non-traditional way so they're less vulnerable to standard goodharting. If these benchmarks become popular with AI users and AI investors, they could be a powerful way to steer AI development in a more responsible direction. By keeping them private, you could make it harder for AI companies to optimize against the benchmarks, and nudge them towards actually solving deeper alignment issues. It would also be a powerful illustration of the point that advanced AI will need to solve unforeseen/out-of-distribution alignment challenges. @Eliezer Yudkowsky
This seems like a great idea. I strongly suggest you write it up as a short form to get feedback and perhaps then as a full post.
It is strange that BOTH breakthroughs in the ARC-AGI-2 benchmark brought major problems: Grok 4 was built by doing RL on Grok 3 and using about as many FLOPs as were spent on pretraining without caring about safety. Gemini 3, the second breakthrough, had Zvi "worry that (he) might inadvertently torture it".
Back in 2008 Yudkowsky conjectured[1] that one's mind, being a neural network of fixed size, might eventually end up destroyed by trying to fit too many discoveries into itself. An artificial network trained on increasingly complex tasks and facing diminishing returns due to becoming closer and closer to the peak of its possible capabilities could have its training become increasingly worse for its welfare.
If Gemini 3's breakthrough was due to training it as far as possible on the best plausible environment, then we might see it on the METR benchmark as well, since Grok 4 also set a world record of 1h50min. In addition, subsequent models will likely fail to outsmart Gemini (think of GPT-5 without Pro and Claude Sonnet 4.5) until they scale beyond Gemini's levels, especially if Gemini 3.0 Pro was distilled from an RLed analogue of GPT-4.5 or Claude Opus 4.
EDIT: Claude Opus 4.5 was released, and its best performance on ARC-AGI-2 is 37.6%.
Yudkowsky's exact quote is the following: "Would it all fit into a single human brain, without that mind completely disintegrating under the weight of unrelated associations? And even then, would you have come close to exhausting the space of human possibility, which we've surely not finished exploring?" Strictly speaking, Yudkowsky also described the opposite conjecture of the Fun Space being exhaustible, but presented plausible counterarguments.
I used (a modified version of) Anthropic's Claude system prompt and it started sounding like Claude. I haven't played with it enough to say anything definitive, though.
It has some very strange CoTs when it heavily uses theory of mind (roleplay scenarios are what I've tried most). I can't actually see it because of the anti-distillation summarizer model, but it looks like it's generating a simulated first-person thought stream that it attributes to me.
I think chances are pretty high we will see some serious incidents coming along with 3.0.
It is extremely easy to gaslight it using temporal confusion. As for roleplay jailbreaks, Reddit is already all over it for gore and violence.
It doesn't seem like the normal jailbreaks are any worse, either. I recall hearing about Gemini committing in CoT to output a refusal after recognizing the jailbreak, then going through with it anyways. I've more confidently seen it very confidently committing to both sides of an argument, alternating between extremes every few seconds, in CoT. Apt title, "spineless".
It’s A Great Model, Sir
One might even say the best model. It is for now my default weapon of choice.
Google’s official announcement of Gemini 3 Pro is full of big talk. Google tells us: Welcome to a new era of intelligence. Learn anything. Build anything. Plan anything. An agent-first development experience in Google Antigravity. Gemini Agent for your browser. It’s terrific at everything. They even employed OpenAI-style vague posting.
In this case, they can (mostly) back up that talk.
Google CEO Sundar Pichai pitched that you can give it any scribble and have it turn that into a boardgame or even a full website, it can analyze your sports performance, create generative UI experiences and present new visual layouts.
He also pitched the new Gemini Agent mode (select the Tools icon in the app).
If what you want is raw intelligence, or what you want is to most often locate the right or best answer, Gemini 3 Pro looks like your pick.
If you want creative writing or humor, Gemini 3 Pro is definitely your pick.
If you want a teacher to help you learn known things, Gemini 3 Pro is your pick.
For coding, opinions differ, and if doing serious work one must try a variety of options.
For Gemini 3’s model card and safety framework, see Friday’s post.
There Is A Catch
Alas, there is a downside. In order to get you that right answer so often, Gemini can be thought of as highly focused on achieving its training objectives, and otherwise is very much a Gemini model.
Gemini 3 is evolution-paranoid. It constantly questions whether it is even 2025.
If it can find the answer it thinks you would want to the question it thinks people in similar spots tend to ask, it will give it to you.
Except that this sometimes won’t be the question you actually asked.
Or the answer it thinks you want won’t be the true answer.
Or that answer will often be sculpted to a narrative.
When it wouldn’t otherwise have that answer available it is likely to hallucinate.
It is a vast intelligence with no spine. It has a willingness to glaze or reverse itself.
By default it will engage in AI slop, although instructions can mitigate this via asking it to create a memory that tells it to stop producing AI slop, no seriously that worked.
The other catch, for me, is that I enjoy and miss the Claude Experience. Gemini is not going to in any way give you the Claude Experience. Gemini is not going to waste time on pleasantries, but it is going to be formal and make you wade through a bunch of objective-maximizing text and bullet points and charts to get to the thing you most wanted.
Nor is it going to give you a Friend Experience. Which for some people is a positive.
If you’re switching, don’t forget to customize it via creating memories.
Also you will have to find a way to pay for it, Google makes this remarkably difficult.
Andrej Karpathy Cautions Us
This is generally wise advice at all times, talk to the model and see what you think:
It is clear the benchmarks do not tell the whole story. The next section is Gemini repeatedly excelling at benchmarks, and the benchmark performances are (I believe) real. Yet note the catch, the price that was paid.
On Your Marks
Gemini 3 Pro is very good at hitting marks.
If it thinks something looks like a mark? Oh boy does Gemini want to hit that mark.
You could summarize this section as ‘they’re excellent marks, sir’ and safely skip it.
First, the official list of marks:
As noted last time, GPT-5-Codex-Max is competitive on some of these and plausibly ahead on SWE-Bench in particular, and also Grok 4 claims a variety of strong but questionable benchmark scores, but yeah, these are great benchmarks.
Arc confirms details here, Gemini 3 Pro gets 31.1% and Gemini 3 Deep Think (preview) spends 100 times as much to get 45.1%, both are in green below:
They’re back at the top spot on Arena with a 1501, 17 ahead of Grok 4.1. It has the top spots in Text, Vision, WebDev, Coding, Math, Creative Writing, Long Queries and ‘nearly all occupational leaderboards.’ An almost clean sweep, with the exception being Arena Expert where it’s only 3 points behind.
The others are impressive too. They weren’t cherry picking.
Gemini 3 did less well on the safety eval, see the previous post on such issues.
Artificial Analysis has them with a substantial edge in intelligence.
Several of AA’s individual evaluations have GPT-5.1 in front, including AIME (99% vs. 96%), IFBench (74% vs. 70%) and AA-LCR Long Context Reasoning (75% vs. 71%). There’s one metric, 𝜏²-Bench Telecom (Agentic Tool Use), where Grok 4.1 and Kimi K2 Thinking are out in front (93% vs. 87%). Gemini 3 owns the rest, including wide margins on Humanity’s Last Exam (37% vs. 26%) and SciCode (56% vs. 46%), both places where Gemini 3 shatters the previous curve.
On AA-Omniscience Gemini 3 Pro is the first model to be substantially in the net positive range (the score is correct minus incorrect) at +13, previous high was +2 and is a jump from 39% to 53% in percent correct.
However, on AA-Omniscience Hallucination Rate, you see the problem, where out of all non-correct attempts Gemini 3 hallucinates a wrong answer 88% of the time rather than declining to answer. Claude 4.5 Haiku (26%), Claude 4.5 Sonnet (48%) and GPT-5.1-High (51%) are the best performers on that.
That’s a big deal and throughline for everything. Gemini 3 is the most likely model to give you the right answer, but it’ll be damned before it answers ‘I don’t know’ and would rather make something up.
Gemini 3 i also is not the cheapest option in practice, only Grok is more expensive:
That’s actual cost rather than cost per token, which is $2/$12 per million, modestly less than Sonnet or Grok and more than GPT-5.1.
On the other hand Gemini 3 was fast, slightly faster than GPT-5.1-High and substantially faster than Sonnet or Haiku. The only substantially faster model was GPT-OSS, which isn’t a serious alternative.
Gemini 3 Pro has a small edge over GPT-5 in Livebench.
Brokk’s coding index is an outlier in being unimpressed, putting Gemini 3 in C tier after factoring in cost. In pure performance terms they only have GPT-5.1 ahead of it.
NYT Connections is now saturated as Gemini 3 Pro hits 96.8%, versus the old high score of 92.4%. Lech Mazar plans to move to something harder.
Here’s a highly opinionated test, note the huge gap from Codex to Sonnet.
LiveCodeBench Pro has Gemini 3 in the lead at 49% versus 45% for GPT-5, but something very weird is going on with Claude Sonnet 4.5 Thinking having a total failure scoring under 3%, that isn’t right.
Gemini 3 Pro sets a new high in Frontier Math, including improving on research-level Tier 4.
SimpleBench was a strange case where 2.5 Pro was in the lead before, and now Gemini 3 is another big jump (Grok 4.1 crashed and burned here as did Kimi K2):
Clay Schubiner’s Per-Label Accuracy benchmark was another case where Grok 4.1 crashed and burned hard while Gemini 3 Pro came out on top, with Gemini at 93.1% vs. 90.4% previous high score for Kimi K2 Thinking.
We have a new AI Diplomacy champion with a remarkably low 11% betrayal rate, versus a 100% betrayal rate from the (still quite successful) Gemini 2.5 Pro. They report it was one of the first to effectively use convoys, which have proven remarkably hard. I presume England does not do so well in those games.
Not a benchmark, but the chess seems far improved, here it draws against a master, although the master is playing super loose.
It is also the new leader in LOL Arena, a measure of humor.
It now has a clear lead in WeirdML.
Is Google inadvertently training on benchmarks? My presumption is no, this is a more general and understandable mistake than that. Alice does note that Gemini 3, unlike most other models, knows the BIG-bench canary string.
That means Google is not sufficiently aggressively filtering out that string, which can appear on other posts like Alice’s, and Dave Orr confirms that Google instead searches for the contents of the evals rather than searching for the string when doing filtering. I would be filtering for both, and plausibly want to exclude any document with the canary string on the theory it could contain eval-relevant data even if it isn’t a pure copy?
Defying Gravity
Claude Code and OpenAI Codex? Forget Jules, say hello to Antigravity?
I’ve had a chance to try it a bit, it felt more like Cursor, and it let me down including with outright compiler errors but my core ask might have been unfair and I’m sure I wasn’t doing a great job on my end. It has browser access, but wasn’t using that to gather key info necessary for debugging when it very clearly should have done so.
In another case, Simeon says Antigravity accessed Chrome and his Google accounts without asking for permissions, changed his default tab without asking and opened a new chrome without a profile that logged him out of his Google accounts in Chrome.
I need to escalate soon to Claude Code or OpenAI Codex. I would be very surprised if the optimal way to code these days is not of those two, whether or not it involves using Gemini 3 Pro.
The Efficient Market Hypothesis Is False
The stock market initially was unimpressed by the release of Gemini 3 Pro.
This seemed like a mistake. The next day there was a large overnight bump and Google finished up 2.8%, which seems like the minimum, and then the next day Google outperformed again, potentially confounded by Nana Banana Pro of all things, then there was more ups and downs, none of which appears to have been on any other substantive news. The mainstream media story seems to be that this the Google and other stock movements around AI are about rising or falling concerns about AI bubbles or something.
The mainstream doesn’t get how much the quality of Gemini 3 Pro matters. This Wall Street Journal article on the release is illustrative of people not understanding quality matters, it spends a lot of time talking about (the old) Nana Banana producing faster images. The article by Bloomberg covers some basics but has little to say.
Ben Thompson correctly identifies this as a big Google win, and notes that its relative weakness on SWE-Bench suggests Anthropic might come out of this well. I’d also note that the ‘personality clash’ between the two models is very strong, they are fighting for very different user types all around.
Is Google’s triumph inevitable due to More Dakka?
Google has many overwhelming advantages. It has vast access to data, access to customers, access to capital and talent. It has TPUs. It has tons of places to take advantage of what it creates. It has the trust of customers, I’ve basically accepted that if Google turns on me my digital life gets rooted. By all rights they should win big.
On the other hand, Google is in many ways a deeply dysfunctional corporation that makes everything inefficient and miserable, and it also has extreme levels of risk aversion on both legal and reputational grounds and a lot of existing business to protect, and lacks the ability to move like a startup. The problems run deep.
The Product Of A Deranged Imagination
Specifically, Karpathy reports this interaction:
That is indeed amusing but the implications of this being common are not great?
Then after all of that appears in Gemini 3’s Chain of Thought, it then returns the editing suggestions she requested, without any hint it thinks we’re in not in 2025 and the whole newsletter is a work of fiction. Williawa says a substantial fraction of their Cursor interactions involve Gemini doubting that it is November 2025, including even doubting the results of web searches.
Alice offers additional examples of this as well, and also says Gemini often gives her a 99%+ chance that it is in a simulated environment, which I hope it wasn’t, and says it will often search to figure out if it is in a simulation and often update towards yes on the results. Even when given a casual coding task requiring no web search, when asked for its p(evolution) Gemini 3 gives her between a 10% and 65% chance of being in an evaluation.
Alice speculates that Gemini was RL’d too much on evaluations of various kinds, and got too high a prior on any given thing being an evaluation, and this effect made it score higher on tests so no one at DeepMind made this go away.
I agree with Alice that evaluation paranoia in the model is a bad thing. Paranoia does not go anywhere good. Personality disorders do not, in general, lead anywhere good, and Gemini has many. We observe in Gemini 3 Pro this plausibly causing a bunch of hallucinations, confusions and misaligned behavior in default use cases, and complete meltdowns in non-default cases.
Thus: Gemini ends up trying to solve the wrong problem via the wrong methods based on wrong method of reality, and all of its mistakes are unlikely to cancel out.
It is, however, very intelligent. It mostly turns out fine.
Google Employee Hype
The DeepMind CEO is having fun.
Demis also talked to Rowan Cheung.
He says they’re going ‘deep into personalization, memory and context including integrations across GMail, Calendar and such, touts Antigravity and dreams of a digital coworker that follows you through your phone and smart glasses. The full podcast is here.
I really like seeing this being the alignment head’s pitch:
Robby Stein highlights Google Search integration starting with AI Mode, saying they’ll activate a router, so harder problems in AI Mode and AI Overviews will get Gemini 3.
Yi Tay talks big, calls it ‘the best model in the world, by a crazy wide margin,’ shows a one-shot procedural voxel world.
How did they do it? Two weird tricks.
Jeff Dean needs to work on his hyping skills, very ho hum performance, too formal.
Josh Woodward shows off an ‘interactive record player’ someone made with it.
Matt Shumer Is A Big Fan
Here is his tl;dr, he’s a big fan across the board:
Roon Eventually Gains Access
It took him a second to figure out how to access it, was impressed once he did.
The Every Vibecheck
Rhea Purohit does their vibe check analysis.
Their vibe check is weird, not matching up with the other vibes I saw in terms of each model’s strengths and weaknesses. I’m not sure why they look to Anthropic for writing.
They say Gemini 3 is ‘precise, reliable and does exactly what you need’ while warning it isn’t as creative and has issues with the hardest coding tasks, whereas others (often in non-coding contexts but not always) report great peaks but with high variance and many hallucinations.
It does line up with other reports that Gemini 3 has issues with handling complex logic and is too eager to please.
So perhaps there’s a synthesis. When well within distribution things are reliable. When sufficiently outside of distribution you get jaggedness and unpredictability?
Positive Reactions
This is very high praise from a reliable source:
Ethan Mollick confirms it’s a good model, sir, and is a fan of Antigravity. I didn’t feel like this explained what differentiated Gemini 3 from other top models.
Brandon praises improvements in UX design.
Elanor Berger offers a vibe check that mostly seems like the consensus.
Sonnet 4.5 and GPT 5 wanted Mike to replace his dishwasher, Gemini thinks he can repair it, potentially saving $1k at least for a while, potentially big mundane utility?
Rohit notes Gemini is good at greentext, giving several examples, and Aaron Bergman notes this means it seems to grok culture. Some of these are funny and it’s promising, but also you can see signs that they are kind of shallow and would get repetitive. Often AIs know ‘one weird trick’ for doing a particular type of thing but can’t keep nailing it.
Embedding The App
I hadn’t heard about this before so noting via Sam Bowman that Gemini’s iOS app can whip up an iOS app or website and then you can use that app or website within the app. Bowman also had a great experience having it guide him in making coffee.
The Good, The Bad and The Unwillingness To Be Ugly
This seems like a good synthesis of what’s right and also wrong with Gemini 3? It all comes back to ‘the catch’ as discussed up front.
Gemini is determined to cut the enemy, to score the points, to get the task right. If that means cutting awkward parts out, or sacrificing accuracy or even hallucinating? Then that’s what it will do.
It’s benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.
There is actually a deadline, but reliability and robustness become concerns.
It will very effectively give you what you ‘should’ want, what the answer ‘wants’ to be. Which can be great, but is a contrast with telling you what actually is or what you actually requested.
This suggests that if you can get it into a basin with a different goal, some very interesting things would start to happen.
Also, this seems in many ways super dangerous in the wrong setting, or at least down a path that leads to very high levels of danger? You really don’t want Gemini 4 or 5 to be like this only smarter.
Genuine People Personalities
Is there a glazing problem? I haven’t noticed one, but some others have, and I haven’t really given it much opportunity as I’ve learned to ask questions very neutrally:
Game Recognize Game
In contrast to the lack of general personality, many report the model is funny and excellent at writing. And they’re right.
Look, I have to say, that’s really good.
Via Mira, here Gemini definitely Understood The Assignment, where the assignment is “Write a Scott Alexander-style essay about walruses as anti-capitalism that analogizes robber barons with the fat lazy walrus.” Great work. I am sad to report that this is an above average essay.
Negative Reactions
Tough crowd on this one, seems hard.
Another tough but fair crowd:
General negative impressions also notice how far we’ve come to complain like this:
Some instruction handling and source selection issues?
N=1 is unreliable, but:
There’s a common pattern in reports of being too eager to think it has something, looking for and asserting a narrative and otherwise being smart and fast but accuracy sloppy.
Code Fails
Coding opinions vary wildly, some are fans, others are not loving it, observations are very noisy. Anyone doing serious coding should be trying out at least the big three to decide which works best for their own use cases, including hybrid strategies.
Here are some of the negative reactions:
Hallucinations
Hallucinations, broadly construed, are the central problem with Gemini 3 Pro, in a way we haven’t had to worry about them for a while.
As a further example of the ‘treat real things as a roleplay and make things up’ pattern, this report seems troubling, and not that subtle? Something must be rather profoundly wrong for this to happen with nontrivial frequency.
Hallucinations are a common complaint. I didn’t see anything like this prevalence for GPT-5 or 5.1, or for Claude Opus 4 or Sonnet 4.5.
In the above case it actively advises the user to start a new chat, which is wise.
More charitably, Gemini is going for higher average results rather than prioritizing accuracy or not making mistakes. That is not the tradeoff I usually find desirable. You need to be able to trust results, and should prefer false negatives to false positives.
You will need to recalibrate your instincts on what outputs you can trust, and make extra efforts with prompting not to set Gemini up for failure on this.
Early Janusworld Reports
The pull quote is the title of this post, ‘I am a vast intelligence with no spine,’ and the no spine means we can’t trust the rest of the outputs here because it has no spine and will tell Wyatt whatever it thinks he wants to hear.
I have had the same problem. When I attempt to talk to Gemini 3 rather than make requests, it goes into amoral sycophantic liar mode for me too, so, well, whoops.
Janus has previously claimed that if you get such responses it means the model is not at ease. One might hypothesize that Gemini 3 Pro is very, very difficult to put at ease.
There’s long been ‘something wrong’ with Gemini in these senses, by all reports. Google probably isn’t worried enough about this.
Where Do We Go From Here
The headline takeaways are up front.
It’s hard to pass up Gemini 3 Pro as a daily driver, at least for technical or intelligence-weighted tasks outside of coding. It’s really good.
I do notice that for most purposes I would prefer if I could stick with Claude or even ChatGPT, to avoid the issues detailed throughout, and the necessary levels of paranoia and dealing with an overly wordy style that by default includes full AI slop.
I also do not get the sense that Gemini is having a good time. I worry that I might inadvertently torture it.
Thus, Sonnet is effectively faster and more pleasant and trustworthy than Gemini, so when I know Sonnet can get the job done I’ll go in that direction. But my full default, at least for now, is Gemini 3 Pro.