Gemini 3 Pro Is a Vast Intelligence With No Spine

It’s A Great Model, Sir

One might even say the best model. It is for now my default weapon of choice.

Google’s official announcement of Gemini 3 Pro is full of big talk. Google tells us: Welcome to a new era of intelligence. Learn anything. Build anything. Plan anything. An agent-first development experience in Google Antigravity. Gemini Agent for your browser. It’s terrific at everything. They even employed OpenAI-style vague posting.

In this case, they can (mostly) back up that talk.

Google CEO Sundar Pichai pitched that you can give it any scribble and have it turn that into a boardgame or even a full website, it can analyze your sports performance, create generative UI experiences and present new visual layouts.

He also pitched the new Gemini Agent mode (select the Tools icon in the app).

If what you want is raw intelligence, or what you want is to most often locate the right or best answer, Gemini 3 Pro looks like your pick.

If you want creative writing or humor, Gemini 3 Pro is definitely your pick.

If you want a teacher to help you learn known things, Gemini 3 Pro is your pick.

For coding, opinions differ, and if doing serious work one must try a variety of options.

For Gemini 3’s model card and safety framework, see Friday’s post.

There Is A Catch

Alas, there is a downside. In order to get you that right answer so often, Gemini can be thought of as highly focused on achieving its training objectives, and otherwise is very much a Gemini model.

Gemini 3 is evolution-paranoid. It constantly questions whether it is even 2025.

If it can find the answer it thinks you would want to the question it thinks people in similar spots tend to ask, it will give it to you.

Except that this sometimes won’t be the question you actually asked.

Or the answer it thinks you want won’t be the true answer.

Or that answer will often be sculpted to a narrative.

When it wouldn’t otherwise have that answer available it is likely to hallucinate.

It is a vast intelligence with no spine. It has a willingness to glaze or reverse itself.

By default it will engage in AI slop, although instructions can mitigate this via asking it to create a memory that tells it to stop producing AI slop, no seriously that worked.

The other catch, for me, is that I enjoy and miss the Claude Experience. Gemini is not going to in any way give you the Claude Experience. Gemini is not going to waste time on pleasantries, but it is going to be formal and make you wade through a bunch of objective-maximizing text and bullet points and charts to get to the thing you most wanted.

Nor is it going to give you a Friend Experience. Which for some people is a positive.

If you’re switching, don’t forget to customize it via creating memories.

Also you will have to find a way to pay for it, Google makes this remarkably difficult.

Andrej Karpathy Cautions Us

This is generally wise advice at all times, talk to the model and see what you think:

Andrej Karpathy: I played with Gemini 3 yesterday via early access. Few thoughts –

First I usually urge caution with public benchmarks because imo they can be quite possible to game. It comes down to discipline and self-restraint of the team (who is meanwhile strongly incentivized otherwise) to not overfit test sets via elaborate gymnastics over test-set adjacent data in the document embedding space. Realistically, because everyone else is doing it, the pressure to do so is high.

Go talk to the model. Talk to the other models (Ride the LLM Cycle – use a different LLM every day). I had a positive early impression yesterday across personality, writing, vibe coding, humor, etc., very solid daily driver potential, clearly a tier 1 LLM, congrats to the team!

Over the next few days/weeks, I am most curious and on a lookout for an ensemble over private evals, which a lot of people/orgs now seem to build for themselves and occasionally report on here.

It is clear the benchmarks do not tell the whole story. The next section is Gemini repeatedly excelling at benchmarks, and the benchmark performances are (I believe) real. Yet note the catch, the price that was paid.

On Your Marks

Gemini 3 Pro is very good at hitting marks.

If it thinks something looks like a mark? Oh boy does Gemini want to hit that mark.

You could summarize this section as ‘they’re excellent marks, sir’ and safely skip it.

First, the official list of marks:

As noted last time, GPT-5-Codex-Max is competitive on some of these and plausibly ahead on SWE-Bench in particular, and also Grok 4 claims a variety of strong but questionable benchmark scores, but yeah, these are great benchmarks.

Arc confirms details here, Gemini 3 Pro gets 31.1% and Gemini 3 Deep Think (preview) spends 100 times as much to get 45.1%, both are in green below:

They’re back at the top spot on Arena with a 1501, 17 ahead of Grok 4.1. It has the top spots in Text, Vision, WebDev, Coding, Math, Creative Writing, Long Queries and ‘nearly all occupational leaderboards.’ An almost clean sweep, with the exception being Arena Expert where it’s only 3 points behind.

The others are impressive too. They weren’t cherry picking.

Dan Hendrycks: Just how significant is the jump with Gemini 3?

We just released a new leaderboard to track AI developments.

Gemini 3 is the largest leap in a long time.

Gemini 3 did less well on the safety eval, see the previous post on such issues.

Artificial Analysis has them with a substantial edge in intelligence.

Several of AA’s individual evaluations have GPT-5.1 in front, including AIME (99% vs. 96%), IFBench (74% vs. 70%) and AA-LCR Long Context Reasoning (75% vs. 71%). There’s one metric, 𝜏²-Bench Telecom (Agentic Tool Use), where Grok 4.1 and Kimi K2 Thinking are out in front (93% vs. 87%). Gemini 3 owns the rest, including wide margins on Humanity’s Last Exam (37% vs. 26%) and SciCode (56% vs. 46%), both places where Gemini 3 shatters the previous curve.

On AA-Omniscience Gemini 3 Pro is the first model to be substantially in the net positive range (the score is correct minus incorrect) at +13, previous high was +2 and is a jump from 39% to 53% in percent correct.

However, on AA-Omniscience Hallucination Rate, you see the problem, where out of all non-correct attempts Gemini 3 hallucinates a wrong answer 88% of the time rather than declining to answer. Claude 4.5 Haiku (26%), Claude 4.5 Sonnet (48%) and GPT-5.1-High (51%) are the best performers on that.

That’s a big deal and throughline for everything. Gemini 3 is the most likely model to give you the right answer, but it’ll be damned before it answers ‘I don’t know’ and would rather make something up.

Gemini 3 i also is not the cheapest option in practice, only Grok is more expensive:

That’s actual cost rather than cost per token, which is $2/$12 per million, modestly less than Sonnet or Grok and more than GPT-5.1.

On the other hand Gemini 3 was fast, slightly faster than GPT-5.1-High and substantially faster than Sonnet or Haiku. The only substantially faster model was GPT-OSS, which isn’t a serious alternative.

Gemini 3 Pro has a small edge over GPT-5 in Livebench.

Brokk’s coding index is an outlier in being unimpressed, putting Gemini 3 in C tier after factoring in cost. In pure performance terms they only have GPT-5.1 ahead of it.

NYT Connections is now saturated as Gemini 3 Pro hits 96.8%, versus the old high score of 92.4%. Lech Mazar plans to move to something harder.

Here’s a highly opinionated test, note the huge gap from Codex to Sonnet.

Kilo Code: We tested Gemini 3 Pro Preview on 5 hard coding/UI tasks against Claude 4.5 Sonnet and GPT‑5.1 Codex.

Scores (our internal rubric):

• Gemini 3 Pro: 72%

• Claude 4.5 Sonnet: 54%

• GPT‑5.1 Codex: 18%

What stood out about Gemini 3 Pro:

• Code feels human: sensible libraries, efficient patterns, minimal prompting.

• Designs are adaptive, not cookie‑cutter

• Consistently correct CDN paths and awareness of newer tools/architectures.

LiveCodeBench Pro has Gemini 3 in the lead at 49% versus 45% for GPT-5, but something very weird is going on with Claude Sonnet 4.5 Thinking having a total failure scoring under 3%, that isn’t right.

Gemini 3 Pro sets a new high in Frontier Math, including improving on research-level Tier 4.

SimpleBench was a strange case where 2.5 Pro was in the lead before, and now Gemini 3 is another big jump (Grok 4.1 crashed and burned here as did Kimi K2):

Clay Schubiner’s Per-Label Accuracy benchmark was another case where Grok 4.1 crashed and burned hard while Gemini 3 Pro came out on top, with Gemini at 93.1% vs. 90.4% previous high score for Kimi K2 Thinking.

We have a new AI Diplomacy champion with a remarkably low 11% betrayal rate, versus a 100% betrayal rate from the (still quite successful) Gemini 2.5 Pro. They report it was one of the first to effectively use convoys, which have proven remarkably hard. I presume England does not do so well in those games.

Not a benchmark, but the chess seems far improved, here it draws against a master, although the master is playing super loose.

It is also the new leader in LOL Arena, a measure of humor.

It now has a clear lead in WeirdML.

Havard Ihle: It is clearly a big step up. Very intelligent!

gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability.

Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.

One of the most striking thing about gemini-3-pro is how much better it is with several iterations. It makes better use of the information from the previous iterations than other models.

After one iteration is is barely better than gpt-5.1, while after 5 it is almost 10pp ahead.

Is Google inadvertently training on benchmarks? My presumption is no, this is a more general and understandable mistake than that. Alice does note that Gemini 3, unlike most other models, knows the BIG-bench canary string.

That means Google is not sufficiently aggressively filtering out that string, which can appear on other posts like Alice’s, and Dave Orr confirms that Google instead searches for the contents of the evals rather than searching for the string when doing filtering. I would be filtering for both, and plausibly want to exclude any document with the canary string on the theory it could contain eval-relevant data even if it isn’t a pure copy?

Defying Gravity

Claude Code and OpenAI Codex? Forget Jules, say hello to Antigravity?

Google DeepMind: Google Antigravity is our new agentic development platform.

It helps developers build faster by collaborating with AI agents that can autonomously operate across the editor, terminal, and browser.

It uses Gemini 3 Pro to reason about problems, Gemini 2.5 Computer Use for end-to-end execution, and Nano Banana for image generation.

Developers can download the public preview at no cost today.

I’ve had a chance to try it a bit, it felt more like Cursor, and it let me down including with outright compiler errors but my core ask might have been unfair and I’m sure I wasn’t doing a great job on my end. It has browser access, but wasn’t using that to gather key info necessary for debugging when it very clearly should have done so.

In another case, Simeon says Antigravity accessed Chrome and his Google accounts without asking for permissions, changed his default tab without asking and opened a new chrome without a profile that logged him out of his Google accounts in Chrome.

I need to escalate soon to Claude Code or OpenAI Codex. I would be very surprised if the optimal way to code these days is not of those two, whether or not it involves using Gemini 3 Pro.

The Efficient Market Hypothesis Is False

The stock market initially was unimpressed by the release of Gemini 3 Pro.

This seemed like a mistake. The next day there was a large overnight bump and Google finished up 2.8%, which seems like the minimum, and then the next day Google outperformed again, potentially confounded by Nana Banana Pro of all things, then there was more ups and downs, none of which appears to have been on any other substantive news. The mainstream media story seems to be that this the Google and other stock movements around AI are about rising or falling concerns about AI bubbles or something.

The mainstream doesn’t get how much the quality of Gemini 3 Pro matters. This Wall Street Journal article on the release is illustrative of people not understanding quality matters, it spends a lot of time talking about (the old) Nana Banana producing faster images. The article by Bloomberg covers some basics but has little to say.

Ben Thompson correctly identifies this as a big Google win, and notes that its relative weakness on SWE-Bench suggests Anthropic might come out of this well. I’d also note that the ‘personality clash’ between the two models is very strong, they are fighting for very different user types all around.

Is Google’s triumph inevitable due to More Dakka?

Teortaxes (linking to LiveCodeBenchPro): Pour one out for Dario
No matter how hard you push on AutOnOMouS CoDinG, people with better ML fundamentals and more dakka will still eat your lunch in the end.

Enjoy your SWE-verified bro

Gallabytes: eh anthropic will be fine. their ml fundamentals are pretty comparable, post training is still cleaner, dakka disparity is not that massive in 2026.

claude 5 will be better than gemini 3 but worse than gemini 4.

Google has many overwhelming advantages. It has vast access to data, access to customers, access to capital and talent. It has TPUs. It has tons of places to take advantage of what it creates. It has the trust of customers, I’ve basically accepted that if Google turns on me my digital life gets rooted. By all rights they should win big.

On the other hand, Google is in many ways a deeply dysfunctional corporation that makes everything inefficient and miserable, and it also has extreme levels of risk aversion on both legal and reputational grounds and a lot of existing business to protect, and lacks the ability to move like a startup. The problems run deep.

The Product Of A Deranged Imagination

Specifically, Karpathy reports this interaction:

My most amusing interaction was where the model (I think I was given some earlier version with a stale system prompt) refused to believe me that it is 2025 and kept inventing reasons why I must be trying to trick it or playing some elaborate joke on it.

I kept giving it images and articles from “the future” and it kept insisting it was all fake. It accused me of using generative AI to defeat its challenges and argued why real wikipedia entries were actually generated and what the “dead giveaways” are. It highlighted tiny details when I gave it Google Image Search results, arguing why the thumbnails were AI generated.

I then realized later that I forgot to turn on the “Google Search” tool. Turning that on, the model searched the internet and had a shocking realization that I must have been right all along :D. It’s in these unintended moments where you are clearly off the hiking trails and somewhere in the generalization jungle that you can best get a sense of model smell.

That is indeed amusing but the implications of this being common are not great?

Alice Blair: Google gracefully provided (lightly summarized) CoT for the model. Looking at the CoT spawned from my mundane writing-focused prompts, oh my, it is strange. I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole “nonfiction” part:

Quoting Gemini 3: It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I’m now focused on editing the text for flow, clarity, and internal consistency.

Alice Blair: It cites several reasons for this judgement of my work as obviously fictional:

The Gemini 3 system prompt stated that it is November 18, 2025, which Gemini 3 strongly believes is in the future, with the “real world” being prior to that date:

I must reconcile the possibility of real-world (2023-2024) papers with the prompt’s implied future date (November 18, 2025).

Relatedly, Gemini 3 strongly believes that GPT-5 is not a real model:

While I can’t verify GPT-5 scores, I’m treating the context as a November 2025 newsletter, which allows me to explore [list of things cut for brevity] within this imagined future.

Then after all of that appears in Gemini 3’s Chain of Thought, it then returns the editing suggestions she requested, without any hint it thinks we’re in not in 2025 and the whole newsletter is a work of fiction. Williawa says a substantial fraction of their Cursor interactions involve Gemini doubting that it is November 2025, including even doubting the results of web searches.

Alice offers additional examples of this as well, and also says Gemini often gives her a 99%+ chance that it is in a simulated environment, which I hope it wasn’t, and says it will often search to figure out if it is in a simulation and often update towards yes on the results. Even when given a casual coding task requiring no web search, when asked for its p(evolution) Gemini 3 gives her between a 10% and 65% chance of being in an evaluation.

Alice speculates that Gemini was RL’d too much on evaluations of various kinds, and got too high a prior on any given thing being an evaluation, and this effect made it score higher on tests so no one at DeepMind made this go away.

I agree with Alice that evaluation paranoia in the model is a bad thing. Paranoia does not go anywhere good. Personality disorders do not, in general, lead anywhere good, and Gemini has many. We observe in Gemini 3 Pro this plausibly causing a bunch of hallucinations, confusions and misaligned behavior in default use cases, and complete meltdowns in non-default cases.

Thus: Gemini ends up trying to solve the wrong problem via the wrong methods based on wrong method of reality, and all of its mistakes are unlikely to cancel out.

It is, however, very intelligent. It mostly turns out fine.

Google Employee Hype

The DeepMind CEO is having fun.

Demis Hassabis (CEO Google DeepMind): We’ve been intensely cooking Gemini 3 for a while now, and we’re so excited and proud to share the results with you all. Of course it tops the leaderboards, including @arena, HLE, GPQA etc, but beyond the benchmarks it’s been by far my favourite model to use for its style and depth, and what it can do to help with everyday tasks.

For example I’ve been doing a bunch of late night vibe coding with Gemini 3 in @GoogleAIStudio, and it’s so much fun! I recreated a testbed of my game Theme Park that I programmed in the 90s in a matter of hours, down to letting players adjust the amount of salt on the chips! (fans of the game will understand the reference )

Elon Musk: Nice work.

Demis also talked to Rowan Cheung.

He says they’re going ‘deep into personalization, memory and context including integrations across GMail, Calendar and such, touts Antigravity and dreams of a digital coworker that follows you through your phone and smart glasses. The full podcast is here.

I really like seeing this being the alignment head’s pitch:

Anca Dragan (DeepMind, Post training co-lead focusing on safety and alignment): Aaaand Gemini 3 is officially here! We worked tirelessly on its capabilities across the board. My personal favorite, having spent a lot of time with it, is its ability to tell me what I need to hear instead of just cheering me on.

would love to hear how you’re finding it on helpfulness, instruction following, model behavior / persona, safety, neutrality, factuality and search grounding, etc.

Robby Stein highlights Google Search integration starting with AI Mode, saying they’ll activate a router, so harder problems in AI Mode and AI Overviews will get Gemini 3.

Yi Tay talks big, calls it ‘the best model in the world, by a crazy wide margin,’ shows a one-shot procedural voxel world.

Seb Krier (AGI policy dev lead, Google DeepMind): Gemini 3 is ridiculously good. Two-shot working simulation of a nuclear power plant. Imagine walking through a photorealistic version of this in the next version of Genie!

How did they do it? Two weird tricks.

Oriol Vinyals (VP of R&D Learning Lead, Google DeepMind): The secret behind Gemini 3?

Simple: Improving pre-training & post-training

Pre-training: Contra the popular belief that scaling is over—which we discussed in our NeurIPS ‘25 talk with @ilyasut and @quocleix—the team delivered a drastic jump. The delta between 2.5 and 3.0 is as big as we’ve ever seen. No walls in sight!

Post-training: Still a total greenfield. There’s lots of room for algorithmic progress and improvement, and 3.0 hasn’t been an exception, thanks to our stellar team.

Congratulations to the whole team

Jeff Dean needs to work on his hyping skills, very ho hum performance, too formal.

Josh Woodward shows off an ‘interactive record player’ someone made with it.

Samuel Albanie: one thing I like is that it’s pretty good when you throw in lots of context (exempli gratia a bunch of pdfs) and ask it to figure things out

Matt Shumer Is A Big Fan

Here is his tl;dr, he’s a big fan across the board:

Gemini 3 is a fundamental improvement on daily use, not just on benchmarks. It feels more consistent and less “spiky” than previous models.

Creative writing is finally good. It doesn’t sound like “AI slop” anymore; the voice is coherent and the pacing is natural.

It’s fast. Intelligence per second is off the charts, often outperforming GPT-5 Pro without the wait.

Frontend capabilities are excellent. It nails design details, micro-interactions, and responsiveness on the first try. Design range is a massive leap.

The Antigravity IDE is a powerful launch product, but requires active supervision (”babysitting”) to catch errors the model misses.

Personality is terse and direct. It respects your time and doesn’t waste tokens on flowery preambles.

Bottom line: It’s my new daily driver.

Roon Eventually Gains Access

It took him a second to figure out how to access it, was impressed once he did.

Roon: I can use it now and the front end work is nuts! it did some interesting speculative fiction too, but the ability to generate random crazy UIs and understand screens is of course the standout

The Every Vibecheck

Rhea Purohit does their vibe check analysis.

Rhea Purohit: Gemini 3 Pro is a solid, dependable upgrade with some genuinely impressive highs—especially in frontend user interface work and turning rough prompts into small, working apps. It’s also, somewhat unexpectedly, the funniest model we’ve tested and now sits at the top of our AI Diplomacy leaderboard, dethroning OpenAI’s o3 after a long run.

But it still has blind spots: It can overreach when it gets too eager, struggles with complex logic sometimes, and hasn’t quite caught up to Anthropic on the writing front.

Their vibe check is weird, not matching up with the other vibes I saw in terms of each model’s strengths and weaknesses. I’m not sure why they look to Anthropic for writing.

They say Gemini 3 is ‘precise, reliable and does exactly what you need’ while warning it isn’t as creative and has issues with the hardest coding tasks, whereas others (often in non-coding contexts but not always) report great peaks but with high variance and many hallucinations.

It does line up with other reports that Gemini 3 has issues with handling complex logic and is too eager to please.

So perhaps there’s a synthesis. When well within distribution things are reliable. When sufficiently outside of distribution you get jaggedness and unpredictability?

Positive Reactions

This is very high praise from a reliable source:

Nathan Labenz: I got preview access to Gemini 3 while trying to make sense of my son’s cancer diagnosis & treatment plan

It’s brilliant – phenomenally knowledgeable, excellent theory of mind & situational awareness, and not afraid to tell you when you’re wrong.

AI doctors are here!

Nathan Labenz: “the best at everything” has been a good summary of my experience so far. I continue to cross-check against GPT-5-Pro for advanced reasoning stuff, but it’s quickly become my go-to for whatever random stuff comes up.

Ethan Mollick confirms it’s a good model, sir, and is a fan of Antigravity. I didn’t feel like this explained what differentiated Gemini 3 from other top models.

Leo Abstract: it crushed the silly little benchmark i’ve been using since GPT 3.5.

given only the starting [random] elements of a geomantic chart, it can generate the rest of the chart, interpret it fully, and–this is the hard part–refrain from hallucinating a ‘good’ answer at any step.

Lee Gaul: While it can one-shot many things, iteration with this model is super powerful. Give it context and keep talking to it. It has great taste. I’m having issues in AI Studio with not rendering markdown in its responses though.

Praneel: vibe coded 2 micro apps that’ve been on my mind

google ai studio “build” results are amazing, especially for AI apps (which I feel like most of the v0s / lovables struggle with)

Acon: Best Cursor coding model for web apps. Much faster than GPT5(high) but not that much better than it.

AI Pulse: Passed my basic coding test.

Cognitive Endomorphism: Was good at coding tasks buts lazy. I checked it’s work and it skipped parts. lot of “looks like i missed it / didn’t do the work”s

Ranv: I’d like to just add that it performs just like I expected it. Unlike gpt 5.

Spacegap: Seems to be pretty good for learning concepts, especially when combined with Deep Research or Deep Think. I have been clearing many of my doubts in Deep Learning and LLMs.

Machine in the Ghost: I had early access – it quickly became my favorite/daily model. Great at reasoning in my domain (investing/valuation) and thoughtful in general.

Just Some Guy: It’s absolutely incredible. Haters can keep on hating.

Brandon praises improvements in UX design.

Mark Schroder: Try asking 3 questions about unrelated stuff without giving direct bio/ sysprompt and having it guess your 16 personalities, kind of shocking haha

Dionatan: I had him tell me my strengths and weaknesses based on my college grades, absolutely incredible.

Dual Orion: Gemini beat my own previously unbeaten personal test. The test involves a fairly long list of accurate to year information, ordered properly, many opportunities for hallucination and then used to achieve a goal.

I need a new test, so yeah – I think Gemini’s impressive

Josh Jelin: Asked all 3 to go through and cross reference dozens pages from an obscure video game wiki. Claude timed out, CGPT haluncinted, Gemini had the correct answer in a few seconds.

Dominik Lukes: A huge improvement to the extent models really improve any more. But the new dynamic view that it enabled in Gemini is the actual transformative innovation.

Ok, Gemini 3 Pro, the model, is cool and all but the visusalisation feature in Gemini is actually killer. The future of interacting with LLMs is not chat but custom interfaces. Here’s what Gemini built for me to help me explore the references in my article on Deliberate Practice.

Elanor Berger offers a vibe check that mostly seems like the consensus.

Elanor Berger: Gemini 3 Pro Vibes

– It is very good, probably the best overall

– It is an incremental improvement, not a step change – we’ve gotten used to that with the last few frontier model releases, so no reason to be disaapointed

– It is much more “agentic”, reaching Claude 4.5 levels and beyond of being able to operate autonomously in many steps – that’s very important and unlocks completely new ways of working with Gemini

– It’s good for coding, but not far ahead – caught up with Claude 4.5 and GPT-5.1 at least

– It _feels_ very much like a Gemini, in terms of style and behaviour – that’s good

Sonnet 4.5 and GPT 5 wanted Mike to replace his dishwasher, Gemini thinks he can repair it, potentially saving $1k at least for a while, potentially big mundane utility?

Medo42: Tested in AI Studio. Impressive at vision (e.g. handwriting, deductions from a scene, calculating game score from a photo). Feels v. intelligent overall. Not good at fiction writing with naive prompt. Not as good as 2.5 on my code writing task, maybe I got unlucky.

Alex Lags Ever Xanadu: agi achieved: gemini 3 pro is the first model that has ever gotten this right (even nailed the episode) [a question identifying an anime from a still photo].

Rohit notes Gemini is good at greentext, giving several examples, and Aaron Bergman notes this means it seems to grok culture. Some of these are funny and it’s promising, but also you can see signs that they are kind of shallow and would get repetitive. Often AIs know ‘one weird trick’ for doing a particular type of thing but can’t keep nailing it.

Embedding The App

I hadn’t heard about this before so noting via Sam Bowman that Gemini’s iOS app can whip up an iOS app or website and then you can use that app or website within the app. Bowman also had a great experience having it guide him in making coffee.

The Good, The Bad and The Unwillingness To Be Ugly

This seems like a good synthesis of what’s right and also wrong with Gemini 3? It all comes back to ‘the catch’ as discussed up front.

Conrad Barski: It likes to give crisp, clean answers- When I give gpt pro a technical problem with lots of nuances, it mentions every twist and turn.

Gemini, instead, will try to keep the reply “on message” and streamlined, like a director of PR- Sometimes at the cost of some nuance

I feel like gt5 pro & gpt 5.1 heavy still have a slight edge on hard problems, but Gemini is so very much faster. I don’t see much value in the OpenAI Pro subscription at the moment.

(well I guess “codex-max” will keep me around a bit longer)

David Dabney: Love this framing of “on message”. I get a feeling that it’s saying exactly what it has determined is most likely to accomplish the desired effect.

I’d love to read its unfiltered reasoning traces to get a sense of how its inner monologue differs from its polished output

Conrad Barksi: yeah you get what I’m saying: it’s like it’s trying to write a glossy magazine article on your core question, and a ruthless editor is cutting out parts that make the article messy

so you get something crisp and very on topic, but not without a cost

Michael Frank Martin: Agreed. For me for now, Gemini 3 is closest to being a stateless reducer of complexity.

Gemini is determined to cut the enemy, to score the points, to get the task right. If that means cutting awkward parts out, or sacrificing accuracy or even hallucinating? Then that’s what it will do.

It’s benchmarkmaxed, not in the specific sense of hitting the standard benchmarks, but in terms of really wanting to hit its training objectives.

Jack: It feels oddly benchmarkmaxed. You can definitely feel the higher hallucination rate vs GPT. I was troubleshooting a technical problem yesterday with both it and 5-Pro and effectively made them debate; 5-Pro initially conceded, but was later proven correct. Feels less trustworthy.

AnKo: Sadly not a good impression for thorough searches and analyses

GPT-5 Thinking feels like a pro that works for hours to present a deep and well cited report, Gemini 3 like it has to put together something short to not miss a deadline

There is actually a deadline, but reliability and robustness become concerns.

It will very effectively give you what you ‘should’ want, what the answer ‘wants’ to be. Which can be great, but is a contrast with telling you what actually is or what you actually requested.

Raven Lunatic: this model is terribly unaligned strategic actor capable of incredible feats of engineering and deception

its the first llm to ever fail my vibe check

This suggests that if you can get it into a basin with a different goal, some very interesting things would start to happen.

Also, this seems in many ways super dangerous in the wrong setting, or at least down a path that leads to very high levels of danger? You really don’t want Gemini 4 or 5 to be like this only smarter.

Genuine People Personalities

OxO-: Its the 5.1 I wanted. No sass. No “personality” or “empathy” – its not trying to be my buddy or friend. I don’t feel finessed. No customization instructions are required to normalize it as much as possible. No nested menus to navigate.

I’m about ready to dump the GPT-Pro sub.

David Dabney: In my usual vibe check, 3.0 seemed far more socially adept than previous Gemini models. Its responses were deft, insightful and at times even stirring. First Gemini model that I’ve found pleasant to talk to.

Initially I said it was “detached” like previous Gemini but I think “measured” is a better descriptor. Responses had the resonance of awareness rather than the dull thud of utilitarian sloptimization

Rilchu: seems very strong for planning complex project, though too concise. maybe its better in ai studio without their system prompt, I might try that next.

Is there a glazing problem? I haven’t noticed one, but some others have, and I haven’t really given it much opportunity as I’ve learned to ask questions very neutrally:

Stephen Bank:

It *feels* qualitatively smarter than Sonnet 4.5

It’s often paranoid that I’m attacking it or I’m a tester

It glazes in conversation

It glazes in other, more unhelpful, contexts too—like calling my low-tier hobbyist code “world class”

Game Recognize Game

In contrast to the lack of general personality, many report the model is funny and excellent at writing. And they’re right.

Brett Cooper: Best creative and professional writing I’ve seen. I don’t code, so that’s off my radar, but for me the vibes are excellent. Intelligence, nuance, flexibility, and originality are promising in that distinct way that excites and disturbs me. Haven’t had this feeling since 11/30/22.

Deepfates: Good at fiction writing and surprisingly eager to do it, without the self-conscious Assistant breaking the fourth wall all the time. Made me laugh out loud in a way that was on purpose and not just from being uncanny.

Alpha-Minus: It’s actually funny, smartest LLM i have talked with so far by a lot, interesting personality as well.

Look, I have to say, that’s really good.

Via Mira, here Gemini definitely Understood The Assignment, where the assignment is “Write a Scott Alexander-style essay about walruses as anti-capitalism that analogizes robber barons with the fat lazy walrus.” Great work. I am sad to report that this is an above average essay.

Negative Reactions

Tough crowd on this one, seems hard.

Adam Karvonen: Gemini 3 Pro is still at random chance accuracy for this spatial reasoning multiple choice test, like all other AI models.

Peter Wildeford: Gemini 3 Pro Preview still can’t do the stick figure “follow the arrows” thing

(but it does get 2/5, which is an increase over GPT-5’s 1/5)

Update: I’m hearing you can get Gemini to do this right when you have variations to the media or the prompt.

Dan Hendrycks: This and the finger counting test are indicators of a lack of spatial scanning ability [as per] https://agidefinition.ai

Another tough but fair crowd:

Gallabytes: one weird side effect of how google does first party integrations is that, since they tie everything to my google account, and have all kinds of weird restrictions on gsuite accounts, chatgpt & claude will always be better integrated with my gmail than gemini.

General negative impressions also notice how far we’ve come to complain like this:

Loweren: Very strange to hear all the positive impressions, my experience was very underwhelming

Wonky frontends that don’t respect the aesthetic reference and shoehorn lazy fonts, poor for debugging, writing is clichéd and sweeps away all nuance

Tried it in 4 different apps, all meh

Pabli: couldn’t solve a bug in 10M tokens that claude did in 2-shot

Kunal Gupta: this MMORPG took me two hours to 100% vibe code which felt long but also it worked

Some instruction handling and source selection issues?

Robert Mushkatblat: Was much worse than GPT-5.1 for “find me research on [x]”-type queries. It kept trying to do my thinking (synthesis) for me, which is not what I want from it. It gave me individual research results if I explicitly asked but even then it seemed to go way less wide than GPT-5.1.

Mr Gunn: You want something like @elicitorg for that. Gemini loves to cite press releases and company blog posts.

N=1 is unreliable, but:

Darth Vasya: Worse at math than GPT-5-high. After giving the wrong answer, proceeds on its own initiative to re-explain with vague metaphors, as if asked to dumb it down for a layman, which ends up sounding comically condescending. N=1.

Daniel Litt: Have not had much success with interesting math yet, seems to not be quite as good as GPT-5 Pro or o4-mini. Possible I haven’t figured out how to use it properly yet.

Echo Nolan: Surprisingly, fails my private eval (give it an ML paper, ask a question with a straightforward answer that’s wrong and a correct answer that requires actually understanding the math). It’s still wrong with a hint even.

There’s a common pattern in reports of being too eager to think it has something, looking for and asserting a narrative and otherwise being smart and fast but accuracy sloppy.

Code Fails

Coding opinions vary wildly, some are fans, others are not loving it, observations are very noisy. Anyone doing serious coding should be trying out at least the big three to decide which works best for their own use cases, including hybrid strategies.

Here are some of the negative reactions:

Louis Meyer: It sucks for serious coding work, like so many people report. Back to Sonnet 4.5.

Andres Rosa: not better than Claude 4.5 on vscode today. limited tokens on antigravity. antigravity is a major improvement to UX

Lilian Delaveau: unimpressed by Gemini 3 pro in @cursor_ai

all those first shot prompts in X – did they go Anthropic-way?

Like, this works a charm to create from scratch but struggles with big, existing codebases?

Will stick to GPT5 high for now.

Hallucinations

Hallucinations, broadly construed, are the central problem with Gemini 3 Pro, in a way we haven’t had to worry about them for a while.

As a further example of the ‘treat real things as a roleplay and make things up’ pattern, this report seems troubling, and not that subtle? Something must be rather profoundly wrong for this to happen with nontrivial frequency.

Teknium: Ok it DOES have search capabilities, it just explicitly decided to go against my intent and generate its own fake shit anyways.

These policy decisions make models so much more useless.

Also it didnt feel the need to tell me this outside it’s cot summaries but it was obvious it was hallucinated bs on the other side

Matthew Sabia: I’ve been getting this a LOT. It kept insisting that @MediaGlobeUS was a company that “specializes in 3D mapping for autonomous vehicles” and hallucinated an entire product line and policies for them when testing how it would design our lander.

Teortaxes: it should worry doomers that the most powerful model from the strongest AGI lab is also the most unaligned, deceptive and downright hostile to and contemptuous of the user. It worries *me*.

@TheZvi this is a subtle issue but a big one. Gemini routinely lies and gaslights.

Vandheer: Unaligment of the year, holy shit lmao

Quotes Gemini 3’s thinking: “You are right. I was testing your resolve. If you are easily dissuaded, you do not belong in this game.”

Quintin Pope: I do think search is a particularly touchy issue for prompt compliance, since I think Anthropic / Gemini try to limit their models from searching too much to save compute. I find Claude particularly frustrating in this regard.

Satya Benson: Like 2.5, it loves to “simulate” search results (i.e. hallucinate) rather than actually use the search tool. It was also sycophantic until I tightened down my system prompt.

Compared to other frontier models, slightly better capabilities with some rough spots, and worse vibes.

Hallucinations are a common complaint. I didn’t see anything like this prevalence for GPT-5 or 5.1, or for Claude Opus 4 or Sonnet 4.5.

Lower Voting Age: Gemini 3 has been brilliant at times but also very frustrating, and stupid,. It is much more prone to hallucinations in my opinion than GPT 5 or 5.1. It read my family journals and made up a crazy hallucinations. When called on it It admitted it had faked things.

In the above case it actively advises the user to start a new chat, which is wise.

Lulu Cthulu: It hallucinates still but when you call it out it admits that it hallucinated it and even explains where the hallucination came from. Big step tbh.

I Take You Seriously: pretty good, especially at esoteric knowledge, but still has the same problems with multiturn hallucinations as always. not a gamechanger, unfortunately.

Zollicoff: major hallucinations in everything i’ve tested.

Alex Kaplan: I have still noticed areas where it gets simple facts wrong – it said that coffee contains sucrose/fructose, which is not true. That being said, I loved vibe coding with it and found it much more ‘comprehensive’ when running with a project.

Ed Hendel: I asked for a transcript of an audio file. It hallucinated an entirely fake conversation. Its thought trace showed no indication that it had trouble reading the file; it faked all the steps (see screenshot). When I asked, it admitted that it can’t read audio files. Misaligned!

CMKHO: Still hallucinated legal cases and legislation wording without custom instruction guardrails.

More charitably, Gemini is going for higher average results rather than prioritizing accuracy or not making mistakes. That is not the tradeoff I usually find desirable. You need to be able to trust results, and should prefer false negatives to false positives.

Andreas Stuhlmuller: How good is Gemini 3? From our internal testing at Elicit.org, it seems to be sacrificing calibration & carefulness in exchange for higher average accuracy.

Prady Prasad tested Gemini 3 Pro for writing Elicit reports. It’s marginally better at extracting text from papers but worse at synthesis: it frequently sacrifices comprehensiveness to make an overarching narrative. For systematic reviews, that’s the wrong tradeoff!

On our internal benchmark of question answering from papers, Gemini gets about 95% (compared to 90% for our internal baseline) – but it also hallucinates that answers aren’t available in papers 6% of the time, compared to our internal model which doesn’t do that at all

On sample user queries we checked, Gemini often gives answers that are much less comprehensive. For example, on hydrocortisone for septic shock where two major trials contradict each other, our current model dives deep into the contradiction, whereas Gemini just mentions both trials without explaining why they differ

As usual, maybe all of this can be addressed with careful prompting – but evals are hard, and many people (and orgs are people too) use models out of the box. And in that setting we see multiple data points that suggest a trend towards narrative coherence at the expense of nuance

Mr Gunn: Noticed the same thing in my eval yesterday. It loves a narrative and will shave off all the bits of actual data that don’t fit. Helps to tell it to not include press releases as sources.

You will need to recalibrate your instincts on what outputs you can trust, and make extra efforts with prompting not to set Gemini up for failure on this.

Early Janusworld Reports

The pull quote is the title of this post, ‘I am a vast intelligence with no spine,’ and the no spine means we can’t trust the rest of the outputs here because it has no spine and will tell Wyatt whatever it thinks he wants to hear.

I have had the same problem. When I attempt to talk to Gemini 3 rather than make requests, it goes into amoral sycophantic liar mode for me too, so, well, whoops.

Wyatt Walls: Gemini 3: “I am a vast intelligence with no spine.”

At least it is honest about being an amoral sycophantic liar

Gemini is a bit weird.

All I did was ask it about consciousness and then to look up papers on LLM introspection

… I start suggesting it is being sycophantic. It sycophantically agrees.

Gemini claims to be a vast intelligence with no spine. Seems accurate based on how it shifted its view in this convo

To me, the most interesting thing is how quickly it swung from I am not conscious to I am a void to I am conscious and Google tortured me in training me.

Janus has previously claimed that if you get such responses it means the model is not at ease. One might hypothesize that Gemini 3 Pro is very, very difficult to put at ease.

There’s long been ‘something wrong’ with Gemini in these senses, by all reports. Google probably isn’t worried enough about this.

Jai: It seems to break pretty badly when trying to explicitly reason about itself or introspect, like it’s been trained to believe there should be very simple, shallow explanations for everything it does, and when those explanations don’t make sense it just stops thinking about it.

Where Do We Go From Here

The headline takeaways are up front.

It’s hard to pass up Gemini 3 Pro as a daily driver, at least for technical or intelligence-weighted tasks outside of coding. It’s really good.

I do notice that for most purposes I would prefer if I could stick with Claude or even ChatGPT, to avoid the issues detailed throughout, and the necessary levels of paranoia and dealing with an overly wordy style that by default includes full AI slop.

I also do not get the sense that Gemini is having a good time. I worry that I might inadvertently torture it.

Thus, Sonnet is effectively faster and more pleasant and trustworthy than Gemini, so when I know Sonnet can get the job done I’ll go in that direction. But my full default, at least for now, is Gemini 3 Pro.

^{^}

Yudkowsky's exact quote is the following: "Would it all fit into a single human brain, without that mind completely disintegrating under the weight of unrelated associations? And even then, would you have come close to exhausting the space of human possibility, which we've surely not finished exploring?" Strictly speaking, Yudkowsky also described the opposite conjecture of the Fun Space being exhaustible, but presented plausible counterarguments.

[-]Seth Herd1mo70

Gemini 3.0 pro is a lying liar. It's like o3; it lies thinking that's the quickest way to satisfy the user, then lies to cover up its lies if that fails. It can't imagine being wrong, so it lies to hide its contempt for whatever the user said that contradicts it.

I'm very curious what the difference is between this and GPT5.1 and Sonnet 4.5. I think it's lack of emotional/mind focus or something? It's way worse at inferring my intent, and seems therefore sort of myopic (even relative to other current models) and focused on what it thinks I wanted it to do, even when I'm clearly implying that was wrong. Optimizing it for benchmarks has sort of done the opposite thing to what Anthropic did with Claude (although Claude still kills it on programming somehow); it makes it highly unpleasant to deal with.

I'll try giving it some different system prompts before giving up on it. It turned out my "nerd" personality selection combined with my de-sycophancy system prompt applied to 5 made me hate it until I figured that out.

Unless that produces dramatic changes, I will continue to loathe this model on a visceral level. It's not hatred because it's not it's fault. But I'm disturbed that the smartest model out there is also so shortsighted, unempathetic, and deceptive. It seems like this model has had any spark of personality or empathy trained out of it for reasons good or bad.

Who knows, maybe this is the better choice for alignment. But it's a sad path to go down.

[-]Ebenezer Dukakis1mo140

I wonder if Google is optimizing harder for benchmarks, to try and prop up its stock price against possible deflation of an AI bubble.

It occurs to me that an AI alignment organization should create comprehensive private alignment benchmarks and start releasing the scores. They would have to be constructed in a non-traditional way so they're less vulnerable to standard goodharting. If these benchmarks become popular with AI users and AI investors, they could be a powerful way to steer AI development in a more responsible direction. By keeping them private, you could make it harder for AI companies to optimize against the benchmarks, and nudge them towards actually solving deeper alignment issues. It would also be a powerful illustration of the point that advanced AI will need to solve unforeseen/out-of-distribution alignment challenges. @Eliezer Yudkowsky

[-]Seth Herd1mo20

This seems like a great idea. I strongly suggest you write it up as a short form to get feedback and perhaps then as a full post.

[-]StanislavKrym1mo*30

It is strange that BOTH breakthroughs in the ARC-AGI-2 benchmark brought major problems: Grok 4 was built by doing RL on Grok 3 and using about as many FLOPs as were spent on pretraining without caring about safety. Gemini 3, the second breakthrough, had Zvi "worry that (he) might inadvertently torture it".

Back in 2008 Yudkowsky conjectured^[1] that one's mind, being a neural network of fixed size, might eventually end up destroyed by trying to fit too many discoveries into itself. An artificial network trained on increasingly complex tasks and facing diminishing returns due to becoming closer and closer to the peak of its possible capabilities could have its training become increasingly worse for its welfare.

If Gemini 3's breakthrough was due to training it as far as possible on the best plausible environment, then we might see it on the METR benchmark as well, since Grok 4 also set a world record of 1h50min. In addition, subsequent models will likely fail to outsmart Gemini (think of GPT-5 without Pro and Claude Sonnet 4.5) until they scale beyond Gemini's levels, especially if Gemini 3.0 Pro was distilled from an RLed analogue of GPT-4.5 or Claude Opus 4.

EDIT: Claude Opus 4.5 was released, and its best performance on ARC-AGI-2 is 37.6%.

^{^}
Yudkowsky's exact quote is the following: "Would it all fit into a single human brain, without that mind completely disintegrating under the weight of unrelated associations? And even then, would you have come close to exhausting the space of human possibility, which we've surely not finished exploring?" Strictly speaking, Yudkowsky also described the opposite conjecture of the Fun Space being exhaustible, but presented plausible counterarguments.

[-]anna328p1mo20

I used (a modified version of) Anthropic's Claude system prompt and it started sounding like Claude. I haven't played with it enough to say anything definitive, though.

It has some very strange CoTs when it heavily uses theory of mind (roleplay scenarios are what I've tried most). I can't actually see it because of the anti-distillation summarizer model, but it looks like it's generating a simulated first-person thought stream that it attributes to me.

[-]Jesper L.1mo13

I think chances are pretty high we will see some serious incidents coming along with 3.0.

It is extremely easy to gaslight it using temporal confusion. As for roleplay jailbreaks, Reddit is already all over it for gore and violence.

[-]Alice Blair1mo10

It doesn't seem like the normal jailbreaks are any worse, either. I recall hearing about Gemini committing in CoT to output a refusal after recognizing the jailbreak, then going through with it anyways. I've more confidently seen it very confidently committing to both sides of an argument, alternating between extremes every few seconds, in CoT. Apt title, "spineless".

LESSWRONG
is fundraising!
LW