I've been particularly impressed by 3.1 Pro's ability to do math problems. I have three problems, in increasing order of difficulty, that I like to pose to AIs (all requiring, or greatly aided by, postgraduate-level knowledge of mathematics).
Gemini 3.1 Pro and Opus 4.6 are the first models that could solve the first one, or even come close to a correct solution. Opus was unnecessarily verbose and leaned on advanced mathematical jargon, while Gemini gave a much simpler, far more readable solution.
The second problem was eventually solved by Opus after a couple of false claims and some strong hints (and the final solution still had some inaccuracies), but Gemini breezed through it and gave a solution that was both more general and more elegant than the one I came up with. The problem and a solution sketch may exist in the training data as a single Reddit comment, but that didn't seem to help Opus, and Gemini's solution appears to be novel.
The third problem takes a long time to solve and requires several indirect steps. I suspect asking an LLM to one-shot the solution is simply the wrong format, and that a more Ralph-loop style approach might be appropriate. Opus was hopeless; I couldn't get it to reason well about the problem even with direct hints. Gemini got the first important insight but then got lost, though some strong hinting about what to look for eventually led it to a correct solution.
I've noticed that while Opus' mathematical output is often vague, filled with jargon, and difficult to understand, 3.1 Pro is much easier to read and seems to prefer direct, elementary techniques over appeals to advanced theorems. Even when it is wrong, it is quite easy to see where the specific incorrect step is. That makes the output potentially much more useful overall. I could see it legitimately helping with advanced mathematical work.
I can add that Gemini 3.1 Pro did succeed in solving a hard math problem which stumped Gemini 3 Pro and all other models tested in the Science Bench.
I’ve been trying to find a slot for this one for a while. I am thrilled that today had sufficiently little news that I am comfortable posting this.
Gemini 3.1 scores very well on benchmarks, but most of us had the same reaction after briefly trying it: “It’s a Gemini model.”
And that was that, given our alternatives. But it’s got its charms.
Consider this a nice little, highly skippable break.
The Pitch
It’s a good model, sir. That’s the pitch.
Jeff Dean also highlighted ARC-AGI-2 along with some cool animations, an urban planning sim, some heat transfer analysis and the general benchmarks.
On Your Marks
Google presents a good standard set of benchmarks, not holding back the ones where Opus 4.6 comes out on top. I tip my cap for the quick turnaround incorporating Sonnet 4.6.
The highlight is ARC.
The highlight here is covering up Claude Opus 4.6, which is in the mid-60s for a cost modestly above Gemini 3.1 Pro.
Gemini 3.1 Pro overall looks modestly better on these evals than Opus 4.6.
The official announcement doesn’t give us much else. Here’s a model. Good scores.
The model card is thin, but offers modestly more to go on.
Their mundane safety numbers are a wash versus Gemini 3 Pro.
Their frontier safety framework tests were run, but we don't get details. All we get is a quick summary that mostly says 'nothing to see here.' The model reaches several 'alert' thresholds that Gemini 3 Pro already reached, but no new ones. For Machine Learning R&D and Misalignment they report gains versus 3 Pro and some impressive results (without giving us details), but say the model is too inconsistent to qualify.
It’s good to know they did run their tests, and that they offer us at least this brief summary of the results. It’s way better than nothing. I still consider it rather unacceptable, and see it as setting a very poor precedent. Gemini 3.1 is a true candidate for a frontier model, and they’re giving us quick summaries at best.
Other People’s Benchmarks
A few of the benchmarks I typically check don’t seem to have tested 3.1 Pro. Weird. But we still have a solid set to look at.
Artificial Analysis has Gemini 3.1 Pro in the lead by a full three points.
CAIS AI Dashboard has 3.1 Pro way ahead on text capabilities and overall.
Gemini 3.1 Pro dominates Voxelbench at 1725 versus 1531 for GPT-5.2 and 1492 for Claude Opus 4.6.
LiveBench has it at 79.93, in the lead by 3.6 points over Claude Opus 4.6.
LiveCodeBench Pro has Gemini dominating, but the competition (Opus and Codex) aren’t really there.
Clay Schubiner has it on top, although not on coding; the edge over second-place Claude Opus 4.6 comes from ‘Analytical%’ and ‘Visual%.’
Mercor has Gemini 3.1 Pro as the new leader in APEX-Agents.
Brokk power rankings have Gemini 3.1 Pro in the A tier with GPT-5.2 and Qwen 3.5 27b, behind only Gemini Flash. Opus is in the B tier.
Gemini 3.1 Pro is at the top of ZeroBench.
It’s slightly behind on Mercor, with GPT-5.2-xHigh in front. Opus is in third.
Gemini 3 DeepThink V2
Gemini 3 Deep Think got a major upgrade to V2 a little bit before Gemini 3.1 Pro arrived.
It turns out to be a runtime configuration of Gemini 3.1 Pro, which explains how the benchmarks were able to make such large jumps.
Those are some pretty powerful benchmark results. Let’s check out the safety results.
What do you mean there are no safety results, we asked at first?
After they corrected their initial statement, Google’s position is that they don’t technically see the increased capability of V2 as imposing Frontier Safety Framework (FSF) requirements, but that they did indeed run additional safety testing which they will share with us shortly.
I am happy we will get this testing, but I find the attempt to say it is not required, and the delay in sharing it, unacceptable. We need to be praising Anthropic and also OpenAI for doing better, even if they in some ways fell short, and sharply criticizing Google for giving us actual nothing at time of release.
It was interesting to see reactions like this one, back when we believed that V2 was based on 3.0 with a runtime configuration and superior scaffolding, rather than on 3.1.
The jump in some benchmarks for DeepThink V2 is very large, so in retrospect it makes more sense that it is based on 3.1.
When I thought the difference was only the scaffold, I wrote:
That seems right for a scaffold-only upgrade with improvements of this magnitude.
The V2 results look impressive, but most of the gains were (I think?) captured by 3.1 Pro without invoking V2. It’s hard to tell because they show different benchmarks for V2 versus 3.1. The frontier safety reports say that once you take the added cost of V2 into account, it doesn’t look more dangerous than the 3.1 baseline.
That suggests that V2 is only the right move when you need its ‘particular set of skills,’ and for most queries it won’t help you much.
Positive Feedback
It does seem good at visual presentation, which the official pitches emphasized.
This is at least reporting improvement.
Ask no questions, there’s coding to do.
Similar reports here for noncoding tasks. A vast intelligence with not much else.
I never know whether to be impressed by UI generation. What, like it’s hard?
Negative Feedback
The most basic negative feedback is when Miles Brundage cancels Google AI Ultra. I do have Ultra, but I would definitely not have it if I weren't writing about AI full time; I almost never use it.
One form of negative feedback is no feedback at all, or saying it isn’t ready yet, whether that means the model not being ready or the rollout being botched.
Or the perennial favorite, the meh.
Chong-U is underwhelmed by a test simulation of the solar system.
Andres Rosa: Inobedient and shameless, like its forebear.
Try Gemini Lite
Gemini 3.1 Flash-Lite is now also available.
They’re claiming it can outperform Gemini 2.5 Flash on many tasks.
My Chrome extension uses Flash-Lite, actually, for pure speed, so this might end up being the one I use the most. I probably won’t notice much difference for my purposes, since I ask for very basic things.
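If you want to wire up something similar, here is a minimal sketch (my own, not from any official example) of what calling a Flash-Lite model from an extension's background script might look like, using the @google/genai JavaScript SDK. The model id string for 3.1 Flash-Lite is an assumption on my part; substitute whatever id Google actually publishes.

```ts
// Minimal sketch: one fast, low-latency call to a Flash-Lite model from a
// Chrome extension's background script, via the @google/genai SDK.
// The model id "gemini-3.1-flash-lite" is assumed, not confirmed.
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: "YOUR_GEMINI_API_KEY" });

// Summarize whatever text the extension grabbed from the page.
export async function summarize(pageText: string): Promise<string> {
  const response = await ai.models.generateContent({
    model: "gemini-3.1-flash-lite", // assumed model id
    contents: `Summarize this in two sentences:\n\n${pageText}`,
  });
  return response.text ?? "";
}
```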
And that’s basically a wrap. Gemini 3.1 Pro exists. Occasionally maybe use it?