Victor Taelin's notes on Gemini 3

by Gunnar_Zarncke
18th Nov 2025
Linkpost from x.com
4 min read
Tags: Computer Science, Language Models (LLMs), AI, Personal Blog
Comments

Vladimir_Nesov:

"every other model creates a 5x larger term... that doesn't work"

I saw the same with my Fibonacci double-quine test, which Gemini 3 Pro is the first model to solve (Sonnet 3.5 was the first to solve the previous tier of this problem last year):

Write a program in python that by default prints a number from the Fibonacci sequence, while on argument --next it prints a program in Java, which by default prints the next number from the sequence, while on --next it prints a program in python with yet another number by default, and so on, switching between python and Java while advancing the Fibonacci sequence. Don't use a class in python, use class Main in Java, avoid comments inside the code.

Gemini 3 Pro:

import sys;x=0;y=1;J="public class Main{public static void main(String[]a){int x=%d,y=%d;String P=\"%s\",J=\"%s\";if(a.length>0&&a[0].equals(\"--next\"))System.out.printf(P,y,x+y,J.replace(\"\\\\\",\"\\\\\\\\\").replace(\"\\\"\",\"\\\\\\\"\"),P.replace(\"\\\\\",\"\\\\\\\\\").replace(\"\\\"\",\"\\\\\\\"\"));else System.out.println(x);}}";P="import sys;x=%d;y=%d;J=\"%s\";P=\"%s\";print(J%%(y,x+y,P.replace('\\\\','\\\\\\\\').replace('\"','\\\\\"'),J.replace('\\\\','\\\\\\\\').replace('\"','\\\\\"')))if len(sys.argv)>1 and sys.argv[1]=='--next'else print(x)";print(J%(y,x+y,P.replace('\\','\\\\').replace('"','\\"'),J.replace('\\','\\\\').replace('"','\\"')))if len(sys.argv)>1 and sys.argv[1]=='--next'else print(x)
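
A quick way to check an answer like this end to end is a small harness that follows the python/Java chain and compares each program's default output against the Fibonacci sequence. This is a minimal sketch, assuming python3 and a JDK (javac/java) on PATH, with the model's program saved as quine.py; the file names and helper are illustrative:

import subprocess, sys, tempfile
from pathlib import Path

def run(cmd, cwd):
    """Run a command in cwd and return its stdout as text."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True, check=True).stdout

def check_chain(python_src, steps=6):
    """Follow the python -> Java -> python chain, checking each default output."""
    fib = [0, 1]
    while len(fib) < steps:
        fib.append(fib[-1] + fib[-2])
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp = Path(tmpdir)
        src, lang = python_src, "python"
        for i in range(steps):
            if lang == "python":
                (tmp / "quine.py").write_text(src)
                printed = run([sys.executable, "quine.py"], tmp)
                nxt = run([sys.executable, "quine.py", "--next"], tmp)
            else:
                (tmp / "Main.java").write_text(src)
                run(["javac", "Main.java"], tmp)
                printed = run(["java", "-cp", ".", "Main"], tmp)
                nxt = run(["java", "-cp", ".", "Main", "--next"], tmp)
            assert int(printed.strip()) == fib[i], (i, printed)
            src, lang = nxt, ("java" if lang == "python" else "python")
    print(f"OK: first {steps} Fibonacci numbers reproduced across the chain")

if __name__ == "__main__":
    check_chain(Path("quine.py").read_text())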

Victor Taelin of Higher Order Company keeps a set of some of the hardest computer science problems that LLMs have most likely never seen before, and he has evaluated Gemini 3 on them. Here is his tweet, reproduced almost in full.

Short Version

First of all: you've all seen the benchmarks, so I don't think you need me to judge this one. Still, based on my tests, this is as real as it gets, and I want to talk about it. This model outperforms GPT-5 Pro, Gemini 2.5 Deep Think, and everything else, on my hardest problems, by far.

It is the new SOTA at:
→ debugging complex compiler bugs
→ refactoring files without logical mistakes
→ solving difficult λ-calculus problems
→ ASCII art (it is almost decent now!)
→ Competitive Gen 3 OU (won't elaborate 😭)

It is still an LLM, though. It has similar failure modes, and is worse than Sonnet / GPT-5 in some scenarios.

It seems very bad at:
→ inferring intent
→ not going overboard
→ one-shot vibe coding
→ creative writing
→ health questions

Also, I suspect this checkpoint isn't the best Google has.

Now, on to a complete, manually typed Gemini 3 overview.


Long Version

1. Vibe Test: λ-Calculus Programming

I like this vibe test because it is cheat-proof: as soon as a model solves my hardest problem, I just create a harder one, moving the goalpost a little bit.

Current track record:
→ n-tuple-map: solved by Grok 3 (Feb 2025)
→ n-tuple-fold: solved by GPT-5 (Aug 2025)
→ n-tuple-rotl: solved by Gemini 3 (today!)

Implementing rotl is impressive, because:
→ no other model gets even close to solving it
→ Gemini's solution is 2x simpler than my own
→ Gemini 3 solved it 5x faster than I did

In fact, its solution is so short that it fits on a single line:

λn.λt.λk.(tλx.(nλg.λh.(h(gλq.λr.λs.(q(r s))))λu.λf.(f x)λu.u k))

Pretty, isn't it? This minuscule function is a generic rotator for λ-encoded tuples. For comparison, every other model creates a 5x larger term... that doesn't work. Nothing else creates anything like that, at all. That answer was a jaw-dropping moment for me.

I'll post a Gist with the prompt, and a runnable solution.
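
In the meantime, here is a direct transcription of the term into Python lambdas, so it can be run and checked. The encoding is an assumption of mine, not Taelin's Gist: n is the Church numeral for the tuple's arity, and an n-tuple is λ-encoded as λk.(k a1 ... an); the helper names are illustrative.

# Church numeral 3 (the tuple's arity) and a λ-encoded 3-tuple: λk.(k a b c)
C3 = lambda f: lambda x: f(f(f(x)))
tup3 = lambda a, b, c: (lambda k: k(a)(b)(c))

# Gemini 3's term, transcribed one-to-one into curried Python lambdas
rotl = (lambda n: lambda t: lambda k:
    t(lambda x:
        n(lambda g: lambda h: h(g(lambda q: lambda r: lambda s: q(r(s)))))
         (lambda u: lambda f: f(x))
         (lambda u: u)
         (k)))

# Read the rotated tuple back into a native Python tuple
to_py = lambda x: lambda y: lambda z: (x, y, z)
print(rotl(C3)(tup3(1, 2, 3))(to_py))   # -> (2, 3, 1), a left rotation of (1, 2, 3)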

2. Real-World Debugging

A few weeks ago, I was stuck in a tough HVM4 bug. I posted about it here, lamenting that no AI could solve it. Well, turns out, an AI did solve it. Someone managed to run my prompt on an old checkpoint via LMSys, and it nailed the issue: a stack underflow on a specific optimization. The solution was simple: replace `WNF_SPOS < 0` by `WNF_SPOS < spos`.
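
The general shape of that class of bug is easy to sketch: a loop that pops its work stack down to position 0 (the absolute bottom) instead of down to the position saved when the current call started, clobbering the caller's frames. A toy Python illustration, not HVM4 code, with made-up names:

# Toy illustration only: popping down to 0 instead of down to the position
# saved on entry eats the caller's stack frames (an underflow relative to the
# caller), which is the shape of the `< 0` vs `< spos` fix.
def reduce_frame(stack, work):
    spos = len(stack)            # stack position saved when this call starts
    stack.extend(work)           # push this frame's own work items
    while len(stack) > spos:     # buggy variant: `while len(stack) > 0`,
        stack.pop()              #   which would also pop the caller's frames
    return stack

caller = ["caller-item-1", "caller-item-2"]
print(reduce_frame(caller, ["a", "b", "c"]))   # -> the caller's items survive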

This was the first time an AI solved a real problem before I could solve it myself, and that let me move on to the next thing, leading to great progress.

I have now tested Gemini 3 on many old debugging prompts, and it solved half of them. Bugs that took me hours are now solved by a computer in a minute!

Now, and sadly: I've sent the stack underflow issue to the Gemini 3 model that launched today, and it failed to spot the culprit. So, either this model is not the smartest variant Google has, or I was really lucky that day. ):

3. Real-World Coding / Refactoring

This model is great at writing large files with no mistakes.

For example, I've asked some AIs to rewrite HVM4.hs with a few core changes. I then inspected the trickiest functions. As you'd expect, all models failed on important details that led to serious bugs - except for Gemini 3, which nailed it.

This is particularly encouraging to me, since models struggled with niche restrictions like linearity, making them unusable for core HVM functions. Gemini 3 seems extremely competent at it.

(Code shown in an image attached to the original tweet.)

4. Problems and Quirks

I've asked Gemini 3 to one-shot a web app, and it did poorly. GPT-5.1 (high) gave me a complete file, with everything I asked for and zero bugs. Gemini 3 gave me a 3x smaller file, with tons of missing features and many bugs. I couldn't get Gemini 3 to one-shot a file nearly as good as the one from GPT-5.1.

I've also sent it my pre-meningitis prompt, and it completely discarded meningitis (💀). Not sure I'd trust it for health issues.

I've asked it to write a story, and it was bad. No excuses.

It will often go overboard and write full files instead of just patching holes.

It seems slower than even GPT-5 (high) on Gemini CLI, although it is much faster when called directly?

Lastly, and funnily, it struggles to produce incorrect code when the context calls for it. For example, if your type has a generic "Show" instance and you ask it to build a custom one, it will often struggle, because that would technically be an error. It can't grasp that you'd just remove the old one afterwards!