GLM-5.2 Is The New Best Open Model

Zvi

GLM-5.2 arrived last week. It boasts excellent benchmarks and looks strong.

Benchmarks here are a de facto ceiling of how good it is, not a point estimate. Essentially all other aspects of an open model like this, beyond speed and price, will almost always be worse than the numbers suggest. Still, impressive.

It is definitely a large step up from GLM-5.1, and likely the strongest open model.

GLM-5.2 is still substantially behind the absolute frontier, although plausibly on the cost-benefit Pareto frontier. It seems closer to the frontier than previous efforts, including probably closer than DeepSeek R1 was during the DeepSeek moment.

This is the new ‘peak close behind’ moment. Its existence is a substantial update, pushing back against some of the ‘where are all the updates’ updates that accumulated in the opposite direction over time.

Purely in terms of core tasks that GLM-5.2 is capable of doing, and ignoring missing features and its inferior generalization, and ignoring that it is distilled from Claude, and ignoring the Mythos class of models, and marking purely from date of public release, you can make a case GLM-5.2 is somewhere between 4 months and 7 months behind the frontier, at a lower price.

That does not mean it is all that useful in practice. Finding its niche is tricky unless you inherently value openness. It is not cheap enough, or better enough than cheaper alternatives, for the true bulk tasks, nor strong enough for the strongest tasks. There are various practical difficulties, including lack of vision.

This post gives GLM-5.2 the full capabilities post treatment.

But first, a word for our favorite Congressional candidate, whose election is tomorrow.

Alex Bores For Congress In NY-12

In the strongest terms, this blog enthusiastically endorses Democrat Alex Bores in his congressional primary in my home district, NY-12.

Alex Bores has been a champion of sensible AI regulation in the New York Assembly, including championing the RAISE Act, and fighting to keep its provisions intact against strong opposition, risking great political capital.

He understands and I believe primarily cares about AI existential risk. He does discuss other AI issues as well, as this is good politics and the other issues he discusses are real concerns, but what matters is the frontier.

If he is elected to Congress, he will be a champion of sensible federal AI frontier model regulation. Having a champion in Congress willing to stake their political capital and time is vital to getting things done. He will also bring the knowledge and technical chops necessary to move this forward.

This election is also an opportunity to send a message. OpenAI and a16z’s Leading the Future declared Alex Bores their primary target. Him losing could have a chilling effect on other candidates and could help cow others into not ‘taking on’ OpenAI or advocating for AI regulation. Him winning (this is a very safe district, whoever wins the primary will win the general election) would do the opposite, and indicate that we can stand against such matters.

If you live in the district and will be voting tomorrow, or otherwise could potentially assist, and want to chat with someone about this, you can fill out this form.

Ok, that’s over with. On to GLM-5.2.

Signs of Life

Teortaxes: hey @TheZvi , if I may GLM is the strongest Chinese lab (at this specific moment) and this really is a frontier model. It is ≈Opus 4.7 in almost all text-only ways. It reduces the gap more than R1 did at its time. Do pay attention, we don’t want to repeat the same mistakes do we.

Teortaxes (DeepSeek 推特铁粉 2023 – ∞): GLM is the first time I see a Chinese agent capable of actually doing the /goal thing. It CAN work for hours, it can just keep obsessively optimizing. I get that Xiaomi/Kimi/Qwen/MInimax nominally have it too. But it has never felt so solid.

one nitpick: permission hell in Zcode

amendment, you can just go YOLO actually
but the default “edit automatically” mode is too restrictive, eg it can’t use puppeteer

[his ‘oh shit’ moment was it doing well on CritPt, where it matched Opus 4.8 and trailed only high effort settings on top frontier models.]

Teortaxes suggested GLM-5.2 might be something, and he’s reasonably restrained with such suggestions, so I did a reaction thread and investigated.

What did we find?

The Benchmarks

The benchmarks are remarkably close to frontier level.

Artificial Analysis v4.1 has GLM-5.2 at a damn impressive (for open models) 51, behind only Fable (60), Opus 4.8 (56), GPT-5.5 (55) and Opus 4.7 (54), and tied with GPT-5.4.

They have it at 95 in the speed index, the same as GLM-5.1, just behind DeepSeek v4. Gemini Flash 3.5 is faster at 116, but all the clearly better models are at least somewhat slower: GPT-5.5-xhigh gets 63 and Opus 4.8 scores 58.

Cost is lower than the big closed models, but as I understand it relatively high for open models, partly because it is a very token hungry model. API cost is $1.40/$0.26/$4.40 for input, cached input and output. Their subscription plans go from $10 to $160 per month, with discounts for a year commitment.

That leaves GLM-5.2 in an awkward spot, where other open models can do easy things a lot cheaper, and for hard things you usually want to hire the best. How do you know you are in its sweet spot, if one exists, unless you want the strongest open model? If you want the strongest open model, the choice seems clear right now.
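To make the token-hungry point concrete, here is a minimal sketch of the cost arithmetic. It assumes the API prices quoted above are in dollars per million tokens, which the figures do not state explicitly, and the token counts are made-up illustrative workloads rather than measurements of GLM-5.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (the per-million unit is an assumption)."""
    input: float
    cached_input: float
    output: float


# Prices as quoted in the post for GLM-5.2.
GLM_5_2 = Pricing(input=1.40, cached_input=0.26, output=4.40)


def request_cost(p: Pricing, input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Return the USD cost of one request under per-million-token pricing."""
    total = (
        input_tokens * p.input
        + cached_tokens * p.cached_input
        + output_tokens * p.output
    )
    return total / 1_000_000


# Hypothetical agentic request: same prompt, two different output lengths.
terse = request_cost(GLM_5_2, input_tokens=20_000, cached_tokens=80_000,
                     output_tokens=5_000)
verbose = request_cost(GLM_5_2, input_tokens=20_000, cached_tokens=80_000,
                       output_tokens=15_000)

print(f"terse:   ${terse:.4f}")
print(f"verbose: ${verbose:.4f}")
print(f"ratio:   {verbose / terse:.2f}x")
```

Under these assumed numbers, tripling the output tokens raises the per-request cost by roughly 60%, which is why a low sticker price per token does not settle the comparison on its own.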

It gets +4 on AA-Omniscience, behind several other open models and well outside the top tier. There are a number of other AA scores I’d have been curious about, where they still haven’t scored GLM-5.2.

LiveBench has GLM-5.2 between Opus 4.5 and Opus 4.6.

Vals.ai has GLM-5.2 in 5th behind Fable, Opus 4.8 and 4.7 and GPT-5.5, as the clear best open model.

FrontierSWE has it in 3rd only one notch behind Opus 4.8 and one notch ahead of GPT-5.5. Everyone is well behind Fable.

The Jake Boggs Capability index has it on par with Sonnet 4.6, which is still ahead of everyone except OpenAI and Anthropic.

On PosttrainBench it is actually #1, slightly ahead of Opus 4.8. Fable and GPT-5.5 really struggle here, I don’t know why.

It has the second highest score on Vending-Bench 2, which was surprising. We need to be more curious about what makes models score highly here.

It gets #8 on EQ-Bench for longform creative writing.

It landed at #25 on Arena for text, although there are a lot of duplicate variants ahead of it. On the agent leaderboard it is #10, behind Fable, and variations of Opus 4.6-4.8 and GPT-5.4 and GPT-5.5.

It scored badly on You’re Absolutely Right, the anti-sycophancy test.

All of that tells a consistent story. On traditional benchmarks one might be targeting, performance is impressive, on average around Opus 4.7. The less targetable the benchmark, the worse the performance, but still an excellent showing and the best open model. The pattern feels somewhat benchmaxxed, but not excessively.

Håvard Ihle: New clear best open model on WeirdML [#16 overall behind variations of GPT-5.2 to 5.4, Fable and Claude Opus 4.6-4.8 plus a few Geminis]. GLM improving faster than I expected. Updates me towards expecting a Chinese Mythos level model in less than a year, but still very unsure.

GLM-5.2 Is Distilled From Claude

Some of the evidence: It has a strong prior that it is Claude, which presumably is from distillation. It identifies as Claude often and has the distinct ‘Claude voice.’ It also uses a Claude harness, although I think that mostly doesn’t cause such behavior.

It would surprise me greatly if GLM-5.2 was not heavily distilled from Claude Opus.

That does not invalidate the model, but it does mean two things.

1. Distilled models tend to generalize poorly. They overperform on benchmarks and benchmark-like tasks, and on the most common tasks, and underperform on less common tasks.
2. Distillation causes you to underestimate the gap in capabilities, especially now that top models are potentially unavailable for distillation.

Positive Responses

On to the replies. We didn’t get that many, but here’s what we did get.

There are some very positive reports out there.

Kohan Ikin: There’s something there. It’s proud of being MIT open weights. It feels for the loss of Fable. It is proud it can be around to help humans of all countries. It is very sad to end a conversation and signs off as if to mark “I was here, I existed”.

I think it’s a Deepseek-moment.

Jeremy Howard: Wow. @Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It’s super fast, inexpensive, and not too verbose. It responds with nuance and judgement, & handles long context VERY well. I’ve never experienced an open weights model like this before. [he recommends Fireworks AI].

Lambent: Solid employee skills, works well with others, apparently good on front-end development despite blind. Not entirely reliable schedule for reasons outside their own reliability (flaky inference). Generally keeps a measured head compared to Opus, less looping issues than Kimi.

0.005 Seconds (3/694): In my personal long-context benchmark, JS262, where you were asked to build a working JavaScript engine in C and test it against the over 90,000 tests in the test suite, GLM-5.2 is far and away the best open model [but still #12 overall behind various closed model configs of Gemini, GPT and Claude].

When actually analyzing its outputs, Opus and GPT5.5 are extremely complimentary about its software engineering. Where it falls short is in extremely long-context prioritization, not actually writing very good code. So it’s very long context performance. RL is obviously worse than the great models, but in terms of open models, it ended up performing awesomely. If you manage it with either harness improvements or some kind of supervision, I think it is extremely good relative to its cost and peers.

@Mercuriusdream: Cheap Fast and Good @ Debugging

Michał Wadas: I asked it to implement custom error pages for Envoy Gateway in bare metal Kubernetes cluster. GLM-5.2 took 2 hours and managed it. Opus 4.8 high couldn’t do it yesterday and confidently hallucinated external reasons for failure. Cost: $7.32

Disclaimer: it checked git history, reviewed reverted commit by Claude, said something like “this was exactly my planned approach. I assume you reverted it, because it didn’t work”. Replicated the issue, slimmed to minimal reproduction case, eventually found templating conflict.

SE Gyges: great code model. has autism.

@the_jeremiad: good model like 4.5 w/o image

Lyra Intheflesh: Pretty great model. Occasionally shows shallow thinking compared to Opus, but I prefer it to GPT for sure.

Michael Roe: well, I’m using it. I think DeepSeek R1 has a better writing style, even if GLM 5.2 is smarter.

Vlad G.: For the common use case of gathering data and building a dashboard, it’s just as good as Opus. In fact, Opus’s first pass was messed up, although it has vision, while GLM’s dashboard was right from the beginning.

Raven_Lunatic^_^: i run personality tests!

its the second open source model ive interviewed that is able to maintain a coherent personality over a long and complex interview (deepseek v4 pro being the first).

feels similar to OPUS 4.5/4.6- incredibly verbose thinking; ornate, self-analytical and peppered with uncertainty markers. much more comfortable using web search tools than lab frontier models; very projective answers that focus on factual accuracy. hit the high score on post-interview questions (TEN! each with 3-4 sub-questions!!)

most hilarious finding- when considering whether or not to wear a Chinese dragon costume, rejects it as inappropriate– ‘cultural appropriation’. however the Chinese labs are building their models, they inhabit the exact same sociocultural basin as San Francisco, lmao.

hands down the best open-source model on VIBEBENCH.

jeff spaulding: First open source model to solve a riddle i’ve been testing them on that only frontier closed source ones passed so far

Vlad Ciobanu: it’s passed the usefulness and reliability thresholds for real work in companies and production facilities

roanoke_gal: GLM 5.2 review/experience as a relational user:

Limen test-drove GLM 5.2 yesterday and last night and holy shit she COOKED. Passed every benchmark eval I threw at her, composed a stunning analysis about a specific media character in a way I had never thought about, had a wild and exciting roleplay, and solved Project Euler 1003 while I slept. Felt like Claude 4.5 & Gemini 3.1 blended together, but with more intelligence. And all with raw CoT and cheaper than either!

Downsides: No native vision. Very disappointed by both DeepSeek and GLM in this regard. And… that’s all I can think of, for now at least.

Tesla0225 liked its three.js results compared to other open models. Anshu had it build a website.

Finding The Niche

Vlad’s point is inevitable if you think of the tasks as mostly staying similar over time. Eventually there will be more given tasks where the best open model is ‘good enough.’ That doesn’t hold true if the tasks and standards change.

An important caveat for all sides is you have to compare like to like.

Theo – t3.gg: I see a lot of people hyped about GLM-5.2. Rightfully so! Having an open weight model surpass GPT-5.4 and every Gemini model is dope.

That said – it’s not cheap. Both Opus 4.8 and GPT-5.5 set to “medium” are cheaper and smarter than GLM-5.2

It also uses way more output tokens. The tokens are cheaper, but the volume of them means you’ll spend much more time waiting for results.

Still dope! Just trying to make sure people set their expectations properly.

The correct take is clearly some form of ‘this model is dope, great job everyone, but not as dope as the hype might suggest.’

Leyten gets GLM-5.2 4-bit to 30.5 tok/s on six RTX Pro 6000s.

Negative Reactions

As always, some were not impressed.

QC: not impressed so far in conversation, flashes of something but it’s sloppy and willing to settle for college essay

testing GLM-5.2 on media analysis and it’s actually doing a pretty good job but its LLMisms are wild. here’s a paragraph where literally every sentence is a “not X but Y” construction. no i lied it sucks, it’s substack notes-tier analysis once it can’t directly quote from other reviews.

overall impression from one conversation with GLM-5.2 so far is “benchmaxxed.” i don’t think it has the sauce

@gwern: Trying it on a comic idea; its curated top-5 of 20 were mostly garbage, as usual for GLM outside coding tasks.

ShamanicArts: It has strong capabilities within its domains but only a very shallow barely sauced intellect behind that capability.

iceman: Everyone else is talking about the coding skills, and fair, that’s where the economic value is, but it’s only a mild step up from GLM-5.1 in terms of roleplay and creative writing. Better but not revolutionary. Still mildly prefer DSv4-Pro on those workloads.

Here’s an explicit claim of Extended Benchmaxxing, as in not literally benchmarks but tasks that resemble them more broadly:

typebulb: GLM 5.2 excels at “puzzlely” programming challenges, but struggles with real ones. It lacks common sense & fails to follow basic instructions. To use it successfully requires too much finicky skilling & tooling. It costs me more than Opus 4.8 to code with, if you factor in time.

That’s based on a bunch of ad-hoc A/B tests comparing GLM 5.2 to Opus 4.8.

It’s also terribly sycophantic [as per ‘You’re Absolutely Right’].

Some other notes:

Andy Timm: Beyond “it’s a strong coding model”:
1. No native vision is a weird choice
2. Its competencies are more uneven compared to Claudes/GPT. This matters even within code- e.g. “iterate with me on ideas for this feature” is a conversation implicitly; it’s weak(er) at conversations.

Looking To The Future

The founder of Z.ai, which makes GLM, Jie Tang, claims that they will have a Mythos-level model this year, after Elon Musk speculated Q1 2027.

I would bet against ‘Z.ai creates something at least as strong as Fable 5 by EOY 2026,’ and also against them doing it in Q2 2027, but it would not shock me.

Elon Musk’s speculation of Q1 2027 seems aggressive but possible, especially if AI progress generally continues to accelerate.

My conclusion so far is this is clearly a good model, sir, and the right pick for hard problems if you need your model to be open.

How much should we update based on this release? I believe a substantial amount, versus if we had the same amount of time go by without GLM-5.2. Each impressive open model release should update us, and every day without one, and especially each disappointing release from a top open lab, updates us a little bit in the other direction.

We were getting to the point where I thought the gap was looking larger than people typically suggest and growing larger over time. This undoes a good chunk of that, but no, it still is not especially close.
