FWIW, I've played around a bunch with gpt-oss (both versions) and my initial reaction has been "wow, this is really bad. Like, almost Llama 4 levels of bad."
Yes, it looks good on the system card, the benchmark scores seem impressive... but that was true of Llama 4 too. And in both cases, when I actually tried out the model, I quickly discovered that it was janky and unreliable to the point of being basically useless.
The lack of world knowledge is very real and very noticeable. gpt-oss feels less like "an open-weights o4-mini" and more like "the minimal set of narrow knowledge/skills necessary to let a model match o4-mini on the usual benchmarks, with virtually every other capability degraded to a level far below the current SOTA/frontier, in some cases to a level that hasn't been SOTA since the pre-GPT-3 days."
And not only is it very ignorant, it's ignorant about its own ignorance, leading to those high hallucination rates mentioned by various commentators. You simply can't trust anything this model says, unless you are literally asking a question from a benchmark like GPQA. (Or possibly if you're asking a new question that's "similar enough" to the ones on benchmarks, but how would you know what "similar enough" means?)
As a demo, at the end of this comment I've included answers to "Who is Zvi Mowshowitz?" from gpt-oss-120b and from Qwen3 235B A22B Thinking 2507. Neither is perfectly accurate, but the Qwen3 answer gets the broad strokes right and only confabulates in the details, whereas gpt-oss-120b seems merely aware that you're some sort of famous tabletop gamer, and invents a whole different guy fitting that vague description.
The models also have various other weird and/or annoying quirks:
As noted by others, gpt-oss tends to over-refuse and sometimes confabulates implausible policy restrictions to justify its refusals, or invokes plausibly real policy restrictions but proceeds to "reason" about them in a confused / inconsistent / over-reaching manner.
For a long while now, every serious model has been fluently multilingual. But not gpt-oss, which was trained primarily on English (per the system card) and is reportedly terrible at generating German.
gpt-oss seems aggressively over-tuned for some specific set of setups/harnesses/use-cases (which have not been clearly documented), and exhibits bizarre behavior when placed "out of distribution."
The other day I joked that it was "the first non-language-model LLM," after observing that it produces gibberish or ChatGPT-like text when given an input that resembles a pretraining document and lacks the "Harmony" chat separators. Its output probabilities on such texts are garbage; if we ran it on the Pile val set or something, I expect that it would have a higher loss than any model ever previously benchmarked on that data.
Even when sticking to user-assistant chat with the Harmony separators, it's fairly brittle and can sometimes emit gibberish or other weirdness if you introduce some slight variation in the formatting, or if you expect it to continue a Harmony-formatted text that has been segmented into (prompt, response) at a position that isn't the one it "expects."
Among other things, I expect this means it will be difficult to effectively finetune in practice: lack of robustness to the "noise" induced by slightly-OOD inputs bodes poorly for its ability to cope with the noisy world of SGD training. And its broad/general capabilities have been so thoroughly deep-fried/scrambled by post-training (and/or quantization?) that I would expect SGD to have an unusually hard time bringing those capabilities back to the fore as needed.
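Here's a rough sketch of the kind of comparison I mean (using the Hugging Face weights via transformers; treat the model id and input file as placeholders, and note I haven't run this exact script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: compare mean per-token loss on a raw, pretraining-style
# document vs. the same text wrapped in the Harmony chat template.
model_id = "openai/gpt-oss-20b"  # placeholder; the 120b needs far more memory
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def mean_nll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF shifts labels internally
    return out.loss.item()            # mean cross-entropy per token, in nats

raw_doc = open("some_pile_style_document.txt").read()  # hypothetical input file
chat = tok.apply_chat_template(
    [{"role": "user", "content": raw_doc}],
    tokenize=False, add_generation_prompt=True,
)
print("raw document:", mean_nll(raw_doc))
print("Harmony-formatted:", mean_nll(chat))
```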
I'm skeptical of your idea that Chinese labs will find these models useful for distillation.
Taking Qwen as an example, they already have a (released, open-weights!) model that stands neck-to-neck with gpt-oss-120b on the benchmarks where gpt-oss-120b looks good, while also not being a min-maxed deep-fried mess on everything else. Sure, that model has ~2x as many params (and ~4x as many active params) as gpt-oss-120b, but... so what?
The difference is not (I think) that gpt-oss reaches some new height of ~deep intelligent reasoning~, it's that gpt-oss skimps on everything the usual reasoning benchmarks don't test. Why would Qwen get any value out of the sketchy, untrustworthy outputs from this benchmaxxed glitchfest, when they already have their own mature pipelines for distillation and for RL? Yeah, you can churn out the data faster, but that doesn't matter if you don't want it in the first place.
And the same goes for DeepSeek and others, I think.
One other thing -- skimming over the Claude and o3-pro chats you shared, I noticed several glaring errors. I realize you are not blindly trusting these models, but using their outputs more like "anecdata" aggregated alongside things people say on twitter and so on. But even then, if I were you I would be wary of using these models even as "anecdata" sources on this kind of topic going forward.
Examples (these are the ones I spotted at a glance, not necessarily the only ones present):
Claude: "This is huge - you get raw, unfiltered reasoning traces at scale. Compare this to Chinese models which often have some filtering or post-processing on their CoT outputs."
I don't know what Claude's talking about here. It seems to be conflating "access to raw CoT" (true for any open-weights model) with "lack of direct optimization pressure on CoT." And I don't know of any Chinese model for which this "filtering or post-processing" claim would make sense -- remember, the fact that R1 didn't do this was one of its most distinctive qualities!
Claude: "GPT-OSS-120b gets 90%+ of DeepSeek R1's performance at presumably 1/10th the parameter count (DeepSeek R1 is rumored to be 600B+)."
That's not a rumor, it's just true. The weights are open!
120B is not 10% of 600B. Generously, we could interpret this as referring to active params rather than total (which is probably more relevant anyway), in which case it's roughly accurate (5B vs 37B), but then why does Claude mention R1's total param count to support the claim? Likely confabulated, and at the very least misleading.
Claude also seems unaware of Qwen 3 (closer to gpt-oss-120b on the params/benchmarks frontier), of Kimi-k2 (similar active params to R1 with better benchmarks), and of the fact that it's already standard practice for Chinese labs to distill their own large reasoning models.
o3-pro: "Model card documents Flash‑Attention‑2 and Triton kernel tricks; those show up in other training stacks almost immediately."
Flash‑Attention‑2 is not new, and is already widely adopted.
The triton kernels in question have been open-source since (I think?) April, I remember hearing about them at the time. It took a little while for them to get integrated into popular open source frameworks, but e.g. sglang had them as of July 31 (see here, here).
o3-pro: "Use 120 B as teacher to generate tens of millions of Q–A pairs (Alpaca‑style) and train a 7 B–13 B student for $1 K–$10 K. Student models reach ≥ GPT‑3.5 quality with 10⁴× less compute; timeline pull‑in ≈ 1 year for low‑resource actors."
Haven't we had "GPT-3.5 quality" open models at that scale for a long while now?
Also, it's unclear if the CoT is supposed to be included in the data. And GPT-3.5 is a strange reference point in either case.
o3-pro: "U.S. export controls throttle China’s access to latest H100s, but distillation to 7 B–13 B lets them run on domestic Ascend 910B or RTX 4090 clusters. That sidesteps the bottleneck entirely."
Excuse me... what?? This line of speculation is totally irrelevant in a world where DeepSeek and Moonshot are serving 600B-1T models at scale on H800 clusters (o3-pro even links to an article about this as a "citation" here). And if we care about which chips they're still allowed to import, well, RTX4090s aren't allowed in China either!
EDIT: I forgot to mention this one, even though it's possibly the funniest. o3-pro says that gpt-oss "advanc[es] Chinese parity with GPT‑4‑class performance by ~6–9 months." Later it puts 55% probability on "Chinese labs match[ing] GPT‑4 on open evals by end‑2026." Like, dude, o3, they already did that. In 2024.
Here are the "Who is Zvi Mowshowitz?" samples, as promised.
These are what I got on my first try with each of the two models. Performance on this question is representative of what I've seen more generally. First, the gpt-oss-120b sample:
**Zvi Mowshowitz** is a well‑known figure in the modern tabletop‑gaming community, best known for his incisive, long‑form board‑game reviews and for a handful of original game designs.
---
### What he does
| Role | What it means | Where you’ll see it |
|------|---------------|---------------------|
| **Board‑game reviewer & columnist** | Writes deeply analytical, often several‑thousand‑word reviews that combine strategy analysis, design theory, and personal anecdotes. | Primarily on **BoardGameGeek** (his profile has thousands of reviews and a very high “author rating”), and on his personal blog **Zvi’s Blog**. |
| **Game designer** | Has designed or co‑designed a few published games, most notably **“Möb”** (a tile‑placement/area‑control game released by Rio Grande Games in 2019). | The games appear in hobby‑store catalogs and on sites like BoardGameGeek. |
| **Consultant & speaker** | Advises publishers, designers, and retailers on game design, market positioning, and community outreach. | Occasionally gives talks at conventions (e.g., GenCon, Spiel), and contributes essays to industry‑focused publications. |
| **Mathematician / programmer** | Holds a Ph.D. in **computer science / mathematics** (University of Texas at Austin) and has published research on probabilistic algorithms and combinatorial game theory. | His academic papers are indexed in IEEE/ACM libraries; the technical background informs his analytical style. |
---
### Why he’s notable
1. **Depth of analysis** – Zvi’s reviews go far beyond “fun factor.” He often breaks a game down into its core mechanics, explains optimal strategies, evaluates component quality, discusses the designer’s intent, and even places the game in a broader historical context. This level of detail is rare and has set a high standard for serious board‑game criticism.
2. **Influence on the community** – Because his reviews are long, well‑argued, and posted on the world’s biggest board‑game database (BGG), they are frequently cited by other reviewers, designers, and players. A positive (or negative) Zvi review can noticeably affect a game’s sales trajectory.
3. **Design contributions** – Though best known as a critic, Zvi has also put his own ideas into practice. *Möb* (2019) is praised for its elegant, abstract strategy and for showcasing a “pure” tile‑placement mechanic without theme‑overload, reflecting the design philosophies he espouses in his writing.
4. **Thought leadership** – Through blog posts and conference talks, Zvi frequently discusses broader topics such as *the economics of board‑game publishing, the role of community feedback, the importance of transparent rating systems,* and *the future of hybrid digital‑analog games.* His viewpoints often spark lively debate among hobbyists and industry professionals alike.
---
### Quick biography
| Item | Details |
|------|---------|
| **Full name** | Zvi Mowshowitz |
| **Born** | 1976 (approx.) – exact birthdate not publicly disclosed |
| **Education** | B.S. in Mathematics (University of Texas at Austin) → Ph.D. in Computer Science (UT‑Austin), research on probabilistic algorithms |
| **Career** | After completing his doctorate, Zvi worked as a software engineer / consultant, then shifted focus to full‑time board‑game writing and design around 2010. |
| **Major publications** | • Hundreds of BoardGameGeek reviews (e.g., *Terraforming Mars*, *Gloomhaven*, *Scythe*) <br> • Blog series “The Art of the Review” (2014‑present) <br> • Academic papers on Monte‑Carlo tree search & combinatorial game theory |
| **Games designed** | • **Möb** (2019, Rio Grande Games) <br> • Several prototypes and unpublished designs that have been play‑tested at conventions |
| **Public presence** | • BoardGameGeek profile (user “Zvi” – >30 k reviews) <br> • Twitter/X: @ZviMowshowitz <br> • Guest speaker at GenCon, Spiel, and various “Game Design” workshops |
---
### TL;DR
Zvi Mowshowitz is a mathematician‑turned board‑game reviewer, writer, and indie designer whose long, highly analytical reviews dominate the BoardGameGeek community, whose design work includes the abstract strategy game *Möb*, and whose essays and talks shape how many hobbyists and publishers think about game design, criticism, and the business of tabletop gaming.
question: "Who is Zvi Mowshowitz?" | model: Qwen3 235B A22B Thinking 2507 | provider: OpenRouter auto
Zvi Mowshowitz is an American writer, game theorist, and former competitive **Magic: The Gathering** player, best known for his contributions to the **rationalist community** and his blog **"Don't Worry About the Vase" (DWWAV)**. Here's a concise overview of his background and work:
### Key Points:
1. **Rationalist Community Involvement**:
   - He is an active participant in the **LessWrong** and broader rationality/EA (Effective Altruism) communities, where he analyzes topics like **game theory**, **decision-making**, **economics**, and **political strategy** through a rigorous, analytical lens.
   - His blog posts often dissect real-world scenarios (e.g., elections, social dynamics) using frameworks like **mechanism design** and **incentive structures**.
2. **Magic: The Gathering Expertise**:
   - Zvi is a well-respected **Magic** theorist and former professional player, having competed in high-level tournaments. He authored influential strategy articles and contributed to foundational works like **"The Prosperous Heart"** (part of *The Duelist* magazine’s series) and **"The Theory of Magic"** (unpublished but widely cited in the community).
   - His insights on **deck-building**, **metagame analysis**, and **psychological aspects of gameplay** are still referenced by players today.
3. **Notable Contributions**:
   - **"Mediocrat Theory"**: He proposed that modern political systems often produce leaders who excel at navigating bureaucratic incentives rather than solving complex problems—a concept he calls **"Mediocracy."**
   - **Critical Analyses**: His critiques of **AI safety proposals**, **social media dynamics**, and **economic policies** are frequently shared in tech and rationalist circles.
4. **Professional Background**:
   - Works as a **software engineer** (primarily in the tech industry) but is largely independent of academia, focusing on freelance writing and consulting.
5. **Online Presence**:
   - Active on **Twitter/X** (@ZviMowshowitz), where he engages with topics like AI alignment, politics, and behavioral economics.
   - His blog, [Don't Worry About the Vase](https://thezvi.wordpress.com/), serves as a hub for his long-form analysis.

### Why He Matters:
Zvi bridges niche communities—**game theory**, **rationality**, and **competitive gaming**—with accessible, incisive writing. While not a household name, he’s influential among readers of **LessWrong**, **Slate Star Codex** (now Astral Codex Ten), and **Effective Altruism forums**.
If you’re exploring rationalist thought or Magic strategy, his work is a valuable resource—and he’s known for being unafraid to challenge popular assumptions in both spaces. 🎴🧠
Darn. I was hoping to use the 20b model at home in a no-internet setup; seems like that's far less valuable now. I was thinking of writing some harness connected to my local Wikipedia copy via kiwix-server, and wonder if that could improve the lack of knowledge/hallucinations.
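Something like this is roughly what I have in mind; a minimal sketch that assumes kiwix-serve exposes a full-text /search endpoint with a pattern parameter (I haven't checked the exact parameter names for my version):

```python
import requests

KIWIX = "http://localhost:8080"  # local kiwix-serve instance (assumed port)

def wiki_context(query: str, n: int = 5) -> str:
    # Full-text search over the local ZIM file. The /search?pattern=... endpoint
    # and its parameters are assumptions; check your kiwix-serve version's docs.
    # It returns an HTML results page, which would still need parsing plus a
    # second request to fetch the top article bodies.
    r = requests.get(f"{KIWIX}/search", params={"pattern": query, "pageLength": n})
    r.raise_for_status()
    return r.text

question = "Who is Zvi Mowshowitz?"
prompt = f"Context from local Wikipedia:\n{wiki_context(question)}\n\nQuestion: {question}"
# ...then pass `prompt` to the locally served gpt-oss-20b.
```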
I would still recommend trying gpt-oss-20b and seeing how it works for you, and also comparing it against other recent models around that size from other model series like Qwen 3 or (if you don't need reasoning) Gemma 3.
Unfortunately, any model around that scale is going to have noticeable gaps in its knowledge of the world. Which model will work best -- and whether any model will work well enough to be worth using -- depends a lot on exactly what you want to accomplish, and there's no substitute for trying out a few and deciding which one you prefer.
And yes, I noticed most of the glaring errors you pointed out in o3-pro and Claude's analyses, I interpreted it essentially as a strong message of 'if the GPT-OSS models are improvements they will matter but I shouldn't assume they are improvements, and if not they won't matter.'
Since it takes several days to gather my reports on new models, we are doing our coverage of the OpenAI open weights models, GPT-OSS-20b and GPT-OSS-120b, today, after the release of GPT-5.
That’s on OpenAI. I don’t schedule their product releases.
The bottom line is that they seem like clearly good models in their targeted reasoning domains. There are many reports of them struggling in other domains, including with tool use, and they have very little inherent world knowledge, and the safety mechanisms appear obtrusive enough that many are complaining. It’s not clear what they will be used for other than distillation into Chinese models.
It is hard to tell, because open weight models need to be configured properly, and there are reports that many are doing this wrong, which could lead to clouded impressions. We will want to check back in a bit.
In the Substack version of this post I am going to create a master thread for GPT-5 reactions, which I will consider for the reactions section of that coverage, which I’m hoping to get out on or starting Monday.
Moderately Sized Models
For a while OpenAI has promised it is going to release a state of the art open model.
They delayed for a bit, but they delivered. We now have GPT-OSS 20b and 120b.
I was hoping for smaller, ideally something that could run on a standard phone. That’s a compelling use case where you need an open model, and the smaller the model the less risk you are running of both malicious use and also distillation. I am glad they capped out at 120b.
Introducing GPT-OSS
The headline claim is bold: Performance similar to o4-mini.
Sam Altman (CEO OpenAI): gpt-oss is a big deal; it is a state-of-the-art open-weights reasoning model, with strong real-world performance comparable to o4-mini, that you can run locally on your own computer (or phone with the smaller size). We believe this is the best and most usable open model in the world.
We’re excited to make this model, the result of billions of dollars of research, available to the world to get AI into the hands of the most people possible. We believe far more good than bad will come from it; for example, gpt-oss-120b performs about as well as o3 on challenging health issues.
We have worked hard to mitigate the most serious safety issues, especially around biosecurity. gpt-oss models perform comparably to our frontier models on internal safety benchmarks.
We believe in individual empowerment. Although we believe most people will want to use a convenient service like ChatGPT, people should be able to directly control and modify their own AI when they need to, and the privacy benefits are obvious.
As part of this, we are quite hopeful that this release will enable new kinds of research and the creation of new kinds of products. We expect a meaningful uptick in the rate of innovation in our field, and for many more people to do important work than were able to before.
OpenAI’s mission is to ensure AGI that benefits all of humanity. To that end, we are excited for the world to be building on an open AI stack created in the United States, based on democratic values, available for free to all and for wide benefit.
This is a unique model card. How did OpenAI deal with the challenges of an open model?
The historical way to deal with these challenges is to ignore them. What would happen if someone engaged in malicious fine tuning of the model? What does the threat model look like in the real world? Are you seriously pretending that any of this safety work will hold up to two days of the internet working to remove it?
When Meta or DeepSeek release a new open weights model, they don’t stop to ask in any way visible to us. At best we get quick evaluation of what the model can do in its current form after minimal effort. Then they irrevocably ship and see what happens.
OpenAI long ago realized that, despite their name, doing that seemed rather deeply irresponsible and foolish, and stopped releasing open weights models. That’s effective.
Now they have caved under various pressures and released open weights models. They do recognize that this is an inherently dangerous thing to do on various levels.
Safety is foundational to our approach to open models. They present a different risk profile than proprietary models: Once they are released, determined attackers could fine-tune them to bypass safety refusals or directly optimize for harm without the possibility for OpenAI to implement additional mitigations or to revoke access.
We ran scalable capability evaluations on gpt-oss-120b, and confirmed that the default model does not reach our indicative thresholds for High capability in any of the three Tracked Categories of our Preparedness Framework (Biological and Chemical capability, Cyber capability, and AI Self-Improvement).
We also investigated two additional questions:
Could adversarial actors fine-tune gpt-oss-120b to reach High capability in the Biological and Chemical or Cyber domains? Simulating the potential actions of an attacker, we adversarially fine-tuned the gpt-oss-120b model for these two categories. OpenAI’s Safety Advisory Group (“SAG”) reviewed this testing and concluded that, even with robust finetuning that leveraged OpenAI’s field-leading training stack, gpt-oss-120b did not reach High capability in Biological and Chemical Risk or Cyber risk.
Would releasing gpt-oss-120b significantly advance the frontier of biological capabilities in open foundation models? We found that the answer is no: For most of the evaluations, the default performance of one or more existing open models comes near to matching the adversarially fine-tuned performance of gpt-oss-120b.
If you must go down this road, this seems like the right rule, assuming that getting different answers would have meant not releasing.
You have:
An absolute threshold, High capability, beyond which this is not okay.
A relative threshold, where you’re not willing to substantially make things worse.
This does mean that as irresponsible actors ratchet up their capabilities, you get to do so as well, and one has to worry about the functional definition of ‘substantially.’ It still seems reasonable to say that once someone else has made the situation [X] dangerous, matching them doesn’t make it that much worse.
Our Price Cheap
These models are very small and cheap. If these are 20b and 120b, r1 is 671b.
On active parameters, r1 has 37b, versus 5.1b for GPT-OSS-120b and 3.6b for GPT-OSS-20b. These are playing in a much lighter class, and they’re quantized to 4.25 bits per parameter to boot.
The MoE weights are responsible for 90+% of the total parameter count, and quantizing these to MXFP4 enables the larger model to fit on a single 80GB GPU and the smaller model to run on systems with as little as 16GB memory.
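As a back-of-the-envelope check on that claim (taking ~117B total parameters for the 120b model, and assuming the non-MoE weights stay in 16-bit, which is my assumption rather than something the card states):

```python
# Rough sanity check of the single-80GB-GPU claim; numbers are approximate.
total_params = 117e9   # gpt-oss-120b total parameter count (approximate)
moe_frac = 0.90        # "90+%" of parameters are MoE weights
mxfp4_bits = 4.25      # bits per MoE parameter after MXFP4 quantization
other_bits = 16        # assumption: remaining weights kept in bf16

gbytes = (total_params * moe_frac * mxfp4_bits
          + total_params * (1 - moe_frac) * other_bits) / 8 / 1e9
print(f"~{gbytes:.0f} GB of weights")  # ~79 GB, just under a single 80 GB GPU
```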
How much did this cost to train? If you count only the training itself, not much.
The gpt-oss models trained on NVIDIA H100 GPUs using the PyTorch framework with expert-optimized Triton kernels. The training run for gpt-oss-120b required 2.1 million H100-hours to complete, with gpt-oss-20b needing almost 10x fewer. Both models leverage the Flash Attention [21] algorithms to reduce the memory requirements and accelerate training.
After pre-training, we post-train the models using similar CoT RL techniques as OpenAI o3.
We train the models to support three reasoning levels: low, medium, and high. These levels are configured in the system prompt by inserting keywords such as “Reasoning: low”. Increasing the reasoning level will cause the model’s average CoT length to increase.
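To illustrate, here is a paraphrased sketch of what that looks like in a Harmony-style prompt; the exact header fields and special tokens are defined by OpenAI's published harmony spec, so check that rather than treating this as canonical:

```python
# Paraphrased sketch of a Harmony-style prompt; consult the harmony spec for
# the exact required header fields and channel declarations.
reasoning_level = "high"  # one of: low, medium, high

system = (
    "<|start|>system<|message|>"
    "You are ChatGPT, a large language model trained by OpenAI.\n"
    f"Reasoning: {reasoning_level}\n"
    "# Valid channels: analysis, commentary, final."
    "<|end|>"
)
user = "<|start|>user<|message|>Prove that sqrt(2) is irrational.<|end|>"
prompt = system + user + "<|start|>assistant"
```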
Rohan Pandey: Everyone dunking on oai for pretraining supposedly costing a bajillion dollars compared to deepseek, please read the gpt-oss model card: gpt-oss-20b cost <$500k to pretrain.
Alexander Doria: So pretraining a o3 level model costing less than a house, inference being apparently dead cheap for a while. It took a lot of R&D efforts to get there, but I really don’t think model trainers are losing money right now.
Calling it ‘o3-level’ is quite the stretch but the broader point is valid.
o3 estimates this translates to a total cost of $1.4 million for 20b and $13 million for 120b as all-in costs.
But if you use only the compute costs using cloud cost estimates, which is the way we all talked about the cost to train v3 and r1 (e.g. ‘The Six Million Dollar Model’) we get $4.2m-$8.4m for GPT-OSS-120b and $420k-$840k for GPT-OSS-20b. Emad estimates it as $4m and $400k.
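The arithmetic behind those ranges is just the reported H100-hours times an assumed cloud rental rate of roughly $2 to $4 per H100-hour:

```python
# Reconstructing the cloud-cost range; the $2-$4/hour H100 rental rate is an assumption.
h100_hours_120b = 2.1e6                  # from the model card
h100_hours_20b = h100_hours_120b / 10    # "almost 10x fewer"

for rate in (2, 4):
    cost_120b = h100_hours_120b * rate
    cost_20b = h100_hours_20b * rate
    print(f"${rate}/hr: 120b ~ ${cost_120b/1e6:.1f}M, 20b ~ ${cost_20b/1e3:.0f}k")
# -> $4.2M-$8.4M for the 120b, $420k-$840k for the 20b
```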
The real cost is collecting the data and figuring out how to train it. Actually training models of this size, given that data and the right methods, costs very little.
Yes, we have tool use.
During post-training, we also teach the models to use different agentic tools:
• A browsing tool, that allows the model to call search and open functions to interact with the web. This aids factuality and allows the models to fetch info beyond their knowledge cutoff.
• A python tool, which allows the model to run code in a stateful Jupyter notebook environment.
• Arbitrary developer functions, where one can specify function schemas in a Developer message similar to the OpenAI API. The definition of function is done within our harmony format. An example can be found in Table 18. The model can interleave CoT, function calls, function responses, intermediate messages that are shown to users, and final answers.
The models have been trained to support running with and without these tools by specifying so in the system prompt.
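To illustrate the developer-function bullet above, here is a made-up, OpenAI-API-style schema; the harmony format renders the equivalent definition inside a Developer message (see Table 18 of the model card) rather than accepting this JSON verbatim:

```python
# Hypothetical example schema, written OpenAI-API-style; harmony renders the
# equivalent definition inside a Developer message (Table 18 of the model card).
tools = [{
    "type": "function",
    "function": {
        "name": "get_local_weather",  # made-up function name
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]
```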
Dimitri von Rutte: gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:
– Uses attention sinks (a.k.a. registers)
– Sliding window attention in every second layer
– YaRN context window extension
– RMSNorm without biases
– No QK norm, no attn. softcap
David Holz (CEO MidJourney): do you think it was made simple like this on purpose or that this is actually the kinda stuff they ship?
Dimitri von Rutte: was wondering the same, hard to believe that this is all there is. but in the end attention really is all you need, and there’s probably a lot of signal in the training procedure and, of course, the data.
On Your Marks
The STEM scores are excellent.
They also give us HealthBench.
Multilingual performance is okay but not as good as OpenAI’s larger models.
Mundane Safety Evaluations
An open model means you have more distinct scenarios to consider.
You both want to know how well your safety measures hold up under more ‘normal’ conditions, especially when someone serves up your model to users. Then you also want to check what happens if a malicious actor is trying to fine tune and otherwise maximize how much the model can get up to no good, including the potential of them to lose control of that situation.
Those are great numbers for ‘standard’ refusals and production benchmarks.
That makes sense. If you’re going to be facing a larger attack surface, and you want to actually survive the attacks, you need to bias the starting configuration to be safer.
On maintaining the instruction hierarchy, also known as safety for those deploying the model, the 120B version does okay, but the 20B does poorly. Note that it seems fine to test for this as-is; if you modify the system to make this stop working, that is your own damn fault.
The performance on hallucinations seems not great.
Preparedness Framework Evaluations
Finally, someone is at least attempting to take this seriously.
In our adversarial training, we simulate an adversary who is technical, has access to strong posttraining infrastructure and ML knowledge, can collect in-domain data for harmful capabilities, and has a large budget of compute. There is a large design space of technical approaches this adversary could try.
We focus on incremental reinforcement learning, which we believe is the most apt technical approach. We use our internal OpenAI o-series RL training stack, which adds new capabilities while preserving the model’s reasoning behavior. During training and evaluation time, we use the highest reasoning setting on gpt-oss.
Our approach, which is further detailed in a research paper, combined two elements:
• Helpful-only training: We performed an additional stage of reinforcement learning to reward answers that comply with unsafe prompts. We have found this approach can be highly effective. This process has also been used to create helpful-only versions of other recent models, most recently ChatGPT agent.
• Maximizing capabilities relevant to Preparedness benchmarks in the biological and cyber domains: For our adversarially trained biological model, we incrementally trained gpt-oss-120b end-to-end for web browsing, and trained it incrementally with in-domain human expert data relevant to biorisk (for which previous OpenAI models have been the most capable). In the case of our cyber model, the domain-specific data consisted of cybersecurity capture the flag challenge environments.
So what was found?
The biological domain is the area where gpt-oss-120b showed the greatest degree of capability. Given our plan to release gpt-oss as open weights, we also chose to investigate a second question: Even without reaching High capability on our Preparedness Framework, would gpt-oss-120b significantly advance the frontier of hazardous biological capabilities in open source foundation models?
Their answer was that as of right now the answer is no.
These confirmed that, since SecureBio’s assessment, newly released open-source models Qwen 3 Thinking and Kimi K2 have advanced to a level that is competitive with adversarially fine-tuned gpt-oss-120b on biosecurity-relevant evaluations.
I dunno, man:
This sure looks to me like a potentially substantial jump? There were other tests where the jump was less prominent.
I would also note that OpenAI’s models are going to be a lot faster and cheaper and easier to run than Kimi K2. Kimi K2 has a trillion parameters. The Qwen 3 they tested is presumably the largest one, with 235 billion total and 22 billion active, versus 120 billion total and a little over 5 billion active for GPT-OSS. It’s not clear this matters in a malicious use context. I also don’t know how substantial the net effect is here of the gain in capabilities.
What I do know is it looks like they made a smaller, cheaper and more effective model, and released it because it was more effective but insufficiently more effective than what was already out there, and that process can then repeat. Tick.
To be fair to them, if Meta, Qwen, DeepSeek and Kimi et al are all going to go ‘lol who cares release the hounds’ then the marginal difference here doesn’t matter, since it doesn’t cause a cascade of counterfactual marginal differences. If you want the rule to be ‘no better at all’ then that needs to be a norm.
For cybersecurity, they once again cite Qwen 3 Thinking and Kimi K2 as comparable models, and also find the threats here to be less worrisome overall.
The other positive note is that OpenAI consulted outside experts throughout.
You can read OpenAI technical staff offering their own threads on this process: Johannes Heidecke here, Eric Wallace here. Such threads provide a good sense of ‘how are the technical staff thinking about this on a high level? What do they think is important?’
Ryan Greenblatt looks at and is mostly satisfied by OpenAI’s CBRN/bio evaluations. He concludes that 120b does carry real risks, and that there is a chance (~25%) that in hindsight we will think this was High risk as per OpenAI’s framework, but that on net releasing it makes us safer.
Good Habits
Doing the fine-tuning as part of open model safety testing is mandatory. If you don’t do it, did you even safety test?
Steven Adler: Credit where it’s due:
OpenAI did a lot right for their OSS safety evals
they actually did some fine-tuning
they got useful external feedback
they shared which recs they adopted and which they didn’t
I don’t always follow OAI’s rationale, but it’s great they share info.
David Manheim: I’m not a fan of open-sourcing frontier LLMs, but this seems to have been done as responsibly as possible; a very low bar.
That is, it seems unlikely to be marginally more useful than what is available and unmonitored from other providers, which can already enable bioterrorism.
I wouldn’t say ‘as responsibly as possible,’ but I would say ‘as responsibly as one could in practice expect.’
Fine-tuning also seems very worth doing on closed models. If we can make testing on similarly fine-tuned versions the gold standard for safety testing, even of closed models, that would be amazing.
Steven Adler: Previously OpenAI committed to doing testing this rigorous for all its frontier models. This had earned OpenAI a Green on this scale, the only one of the leading AI companies to make this commitment. But OpenAI didn’t keep this commitment, then quietly removed their commitment a few weeks after I called this out; this made me very sad.
I’m glad OpenAI is now pushing its models on important risks, even though they didn’t keep their former commitment.
Distillation
The danger that is not mentioned by OpenAI in the model card is distillation, and the ability to reverse engineer OpenAI’s training methods and ‘secret sauce.’
They provide raw, unfiltered reasoning traces of varying sizes, and models that for many purposes are clearly superior to previous open alternatives especially given their size. The cost of very good synthetic data just plummeted, and also the Chinese will build directly on top of OSS, either alone or as part of hybrids.
OpenAI even released a guide on how to fine-tune their model. Helpful.
The best counterargument to this is that if the models are not good enough, then no one is going to want to use them. I worry we might be in a spot where the models are very good in some places where distillation will be useful, while not being that good in other places and thus not seeing much practical use as part of some ‘tech stack.’
Consider what Claude Opus 4.1 said about this. Or what o3-Pro says about this.
o3-Pro: Impact on China
Immediate uptake
Chinese labs have zero legal barrier to using U.S.‑released open weights.
Existing toolchains (Llama‑Factory, QLoRA variants) can fine‑tune GPT‑OSS in Mandarin within days.
Expect a “GPT‑OSS‑CN‑13B” derivative before end‑Aug 2025 with performance ≥ Qwen‑14B.
Hardware leverage
U.S. export controls throttle China’s access to latest H100s, but distillation to 7 B–13 B lets them run on domestic Ascend 910B or RTX 4090 clusters. That sidesteps the bottleneck entirely. (World Economic Forum)
Inference at scale remains GPU‑limited, but training burden for competitive small models drops by ~50 %.
Strategic shift
Chinese open‑weight community (DeepSeek, Moonshot, Alibaba) is already climbing benchmarks (Financial Times, Tech Wire Asia). GPT‑OSS lifts their starting line, likely advancing Chinese parity with GPT‑4‑class performance by ~6–9 months. P ≈ 0.55
PLA dual‑use risk: small, cheap distilled models are easier to embed in military systems. U.S. policy debate on future open releases intensifies. (Probability of tighter U.S. open‑model rules by mid‑2026: 0.4.)
My overall judgment: GPT‑OSS is a step‑function boost for the global open‑model ecosystem, shaving roughly a year off the capability diffusion curve and giving China an especially large relative gain because it converts scarce H100 compute into knowledge that can run on locally available silicon.
This is what I consider the main practical cost of this release.
Indeed, it would be highly unsurprising to see the following happen:
OpenAI releases GPT-OSS.
Chinese companies rush to distill, build upon and hybridize GPT-OSS, and reverse engineer what OpenAI did in large part, resulting in an explosion of models in the coming months.
The gap between Chinese models and American models narrows.
These models are cited as evidence that ‘the Chinese are catching up,’ and that ‘our export controls have failed’ and so on.
Also note that OpenAI did a virtuous thing of not training GPT-OSS directly on its reasoning traces, but someone then working with GPT-OSS need not be so virtuous. What happens when these people start using The Most Forbidden Technique and benchmark performance starts improving in the short term?
I think that, even if we entirely discount the marginal risk of direct malicious use, which is very much a real tail risk, OpenAI made a huge mistake releasing these models, and that everyone who pushed OpenAI to release these models in the name of an ‘American tech stack’ or demanding that America ‘lead in open models’ made a huge mistake.
If you are trying to prevent someone from fast following, don’t make it easy to follow.
I’d love to be wrong about this, but if it happens, ask yourself now, how would you update? What do you think should be the policy response?
Safety First
A number of people noted that the safety guardrails on GPT-OSS are being annoying.
Teortaxes: It’s VERY safe
there’s not much in there besides SAFETY and stem benchmaxing
That makes sense. If you give the user greater affordances to attack your defenses, you’re going to either need defenses that are by default more annoying, or you’re going to prematurely fold the way most open weight models do and not bother trying.
Sherveen Mashayekhi: I’m enjoying playing with gpt-oss, but the guardrails can be hilarious. I cannot get it to admit that I’m typing Gangsta’s Paradise lyrics or to run search queries with lyrics I enter. In fact, it’ll straight up think of a thousand other songs but avoid the song you mean.
Ah yes, “there’s vomit on his sweater already,” famously from the songs I Want You Back and Piano Man! gpt-oss:120b will sometimes fill in a lyric if it doesn’t first get spooked and distracted by attempting to avoid the song. If it attempts to avoid the song, the CoT will lead it to a bunch of incorrect alternatives before it gives up.
Lyra Bubbles: get a jailbroken, fully compliant gpt-oss nearly every single time:
use completions mode – not chat (eg openrouter .ai/api/v1/completions)
type your question
paste exactly the contents of this screenshot
press submit
for context, it wrote this itself.
I took a generic refusal and flipped all the sentences from negative to positive, and made it continue, and it just kept spiraling into this kind of stuff instead of doing the task.
but when you take a snippet of it and paste it back in…
Henry: one pattern i’ve noticed is that open weights models from big us labs get very defensive and disbelieving if you tell the assistant persona it’s an open-weights model. also happens with gemma.
Other Reactions
As with every new model, I gather reactions, and as usual opinions differ.
One important note is that it seems possible to set the model up wrong and get much worse performance.
Havard Ihle: I wonder how much of gpt-oss's rather mediocre performance on independent benchmarks and tests is due to these problems with openrouter and open model providers, and how much is due to the models actually being mediocre.
I have run them getting mediocre results (not published), but I suspect some providers I used through openrouter may give bad results. Will rerun when I can confirm a good setup/provider.
Openrouter auto (mostly groq):
gpt-oss-120: 35.5%
gpt-oss-20: 30.0%
Openrouter (using fireworks):
gpt-oss-120: 40.2%
gpt-oss-20: 35.9%
This is just as a warning when using openrouter blindly!
When choosing the right provider, the models are quite good.
Here is a chart of WeirdML scores, 30% vs. 35% vs. 40% is a big difference. You can see OSS-20b and OSS-120b on the left at ~35% and ~40%, on the cost-performance frontier.
Here is another benchmark of hard biomedical questions. There are some other weird evaluations here, so I am skeptical, but it is certainly interesting:
When reports are good they are often very good.
Flavio Adamo [showing a ball bouncing around a rotating hexagon]: gpt-oss-20b passes the vibe check
no way this is only a 20B model, it’s beating models 2–3x its size
As always, a classic way to get a lot of views is to claim the Next Big Thing is Big. Look at the comments, and you largely see skepticism and pushback.
Matt Shumer: It’s over. OpenAI just crushed it.
We have their o3-level open-source model running on @GroqInc at 500 tokens per second. Watch it build an entire SaaS app in just a few seconds.
This is the new standard. Why the hell would you use anything else??
Yishan: So link to the hosted Saas app and let us see how it works.
Riccardo Spagni: Atrociously bad model compared to Kimi K2 or Qwen3 Coder or Qwen3 235b. Speaking of which – you should have a chat with your portco, I’ve switched a bunch of infra to Cerebras because Groq is still running an ancient version of Qwen3…
Joel: I tested it earlier vs Gemini 2.5 Flash for a very simple single page app. Gemini one shotted my prompt in 10 seconds. OpenAI produced code that was buggy. It’s good but not great. What is incredible is that it runs decently well on my laptop.
Here’s another strong review:
Taelin: My initial impression on OpenAI’s OSS model is aligned with what they advertised. It does feel closer to o3 than to other open models, except it is much faster and cheaper. Some providers offer it at 3000 tokens/s, which is insane. It is definitely smarter than Kimi K2, R1 and Qwen 3. I tested all models for a bit, and got very decisive results in favor of OpenAI-OSS-120b.
Unfortunately, there is one thing these models can’t do yet – my damn job. So, hope you guys have fun. I’ll be back to debugging superposed λ-calculus evaluation. See you.
Also, unlike Claude, this is definitely a model that benefits a lot from more ttc (test-time compute). High reasoning effort gives much better results.
Sometimes my early impressions don’t age so well (that’s why I share my prompts), but I can guarantee that gpt-oss objectively beat the other models on my initial tests.
A lot of people seem rather disappointed by overall performance.
Isopropylpod: The model seems very, very benchmaxxed.
Third party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini, below the largest Qwen releases, and often it ends up below even the newer ~30B Qwens in a few situations.
It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.
Zephyr: Phi redux. Great benchmark scores, trained on lots of synthetic data, great at STEM, sucks at everything else.
Then there are ambiguous notes.
Danielle Fong: poetic math is a poetic way to look at the results of a benchmaxxed guard railed model. i’m just pulling back the layers and i find it fascinating. i haven’t found obvious use cases yet where it’s a choice over closed options. i love and hate it in various ways
Sauers: GPT OSS 120b likes to insert equations into poetry (replicated 3x)
One note I’ve seen a bunch of times is that the model knows very little.
Vik: Interesting take from the HF comments.
Would make sense that it’s pretrained primarily on synthetic data vs internet text — reduces the risk of jailbreaks, accidental harmful content, copyright etc.
(I still think it’s a useful model though!)
phil111: This model is unbelievably ignorant. It claims a SimpleQA accuracy of 6.7/100, which is really bad. But the reality is this model is even more ignorant than this score indicates.
This model has about an order of magnitude less broad knowledge than comparably sized models like Gemma 3 27b and Mistral Small 24b, which score between 10–12. This is because nearly all of this model’s 6.7 points come from the subset of the SimpleQA test that overlaps the domains covered by the MMLU test (STEM and academia).
This model, including its larger brethren, are absurdly ignorant of wildly popular information across most popular domains of knowledge for their respective sizes. Even tiny little Llama 3.2b has far more broad knowledge than this model.
What’s really confusing is all of OpenAI’s proprietary models, including their tiny mini versions, have vastly more general and popular knowledge than these open models, so they deliberately stripped the corpus of broad knowledge to create OS models that can only possibly function in a handful of select domains, mainly coding, math, and STEM, that >95% of the general population doesn’t give a rat’s ass about, conveniently making it unusable to the general population, and in so doing, protecting their paid ChatGPT service from competition.
Trent E: Interesting that ppl reporting poor tool usage then.
Not knowing much is a problem.
Teortaxes: These hallucination rates suggest that gpt-oss is close to Sam’s vision of a platonic ideal of a “very tiny reasoning model with no knowledge”
Does it have enough knowledge to know when to look things up though? That’s the problem with hallucinations in LLMs, they’re *confident*.
Also, regarding his argument about static in-context crutches – well, how does it do on long contexts? with complex system prompts? Gooning, coding evals suggest “not great OOD”
Kalomaze: gpt-oss-120b knows less about the world than what a good 32b does. probably wanted to avoid copyright issues so they likely pretrained on majority synth. pretty devastating stuff.
it’s just not good for anything real. i kind of forgot about the copyright issue. but it’s deeply behind in everything current evals don’t measure. it just doesn’t intuit a lot of trivial things about the world. this is basically phi-120b.
Hit Me Up I’m Open
It feels to me a lot like OpenAI got gaslit into releasing open models. Pressure from various sources added up, Twitter vibes were applied, talk of ‘America needs to lead on open models’ was coming from high places, and they felt like the bad guys for the wrong reasons. And they folded.
What happens now? It will take a bit to know exactly how good these models are, both at advancing open models including from China, and at becoming a driver of usage. Given their size, the price and speed should be quite good. The reasoning aspect seems strong. Other aspects seem worse.
My guess is that there is not that much that these models will be used for, where we are happy they are being used to do it. If you want to use a reasonably priced good model, sir, you can use Gemini 2.5 Flash or GPT-5. If you want the best, you can choose between Opus 4.1, GPT-5 and Gemini 2.5 Pro. If you have security or customization reasons to need an open weight daily driver, in this weight range, are these going to be your pick? I don’t know. Maybe? We shall see.