John Pressman: So what stands out to me about [Kimi K2] is that it doesn’t do the thing language models normally do where they kind of avoid detail? Like, a human will write about things using specific names and places.
And if you pay close attention to LLM writing they usually avoid this. It’s one of the easiest ways to spot LLM writing. This model emphatically *does not* have this problem. It writes about people and events with the rich detail characteristic of histories and memoirs. Or fictional settings with good worldbuilding.
That comment is spot on. The model is so incredibly specific, as if it wants to cram more facts than words into the text, which makes for writing I find engaging to read.
Unfortunately that also means it's extremely happy to confabulate all the time, especially when pushed to the boundary of its knowledge. It just (half-)made up a ton of facts. Is that what it's like to work with o3‽ Never being able to trust the model not to make things up? I might've been spoiled by the Claudes, then…
It's been a while since I checked: can any of the frontier models pass the 2-4-6 task for a novel rule yet?
Grok 4 was able to guess my rule of "three rational numbers." Haven't tested out other models yet.
https://grok.com/share/c2hhcmQtMw%3D%3D_748b1b41-eda9-4619-868e-5bb4cb022d50
EDIT: Claude Opus 4 is also able to guess the rule on the first attempt.
https://claude.ai/share/4dcd8fcf-4fcb-4d48-a18f-70c56a9c4be7
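For readers unfamiliar with it, the 2-4-6 task is Wason’s classic rule-discovery experiment: the tester holds a hidden rule about triples of numbers, the subject proposes triples and gets yes/no feedback, then must guess the rule. A minimal harness sketch (using Wason’s original “strictly ascending” rule as the hidden rule; the threads above tested novel rules instead, like “three rational numbers”):

```python
def hidden_rule(triple):
    """Wason's original hidden rule: any strictly ascending triple."""
    a, b, c = triple
    return a < b < c

def probe(triples):
    """Return the oracle's yes/no feedback for each proposed triple,
    as a model being tested would receive it."""
    return [(t, hidden_rule(t)) for t in triples]

# A model prone to confirmation bias only proposes triples matching its
# current hypothesis (e.g. "counting up by 2"); passing requires probing
# triples that would FALSIFY it, like (1, 2, 3) or (6, 4, 2).
feedback = probe([(2, 4, 6), (1, 2, 3), (6, 4, 2)])
```

The interesting failure mode the task measures is whether the guesser ever tries triples designed to get a “no,” rather than only confirming its first hypothesis.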
While most people focused on Grok, there was another model release that got uniformly high praise: Kimi K2 from Moonshot.ai.
It’s definitely a good model, sir, especially for a cheap-to-run open model.
It is plausibly the best model for creative writing, outright. It is refreshingly different, and opens up various doors through which one can play. And it proves the value of its new architecture.
It is not an overall SoTA frontier model, but it is not trying to be one.
The reasoning model version is coming. Price that in now.
Introducing Kimi K2
Introducing the latest model that matters, Kimi K2.
K2 was trained with the Muon optimizer, which makes it a unique offering. There were claims that the method would not scale or would be unstable; Kimi seems to have proven those claims false.
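For context on what makes Muon different: as publicly described, it applies momentum to the gradient of each weight matrix and then approximately orthogonalizes that update via a Newton-Schulz iteration, rather than rescaling per-coordinate the way Adam does. A minimal sketch under those assumptions (the polynomial coefficients follow the open-source write-up; the real implementation also handles rectangular matrices, mixed precision, and distributed sharding, and Kimi's scaled variant adds further adjustments):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Push the matrix g toward the nearest (semi-)orthogonal matrix,
    using the quintic Newton-Schulz iteration from the Muon write-up.
    Sketch only: assumes a square matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public repo
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: momentum, then orthogonalize the update direction."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return weight - lr * update, momentum
```

The design intuition is that orthogonalizing the update equalizes its singular values, so no single direction in the weight matrix dominates the step.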
K2 takes DeepSeek’s extreme mixture of experts (MoE) with 671B total parameters and goes a bit further, taking the total size to 1T.
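The core mechanic of such an extreme MoE is that only a small fraction of the 1T parameters are active per token: a router scores the experts and each token is processed by just its top-k. A toy sketch of that routing (all dimensions and the expert count here are illustrative, not K2's actual configuration, which also includes details like shared experts and load balancing):

```python
import numpy as np

def topk_moe_forward(x, expert_weights, router_weights, k=2):
    """Route each token to its top-k experts and mix their outputs.
    x: (tokens, d); expert_weights: (n_experts, d, d); router_weights: (d, n_experts)."""
    logits = x @ router_weights                  # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())    # softmax over chosen experts
        gates /= gates.sum()
        for gate, e in zip(gates, top[t]):
            out[t] += gate * (x[t] @ expert_weights[e])
    return out
```

With k experts active out of hundreds, compute per token scales with the active parameters, not the total, which is how a 1T-parameter model stays cheap to run.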
Despite that size you can get it running on Groq; Teortaxes reports 185 tokens/second there at full context, and Aarush Sah says they then made it even faster than that.
Having a Moment
By all accounts Kimi K2 is excellent for its size and cost, and at least competitive with DeepSeek’s v3, with many saying K2 is clearly ahead.
Presumably a reasoning model is coming. Please adjust your expectations (and if desired your stock portfolio) in advance of that event, and do not lose your head if they release an app with it and it gets popular for a time. Remember all the ways in which the DeepSeek Moment was misleading, and also the underreaction to v3. We do not want another massive overreaction to the wrong news.
I also once again warn against saying a release means a lab or country has ‘caught up’ if, at the time of the release, there are some aspects where the model is state of the art. There are those who actively prefer Kimi K2 over other models, even without reference to cost, especially for purposes related to creative writing. I can totally believe that the new method is excellent for that. A remarkable achievement. But keep that achievement in perspective.
Another Nimble Effort
Once again, an impressive result was made on the cheap by a modest team.
My guess is that we primarily don’t do it because we don’t do it, but also because restrictions breed creativity and we don’t have to do it, and because we don’t have the incentive, or especially the felt incentive, to do it.
As in, if you are in China, then building a cheap (to train, and to run) model is on top of a short list of candidates for The Thing You Do in the space. Then you release it, with a basic clean implementation, and let others worry about features. A huge part of the motivation behind releasing these models is national prestige and national competition. Everyone around you is egging you on as is the government. That is a highly asymmetrical motivation.
Whereas in America, you could try to do that, but why would you? If you can do this, you can get a better valuation, and make more money, doing something else. The profit margins on the ultimate offering are very low and usually zero. Your lunch could get eaten by a top lab at any time, since ultimately no one cares what it cost to train the model, and your lunch will expire quickly regardless. If you are one of the cracked engineers that would join such a team, you’ll get a better offer to join a different team doing something else. Even if you got close you’d likely do better getting acqui-hired. There’s no need to skimp on compute.
It will be interesting to see how well OpenAI does when they release an open model.
On Your Marks
Some basic ones:
Lech Mazur put Kimi through its paces. It did poorly on hallucinations, thematic generalization and extended word connections, and downright terribly in the elimination game of social skills. The system isn’t tuned for that sort of thing, but on short-story creative writing it is the new champion.
Harvard Ihle is there with WeirdML, where it does well for its price point as a non-reasoning open model, although grok-3-mini (high) is cheaper and scores higher, and r1-0528 keeps the open-model high score. But this metric favors reasoning models, so there’s a lot of room to improve here by adding reasoning.
This isn’t a benchmark, but it also sort of is one and it’s pretty cool:
Everybody Loves Kimi, Baby
Tyler Cowen simply asked ‘Kimimania?’ and the comments section was generally impressed by its performance.
Okay, Not Quite Everyone
There were only a few places people reported being a bit let down, other than by it not yet being a reasoning model.
Everyone Uses Kimi, Baby
Open and cheap and unique and new and pretty good is a great combination, also note the very low market share here for xAI and also for OpenAI. This isn’t overall market share, it’s in a very specific context, but Kimi is definitely breaking through.
Also this is another case where one should compare cost or compute, not tokens, since different models use radically different amounts of compute and have different orders of magnitude of cost. Anthropic’s share of tokens here represents quite a lot of the compute and dollars spent.
I see exactly why Teortaxes predicted this, yet so far I haven’t seen the reports of shortfalls, although various third-party benchmarks make it clear they are there:
Write Like A Human
It’s edging out o3 at the top there, followed by Opus, R1-old and then Sonnet. R1-0528 is solid but does substantially worse. Here’s EQ-Bench 3:
Given how other models score on these benchmarks, this appears meaningful.
I find ‘coherent’ rather funny as a greatest weakness. But hey.
Here’s the (a little too narrow?) slop test, as in ‘not x, but y.’ Lower is better.
Lech Mazur has it taking the #1 spot over o3, Gemini 2.5 Pro and Claude Opus in Short-Story Creative Writing.
Those are important weaknesses but we’ve definitely reached ‘horse can talk at all’ territory to get to this point.
It’s a strange metric. Gemma-3 does remarkably well and better than Gemini-2.5-Pro.
My instinctive guess is it is less about what data is being used, and more what data is not being used or what training isn’t being done.
Another hypothesis is that the bilingual nature of Chinese models makes them, if not better, at least different, and when you’re used to an ocean of slop different is great.
A lot of what makes a hack writer a hack writer is that they keep doing the same things over and over again, and eventually everyone is in some sense a hack. So having a different writer can be a breath of fresh air even if they are a hack.
You could kind of say that any given author or model, or almost any other form or genre of creative work, has a ‘time to slop,’ before a reader sees the patterns. And different variations use up different amounts of that ‘time to slop’ for others, and the American models all sound the same so they all burn that fuse together.
There is still very much better and worse, some things really are slop and some things really aren’t. I am inclined to believe that Kimi K2 is doing something fundamentally ‘less slop-like,’ but also I am guessing a lot of this is that it is different, not only via being Chinese and culturally different but because it was trained differently, and thus it feels fresh and new.
Right now we have 10,000 outputs, all the same. If we can instead get 10,000 outputs, all different, perhaps we’d have something.
What Happens Next
We will continue to see what Kimi K2 can do, how best to use it, what its weaknesses are, and how much of its refreshing nature is being better in places versus being different. It is too early, and I haven’t had time with it directly.
Presumably Kimi will use this to create a reasoning model. If they don’t, there’s nothing stopping someone else from doing so instead. So far we’ve seen a remarkable lack of independent reasoning model conversions, but they’re remarkably cheap to do.
We will also see what other labs can do now that this architecture has been proven. What could OpenAI, Google, Meta or xAI do if they copied these methods but used orders of magnitude more compute? If they integrated this into what they already do? If they used this as part of a MoE? I presume we will find out.