The most important thing about Llama 4 is that the promised 100K-H100 run got canceled, and its flagship model Behemoth will be a 5e25 FLOPs compute-optimal model[1] rather than the ~3e26 FLOPs model that a 100K-H100 training system should be able to produce. This is merely 35% more compute than Llama-3-405B from last year, while GPT-4.5, Grok 3 and Gemini 2.5 Pro are probably around 3e26 FLOPs or a bit more. They even explicitly mention that it was trained on 32K GPUs (which must be H100s). Since Behemoth is the flagship model, a bigger model got pushed back to Llama 5, which will only come out much later, possibly not even this year.
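As a sanity check on those cluster-scale numbers, here is a minimal back-of-the-envelope sketch (mine, not from the quoted analysis). It assumes roughly 1e15 FLOP/s of dense BF16 per H100, ~40% utilization, and a ~90-day run; all three figures are illustrative assumptions, not anything Meta has disclosed.

```python
# Back-of-the-envelope check of the cluster-scale claims above. Assumptions,
# for illustration only: ~1e15 FLOP/s of dense BF16 per H100, ~40% utilization
# (MFU), 86,400 seconds per day.

FLOPS_PER_H100 = 1e15   # assumed peak dense BF16 throughput per GPU
MFU = 0.40              # assumed model FLOPs utilization

def cluster_flops(num_gpus: int, days: float) -> float:
    """Total training FLOPs a cluster could deliver under these assumptions."""
    return num_gpus * FLOPS_PER_H100 * MFU * days * 86_400

# A ~90-day run on 100K H100s lands near the ~3e26 FLOPs figure:
print(f"100K H100s, 90 days: {cluster_flops(100_000, 90):.1e} FLOPs")  # ~3.1e26

# Behemoth's ~5.2e25 FLOPs needs only ~47 days on the stated 32K GPUs:
days_needed = 5.2e25 / cluster_flops(32_000, 1)
print(f"32K H100s for 5.2e25 FLOPs: ~{days_needed:.0f} days")
```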
In contrast, the capabilities of Maverick are unsurprising and prompt no updates. It's merely a 2e24 FLOPs, ~7x overtrained model[2], which is 2x less compute than DeepSeek-V3 and 100x less than the recent frontier models, and it's not a reasoning model for now. So of course it's not very good. If it were very good with this little compute, that would be a feat on the level of Anthropic or DeepSeek, which would be a positive update about Meta's model training competence; but this unexpected thing merely didn't happen, so nothing to see here, what are people even surprised about (except some benchmarking shenanigans).
To the extent Llamas 1-3 were important open weights releases that could be run by normal people locally, Llama 4 does seem disappointing, because there are no small models (in total params), though as Llama 3.2 demonstrated this might change shortly. Even the smallest Scout model still has 109B total params, meaning a 4-bit quantized version might fit on high-end consumer hardware, but all the rest is only practical with datacenter hardware.
288B active params, 30T training tokens gives 5.2e25 FLOPs by 6ND. At 1:8 sparsity (2T total params, maybe ~250B active params within experts), data efficiency is 3x lower than for a dense model, and for Llama-3-405B the compute-optimal amount of data was 40 tokens per param. This means that about 120 tokens per param would be optimal for Behemoth, and in fact it has 104 tokens per active param, so it's not overtrained. ↩︎
17B active params, 22T tokens, which is 2.25e24 FLOPs by 6ND, and 1300 tokens per active param. It's a weird mix of dense and MoE, so the degree of its sparsity probably doesn't map to measurements for pure MoE, but at ~1:23 sparsity (from 400B total params) it might be ~5x less data-efficient than dense, predicting ~200 tokens per param as compute-optimal, meaning 1300 tokens per param gives ~7x overtraining. ↩︎
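For anyone who wants to check the arithmetic, here is a minimal sketch of the 6ND estimates used in both footnotes. The parameter and token counts are the ones quoted above; everything else is just the rule of thumb, so treat the output as a rough reproduction rather than anything official.

```python
# Minimal sketch reproducing the 6ND arithmetic in the footnotes above.
# N = active parameters, D = training tokens; training compute ≈ 6 * N * D.

def six_nd(active_params: float, tokens: float) -> float:
    """Approximate training compute (FLOPs) via the 6ND rule of thumb."""
    return 6 * active_params * tokens

# Behemoth: 288B active params, 30T tokens
print(f"Behemoth: {six_nd(288e9, 30e12):.2e} FLOPs, "
      f"{30e12 / 288e9:.0f} tokens per active param")
# -> ~5.18e25 FLOPs, ~104 tokens per active param

# Maverick: 17B active params, 22T tokens
print(f"Maverick: {six_nd(17e9, 22e12):.2e} FLOPs, "
      f"{22e12 / 17e9:.0f} tokens per active param")
# -> ~2.24e24 FLOPs, ~1294 tokens per active param, roughly 6-7x the
#    ~200 tokens/param estimated as compute-optimal in footnote [2]
```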
Llama Scout (17B active parameters, 16 experts, 109B total) and Llama Maverick (17B active parameters, 128 experts, 400B total), released on Saturday, look deeply disappointing. They are disappointing on the level of ‘people think they have to be misconfigured to be this bad,’ with people wondering and debating how aggressively the benchmarks were gamed.
This was by far the most negative reaction I have seen to a model release, the opposite of the reaction to Gemini 2.5 Pro. I have seen similarly deeply disappointing and misleading releases, but they were non-American models from labs whose benchmarks and claims we have learned not to take as representing model capabilities.
After this release, I am placing Meta in that category of AI labs whose pronouncements about model capabilities are not to be trusted, that cannot be relied upon to follow industry norms, and which are clearly not on the frontier. Until they show otherwise, they clearly do not belong in the category that includes OpenAI, Anthropic, Google, xAI and DeepSeek.
Llama We Doing This Again
Meta released the first two Llama 4 models last Saturday, and there is a code change indicating that the original plan was to do it Monday and it got moved up. In general, releasing on a Saturday is such a bad strategy it simply isn’t done. Zuck says ‘that’s when it was ready’ but that is not an explanation.
People are wondering why Meta made an exception and did it anyway. I have two hypotheses for what happened (note: I do not have any private information here).
Those two look to be at cross-purposes, but I’m not so sure. Suppose, for the sake of argument here, that Llama-4 sucks.
Who knows. That’s all speculation.
What I do know is that the Llama 4 models released so far seem to not be any good.
You can download Llama 4 Scout and Maverick at Hugging Face or from llama.com. You can try it on the web, or within Meta’s products.
Llama the License Favors Bad Actors
They offer a Llama license, which is rather obnoxious, restricting large companies from using it and requiring prominent acknowledgment of Llama’s use, including putting ‘Llama’ in the title and adhering to the ‘acceptable use policy.’
Putting such requirements on otherwise open weight models gives an advantage to overseas companies and governments, especially the PRC, that can and will simply ignore such rules, while handicapping American companies.
European companies are of course handicapped even more: they literally are not given a license at all. Blame whoever you want for that part.
Not cool. Be open, or be closed.
This may be part of a consistent pattern. We just saw this story by Allan Smith that Sarah Wynn-Williams, a former Facebook employee, will testify before Congress today that Meta executives undermined U.S. national security and briefed Chinese officials on emerging technologies like artificial intelligence. I don’t know if this is true, but ‘Meta has been cooperating with China for ordinary business reasons’ might be the explanation for a lot of its AI decisions.
If the models were good, this would potentially be a rather big deal.
Llama You Do It This Way
In terms of techniques used, I take their announcement post to be ‘I hear you like mixture-of-experts LLMs and scaling up, so I got you some scaled-up MoEs to go with your scaled-up MoEs.’ This includes the size in parameters and also the amount of data.
I would take Meta’s outright statement of ‘newest model suite offering unrivaled speed and efficiency’ as an almost certainly false claim, as is the following quote from them. As in, they are sufficiently false as to downgrade my trust in Meta’s claims, which was never all that high.
That’s a bold claim. Feedback does not back this up.
The two features they do offer are support for 200 languages, and in theory a long context window. I say in theory because it’s easy to offer long context so you can tout it, and hard to make that long context do anything useful while preserving performance. Needle-in-a-haystack is not a good measure of practical use here. To skip ahead to one private benchmark that does try to use that long context, Fiction.live: the results are historically bad, the worst they’ve ever seen, even at 60k.
Meta offers some benchmarks, which many noted seem cherry-picked, and they also select their own competition.
Anyone keeping up with LLM progress can see the choices here are a little suspicious.
Artificial Analysis confirms the scores, but only on the benchmarks Meta chose.
The Llama models are giant mixture-of-experts (MoE) models, similar to (and presumably inspired by, or copying) DeepSeek’s v3 and r1. Scout is 17B active parameters, 16 experts, 109B total. Maverick is 17B active, 128 experts, 400B total. The unreleased Behemoth is huge: 288B active, 16 experts and 2T total parameters.
That means that while they are optimized to run fast on an H100, they can’t be run at all on a 4090 GPU or similar consumer hardware, which negates one of the big advantages of open models. I presume I can run Scout and Maverick (quantized) on my Mac Studio, and I might well do that, but that’s a hefty ask.
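To put rough numbers on the hardware question, here is a minimal memory sketch, counting only weight storage at a given quantization width and ignoring KV cache and runtime overhead (which add a real margin on top); the parameter counts are the ones above, the rest is my own arithmetic.

```python
# Rough weight-memory estimate: parameters times bytes per parameter, ignoring
# KV cache, activations, and quantization overhead (which add a real margin).

def weight_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB at a given quantization width."""
    return total_params * bits_per_param / 8 / 1e9

for name, params in [("Scout", 109e9), ("Maverick", 400e9)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_gb(params, bits):.0f} GB")

# Scout @ 4-bit:    ~55 GB  -- no chance on a 24 GB RTX 4090, plausible on a
#                              high-memory Mac Studio or multi-GPU workstation.
# Maverick @ 4-bit: ~200 GB -- effectively datacenter (or maxed-out Mac Studio)
#                              territory before you even count the KV cache.
```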
Llama Fight in the Arena
The strongest data point in Llama 4’s favor is the Arena ranking of 1417. That is good for second place, which is indeed impressive if it is reflective of general performance.
Alas, as we all know by now, Arena is being used as an optimization target. Was that done here? We don’t know.
Other signs, like the selective benchmarks they released, are suggestive of such a strategy, and they would be far from the only ones. Janus asks what other than Goodharting explains the rise in Arena ratings for new models. I think that’s definitely a lot of it, whether for Arena itself or for things that aren’t actually Arena but are highly correlated with it.
What does Arena optimize for? A random internet user prefers your response to another model’s response.
What makes people prefer one response to another? We can also look at the actual responses and see for ourselves, now that Arena has released the answers for review.
These answers by Maverick, that users voted for, seem absurdly obnoxious and bad. I originally wrote ‘these make me want to puke,’ erased it, but now that I see TDM saying the same thing I’m putting that observation back in. This is the opposite of what I want.
And indeed, this also potentially explains Claude Sonnet 3.7’s low Arena ranking. What if people really do prefer sycophancy and lengthy slop? It exists for a reason.
It’s clear Llama-4 fell victim to Goodhart’s Law, either to Arena rankings directly or to some other similar ranking process they used in fine-tuning.
We also know that this version of Maverick on Arena is not the same as the one they released, and it seems, shall we say, ‘slopified.’
The question is, is that all that happened? Did they also outright cheat to get this Arena ranking? I opened a Manifold market; unfortunately we will likely never know for sure, but I figured something was better than nothing here. Suggestions for better resolution methods are welcome. When I say ‘cheating’ I mean something beyond ‘a version optimized to do well on Arena.’ I mean actual outright cheating.
Llama Would You Cheat on Other Benchmarks
Did they flat out cheat?
There’s at least one claim that they did fix that partly via cheating; obviously take it with tons of salt given the sourcing.
I wouldn’t think Meta would go this far, for the same reasons as Peter, so I doubt it happened. Nor would they have had to go this far. You actually have to work hard to not accidentally de facto train on benchmarks when using 22T+ tokens.
So while I’m quoting the post for posterity, I assume this accusation is probably false.
There is however this:
It doesn’t look great. Here it is in an easier-to-read form:
That sure looks like cheating. Again, it doesn’t mean they intentionally trained on the test set. If you have 22T+ tokens and throw the entire internet at your model, there’s going to be contamination. All you have to do is not sufficiently care about not training on benchmarks. Alternatively, you can hill-climb on your test scores.
Previously, I would have doubted Meta would let this happen. Now, I have less doubt.
This would not be the first time Meta has broken similar norms.
Those sources seem rather illegal. Meta don’t care. What are you going to do about it?
It is 2025. In general, ‘[X] would go against norms’ is no longer seen as so strong an argument against doing [X]. The question is now: if I do [X], yes it is against norms, but even if you figure out that I did that, what are you going to do about it?
That goes double for ‘not doing enough to prevent [X] would go against norms.’
Llama It So Bad on Independent Benchmarks
This is everything I could find that plausibly counts as a benchmark. There are some benchmarks where Maverick is mid, others where it is less than mid.
I don’t know if ARC-AGI counts as ‘independent benchmarks’ but Maverick scored 4.38% and Scout 0.5% on ARC-AGI-1 and both got 0.00% on ARC-AGI-2.
On Livebench, Llama 4 Maverick does relatively okay with a 54.38, right behind DeepSeek R1 Distill Llama 70B and Gemini 2.0 Flash.
Here are the Lech Mazur benchmarks.
Extended Word Connections (which is de facto a reasoning benchmark):
On Confabulations it gets a 22.6, which is rather not good:
On Creative Writing Llama Maverick bombs super hard; the Llamas are the three bars on the left:
In the Elimination game, things again don’t go great.
It also does not do well in Thematic Generation or Step-Game Battles where even Llama 3.3 70B kicks its ass, as does almost everything else.
BigCodeBench didn’t go great, although Llama-4-Maverick did marginally beat out Gemma-3-27B.
Markus Zimmerman reports results for DevQualityEval v1.0, and they ‘do not look good’; the models sit more than halfway down a very long chart of only open models.
Håvard Ihle is here with WeirdML, where Maverick lands in the middle, doing pretty well relative to other benchmarks.
In general, if you have your own benchmark, it doesn’t look good:
This was Aider polyglot:
Here’s that positive viewpoint, from xjdr, clearly in the context of open models only, essentially saying that Maverick is a specialized model aimed in particular at agentic and tool-calling work, and that for that purpose it is good:
I suppose that is possible. Perhaps it has its niche and will be good at that niche once people adapt to it and scaffold it well. But that’s definitely not how Meta is presenting Maverick or the future Behemoth.
It’s weird to call it a ‘benchmark’ but worth noting that Llama 4 Scout and Maverick did not exhibit alignment faking in a new test.
Another sort-of benchmark would be red teaming, done here by Virtue AI. Alas, their tests seem to be against mundane risks only. They find that Llama 4 is significantly less compliant with AI regulations than Claude 3.7 or GPT-4.5, ‘lagging behind peers,’ and evaluations show ‘noticeable weaknesses’ against mundane harms, despite what they call ‘Maverick’s caution dilemma’ and false refusals.
That is distinct from asking about misuse, malicious fine-tuning or other sources of potential catastrophic risk from an open weights model – as always, ‘the license says you cannot do that’ is going to get ignored here. One presumes that the main defense is that these models lack the capability to cause new trouble here, at least in the absence of Behemoth.
Llama You Don’t Like It
Or, here is what people are saying in other realms.
I’ve had confirmation of Yair’s statement from other reliable sources.
Who was the biggest cheerleader that doesn’t work at Meta?
Failure to mention Gemma 3 feels like strong mood affiliation, on top of the marketing issues. Google is known as a closed lab, Meta is known as open. But mainly yes, Google’s marketing is atrocious. Still, a claim that Gemma 3 put us back in the lead was a lot more defensible than one about Llama 4.
The Llama tokenizer is a place you might fear to tread.
Jim Fan says ‘Llama-4 doesn’t disappoint’ but his response seems entirely based on Meta’s claims and reports rather than any independent assessment of performance.
All general reports on feedback say that people are disappointed. It was so disappointing that mostly people treated it as a non-event until asked.
Along similar lines, Alexander Doria doesn’t see much point in giving 40T tokens to Llama-4 Scout, and 22T to Llama-4 Maverick.
I don’t think this means model and data size scaling are over. I think it means that if you do not know how to execute, sheer size will not save you, and probably gives you smaller marginal gains than if you executed well.
Llama Should We Care
The big takeaway is that we have to downgrade expectations for Meta in AI, and also our expectations for how much we can trust Meta.
Despite vastly superior resources, Meta now seems to be trying to copy DeepSeek and coming up short. Exactly how short depends on who you ask. And Meta is, to an unknown degree, making a deliberate effort to make its models look good on benchmarks in ways that violate norms.
It is hard to count out a top tech company with tons of compute and almost endless capital. They could still turn this ship around. But they’re going to have to turn this ship around, and do it fast, if they want to be competitive.
Right now, America’s open model champion isn’t Meta. It is Google with Gemma 3, and it may soon also be OpenAI, which is planning an open reasoning model. I realize that causes some dissonance, but that’s where we are. Beware mood affiliation.