Every government currently funding an AI national champion is funding the wrong thing. Not intentionally, mostly. But functionally. The right thing costs about 1% as much and would work roughly 100x better. It sounds too cheap to be serious, which is why it is underfunded, and serious people in policy have confused "expensive" with "important."
The claim is simple. Countries cannot build frontier models, will not build frontier models, and should stop pretending. What they can do, for a rounding error relative to what they are currently spending, is build the benchmarks that frontier labs optimize against, and then use their regulatory authority to make those benchmarks gate market access. Once you have a good benchmark with teeth, every frontier lab in the world will spend billions of dollars of their own compute improving against it, for free, because that is what frontier labs do. You have effectively conscripted OpenAI, Anthropic, Google DeepMind, Meta, and xAI into working on your priorities. You paid under a million. They paid billions. This has happened already, repeatedly, on benchmarks that cost approximately nothing, and it is happening right now.
Compute sovereignty is a security blanket for policymakers who don't understand the industry. It buys press releases, ribbon-cuttings, and job announcements. It does not buy strategic leverage over what AI becomes. Benchmarks do, and the mechanism by which they do it, procurement and regulatory gating, is the step most "sovereign AI" strategies skip.
A note on compute governance more broadly. The compute-governance literature (Heim and GovAI, Shavit, the Biden-era chip export controls) argues that hardware is the primary regulatory lever. I agree, for the US, and less clearly for China. Hardware controls are not available to everyone else. The French, Canadian, Indian, and Japanese policymakers who cannot restrict Nvidia's customer list are therefore spending on sovereign compute as a consolation prize. This essay is addressed to them.
The bonfire
Let me catalogue the waste. The EU announced €200B under InvestAI. Once you read the Commission's own documents, €150B is a voluntary industry pledge, €20B is earmarked for up to five "AI gigafactories," and about €7B is committed public money mostly rebadged from the existing EuroHPC program. A twenty-nine-to-one ratio between headline number and actual new public euros.
Canada's Sovereign AI Compute Strategy has ballooned to a CAD 2.4B package, with a staggering CAD 890M officially allocated in April 2026 just to design, build, and operate a single national AI supercomputer. India's IndiaAI Mission, March 2024: INR 10,372 crore (about USD 1.25B), with the largest chunk earmarked for subsidised GPU capacity and "indigenous foundation models." Safe and Trusted AI initially got INR 20.46 crore, smaller than the overheads line. By early 2026, execution bottlenecks and procurement delays meant actual utilisation sat near just INR 800 crore, prompting the government to slash the upcoming annual allocation in half.
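For the skeptical, a rough sketch of that arithmetic, using only the figures quoted above (the rounding is mine):

```python
# Back-of-the-envelope check on the headline-vs-committed ratios cited above.
# All figures are the rounded numbers quoted in this section, not independent estimates.

eu_headline_eur_bn = 200   # InvestAI headline
eu_public_eur_bn = 7       # committed public money, mostly rebadged EuroHPC
print(f"EU headline-to-public ratio: {eu_headline_eur_bn / eu_public_eur_bn:.0f}:1")
# -> roughly 29:1

india_allocation_crore = 10_372   # IndiaAI Mission allocation, March 2024
india_utilised_crore = 800        # approximate utilisation by early 2026
print(f"IndiaAI utilisation: {india_utilised_crore / india_allocation_crore:.0%}")
# -> roughly 8% of the headline allocation actually spent
```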
The UAE has G42 and Falcon. France has Mistral. Japan has ABCI and GENIAC. Saudi Arabia has HUMAIN. Singapore has SEA-LION. South Korea has a KRW 735T plan.
The plan is always the same: buy GPUs, fund a national champion, train a foundation model in your language, declare victory, go home.
This is fighting the previous war, badly. Anthropic's Claude Mythos, confirmed in late March 2026 after a CMS leak and previewed via system card on April 7, is reportedly a sparse MoE in the 10T-total-parameter range. OpenAI's Spud, reportedly near launch, is in the same class, trained on Blackwell chips at 100k-500k H100-days of pretraining compute. Independent cost estimates for runs at this scale are USD 5-15B per model. Ongoing inference infrastructure for a single frontier player costs hundreds of millions per month. OpenAI killed Sora in March 2026 specifically to free GPUs for Spud. That is the frontier.
A one-billion-dollar national champion does not catch that. Nvidia and partners announced a USD 500B US data center push in 2025; Stargate's target is in the same order of magnitude. The US private sector outspends every sovereign AI program in the world combined, every year. Spending 1-2% of what the frontier spends, all of it on a single model, is not a plan to match the frontier. It is a plan to have a press conference.
The fungibility problem makes it worse. Compute is compute. A GPU cluster in Quebec runs the same CUDA kernels as a GPU cluster in Texas. A "national foundation model" trained on French data is, right now, worse at French than Mythos or GPT-5.4 out of the box, because the frontier labs already scraped every French corpus worth scraping. A billion dollars buys a model that is, today, worse at the target language than the frontier defaults, and the gap widens every training cycle.
What actually moves the frontier
Hold the catalogue of sovereign AI spending in mind, then consider the cost of the benchmarks that have actually shaped frontier development.
MMLU. Dan Hendrycks et al., September 2020. 15,908 multiple-choice questions across 57 subjects, built by grad and undergrad students scraping exam materials off the internet. Construction cost, charitably, tens of thousands of dollars of RA time. GPT-3 scored 43.9%. Today's frontier models are above 92%. Every major launch since 2021 (GPT-4, GPT-5, Claude 3 through Opus 4.6, Mythos, Gemini 1.5 through 3.1, Muse Spark) leads its announcement with MMLU or MMMLU. A benchmark a PhD student helped put together is the center of gravity for every trillion-parameter model release, five years running.
HumanEval. Mark Chen et al. at OpenAI, July 2021. 164 hand-written Python problems. GPT-3 scored 0%. By 2025 o1 was at 96.3% pass@1. HumanEval is now saturated. But in the four-year window it defined what "good at code" meant, and every major coding-agent effort oriented around it.
SWE-bench. Princeton, 2023. SWE-bench Verified curated August 2024. OpenAI's own retrospective calls it a strong signal of capability progress that became a standard metric in frontier model releases, and explicitly ties it to the Preparedness Framework, OpenAI's internal process for deciding when a model is dangerous enough to warrant additional scrutiny. A benchmark designed by a few academics is directly shaping when OpenAI pauses its own deployments. Scores went from ~20% in early 2024 to >80% in 2025. Every coding-agent startup pitched SWE-bench to investors.
ARC-AGI. François Chollet, 2019. Sat near zero for four years. o3 scored 75.7% at the $10k compute limit in December 2024, 87.5% at 172x that. Chollet's summary: ARC-AGI-1 took four years to go from 0% with GPT-3 in 2020 to 5% with GPT-4o in 2024, and o3 resets the intuitions. OpenAI's o3-preview event led with ARC-AGI as the headline number.
Humanity's Last Exam. CAIS and Scale AI, late 2024. A $500k prize pool to crowdsource 2,500 expert questions. January 2025: GPT-4o at 2.7%, Claude 3.5 Sonnet at 4.1%. February 2026: Gemini 3 Pro Preview 37.5%, GPT-5 Pro 31.6%. Less than a year, an order-of-magnitude gain, direct targeting by every frontier lab. Under USD 2M in construction cost, hundreds of millions in redirected R&D.
FrontierMath. Epoch AI, 350 research-level problems, 60+ mathematicians including Tao, Gowers, Borcherds. OpenAI commissioned most of it and now owns exclusive access to the problems and solutions minus a 50-question holdout. This is actually the cleanest evidence for the thesis: FrontierMath was so valuable to OpenAI that they paid to own it. The revealed preference of the industry's largest lab is that benchmarks are worth buying.
The pattern is mechanical. A benchmark costs, at the absolute high end, single-digit millions. The lab-side compute redirected at beating it is three to four orders of magnitude larger. The capability gains accrue to whoever articulated what capability they wanted.
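To make the orders-of-magnitude claim concrete, a rough sketch with illustrative numbers; the benchmark cost and the redirected lab spend are assumptions drawn from the ranges above, not measured figures:

```python
import math

# Orders-of-magnitude sketch of the leverage claim: benchmark construction
# cost versus lab-side compute redirected at beating it. Illustrative only.
benchmark_cost_usd = 5e6   # high end of "single-digit millions"
lab_compute_usd = 5e9      # lab-side spend plausibly steered by a headline benchmark

ratio = lab_compute_usd / benchmark_cost_usd
print(f"leverage: {ratio:,.0f}x (~{math.log10(ratio):.0f} orders of magnitude)")
# -> 1,000x at these assumptions; four orders of magnitude if the redirected
#    spend across labs and training cycles is closer to USD 50B.
```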
Why this works
The mechanism is simple. Labs compete for prestige and narrative, they need a story on announcement day, and the cheapest story is a number. That number has to come from a benchmark the research community already takes seriously. Labs do not have discretion over which benchmarks get used. They pick from what exists, and they cannot ignore a well-designed, widely-cited one without looking evasive.
What makes this particularly high-leverage for a country is that it does not require any of the things sovereign AI programs currently fail at. You do not need to match lab compute. You do not need state-of-the-art researchers. You do not need Nvidia quota. You need a few hundred domain experts, a coordination team, a legitimate brand, and careful methodology. Thailand cannot out-train Anthropic at Thai. Thailand, with a benchmark co-signed by its top universities, medical board, legal system, and a dozen expert question-writers, can make it embarrassing for Anthropic to launch Mythos-2 without a Thai score. Once that norm exists it stays. Every subsequent training run will include more Thai data because the labs know they will be graded on it. Whoever builds the benchmarks is setting the prompt for a multi-hundred-billion-dollar industry.
This is already happening. For the GPT-4o system card, OpenAI partnered with external researchers to build evaluations in Amharic, Hausa, Northern Sotho, Swahili, and Yoruba (Uhura-Eval). ARC-Easy-Hausa went from 6.1% with GPT-3.5 Turbo to 71.4% with GPT-4o. OpenAI did not build these because they woke up feeling altruistic about West Africa. They built them because the benchmarks existed and external researchers pushed. If Nigeria, Kenya, Ethiopia, or Tanzania had invested USD 10M in rigorous local-language evaluations across medicine, law, civic knowledge, and culture, those benchmarks would be in every subsequent system card. They did not, so they are not.
C-Eval, built by HKUST and Shanghai Jiao Tong academics, is in virtually every Chinese model launch and most Western models' multilingual sections. SEA-HELM, AI Singapore plus Stanford CRFM, covers Filipino, Indonesian, Tamil, Thai, Vietnamese, and is the Southeast Asian standard for open-weight models. Belebele (Meta, 122 languages) is in multiple system cards. Academic artifacts, built cheaply, defining the sign-off criteria for a trillion-dollar industry.
Does benchmark improvement translate into real capability?
Before the how-to, the objection I take most seriously. It is one thing for a benchmark to move a leaderboard, another for it to produce a model that is genuinely more useful for, say, Thai clinicians. The transfer from benchmark score to real-world capability is not automatic. SWE-bench Verified has contamination problems severe enough that OpenAI in 2026 formally recommended labs stop reporting it. Models got better at the benchmark through a mix of genuine capability gains and specific optimization for the evaluation's quirks, and the benchmark stopped being trustworthy as a frontier signal.
This is the classic Goodhart worry, and it is correct as far as it goes. Two things to say about it.
First, the transfer is imperfect but not fictional. Models that saturated HumanEval are not just better at HumanEval-style problems; they are the models now writing production code at Stripe, Shopify, and GitHub. MMLU saturated, and the models that saturated it are in fact dramatically more knowledgeable than GPT-3. The complaint that benchmarks eventually stop discriminating between frontier models is correct and is not the same as the complaint that they do not drive real capability. A sovereign benchmark does not have to produce a perfect medical model. It has to produce a model that is significantly better than the model that would have existed without it. That bar is achievable and has been cleared across the examples above.
Second, benchmark quality is the lever. Contamination-resistant design, private held-out sets, rotating question pools, adversarial filtering, expert authorship over scraped material: these are the differences between a benchmark that produces real capability gains and one that produces Goodharted noise. A sovereign program that takes this seriously is strictly better than one that does not, and this is where government-backed institutional rigor has a genuine advantage over one-shot academic releases.
The honest version of the thesis is: a good sovereign benchmark buys a meaningful capability improvement in your domain plus a seat at the table for how that capability is evaluated. Not a perfect model. A better one, and a voice.
What a sovereign benchmark strategy actually looks like
If I were running AI policy for France, Canada, India, or anyone sitting on a billion-dollar pot wondering what to do, the plan would be roughly this.
Pick three or four domains where national sovereignty actually means something. For most countries: the national language and its cultural context, the legal system, the medical system and public health standards, and extreme-risk safety and national security. Ignore the general-purpose frontier. You will not beat labs running 10T-parameter sparse MoEs on hundreds of thousands of Blackwells.
Build hard, contamination-resistant, prestigious, legitimate benchmarks. Prestige is the word that matters most here. A benchmark that does not get adopted does not steer anything, and adoption depends on brand, methodology, and enough difficulty that labs find it useful as a differentiator. Private held-out sets, otherwise it gets gamed. Paid domain experts. Co-signed by respected institutions: national academies, courts, medical boards, ministries of education. Launch with a splashy results paper evaluating every frontier model so the labs feel immediate pressure. Private test set via API, public leaderboard. HLE and FrontierMath are the reference architecture.
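For concreteness, here is a minimal sketch of what the private-test-set-via-API pattern looks like in code. It is entirely hypothetical: the names, the grading scheme, and the returned fields are placeholders, not any existing benchmark's interface.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Item:
    prompt: str
    answer: str        # held out: never leaves the evaluation service

PRIVATE_SET: list[Item] = [
    # ... loaded from a secured store; questions refreshed annually
]

def evaluate(model: Callable[[str], str], subset: list[Item]) -> dict:
    """Run the model on the private set and return aggregates only.

    Item-level results stay inside the service so the test set cannot be
    reconstructed from repeated submissions.
    """
    correct = sum(model(item.prompt).strip() == item.answer for item in subset)
    return {
        "n_items": len(subset),
        "score": correct / max(len(subset), 1),
        # only this aggregate goes to the public leaderboard;
        # raw transcripts are not returned to the submitter
    }

# A lab submission is just a callable; only the aggregate dict comes back:
# result = evaluate(lambda prompt: my_model.generate(prompt), PRIVATE_SET)
```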
Commit to maintenance. Benchmarks rot. MMLU saturated. SWE-bench Verified has the contamination problems noted above. A sovereign benchmark program needs a decade-long institutional home with committed funding to refresh questions, audit contamination, and release harder versions. This is the kind of slow, unglamorous institution-building that the Science of Evals agenda has been calling for. Government is unusually good at it, private labs have no incentive to do it, and this is precisely where sovereign effort should go.
The mechanism: procurement as forcing function
This is where sovereignty gets real, and where most writing on this topic goes quiet. A benchmark that exists on a leaderboard is a curiosity. A benchmark that gates market access is a standard. The leverage comes from the second, not the first.
Consider the French case concretely. The Haute Autorité de Santé already certifies medical devices and, increasingly, medical software. Suppose HAS, in partnership with a consortium of CHU hospitals and the French medical academies, commissions a French-language medical benchmark: 3,000 questions across diagnostic reasoning, French clinical guidelines, French pharmacological nomenclature, and French medico-legal context. The benchmark is built by paid clinicians, held out privately, refreshed annually. Scores for every frontier model are published on release, evaluated through the private API.
That is step one, and on its own it is valuable but limited. Step two is the one that creates leverage: HAS issues a regulatory opinion that any AI system marketed for clinical decision support in France must clear a specified score on the benchmark, with the threshold updated every two years as frontier capability advances. The CNAM (national health insurance) will not reimburse software-assisted diagnosis that has not been evaluated. Hospital procurement frameworks reference the score.
Now the incentives invert. OpenAI and Anthropic want the French clinical market. The French clinical market is gated by the benchmark. Therefore their next training run prioritizes French medical capability. The benchmark has become a forcing function on multi-billion-dollar training budgets, at a construction cost of perhaps USD 10-20M and an ongoing maintenance cost of USD 2-3M per year.
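A sketch of what the gating rule might look like as policy-as-code. The threshold values and dates are illustrative assumptions, not anything HAS has published:

```python
from datetime import date

# Hypothetical ratcheting threshold: the regulator publishes a minimum
# benchmark score for clinical-decision-support AI and revises it every
# two years as frontier capability advances.
THRESHOLD_SCHEDULE = [
    (date(2027, 1, 1), 0.70),
    (date(2029, 1, 1), 0.80),
    (date(2031, 1, 1), 0.85),
]

def current_threshold(today: date) -> float:
    """Return the minimum passing score in force on a given date."""
    applicable = [t for start, t in THRESHOLD_SCHEDULE if start <= today]
    return applicable[-1] if applicable else THRESHOLD_SCHEDULE[0][1]

def market_access(benchmark_score: float, today: date) -> bool:
    """Would a system with this score clear the gate for clinical deployment?"""
    return benchmark_score >= current_threshold(today)

# Example: a model scoring 0.78 clears the 2027 gate but not the 2029 one.
assert market_access(0.78, date(2027, 6, 1))
assert not market_access(0.78, date(2029, 6, 1))
```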
The EU AI Act gestures at this architecture through its codes of practice, but the benchmarks those codes point at are the wrong ones: general English-language capability evaluations that are already saturating. A country serious about sovereignty uses its regulatory authority to bind labs to the benchmarks it actually cares about, in its language, on its priorities. Germany could do this for civil law. Japan could do this for pharmaceutical approvals. India could do this for Indian-language customer service in financial services, a market large enough that no lab can ignore it. Brazil could do this for Portuguese-language administrative law. Indonesia could do this for Indonesian-language Islamic finance compliance. Each of these is a market large enough and regulated enough that gating it on a benchmark score redirects frontier training runs.
Without the regulatory step, the strategy is "build benchmarks and hope." With it, the strategy is "we control access to a consequential market, and here is how access is priced." This is the only version that scales.
Total program cost, including the regulatory apparatus: less per year than a single 100MW sovereign training cluster. Strategic leverage: higher by roughly two orders of magnitude.
The objections
Goodhart and saturation. Every benchmark I cited eventually saturated or got contaminated. MMLU is now useless as a frontier discriminator. SWE-bench Verified has the contamination problem. ImageNet stopped differentiating models in 2017. The recent LW post "We're actually running out of benchmarks to upper bound AI capabilities" makes the case that this is now happening for frontier safety evaluations too. From a safety-upper-bounding perspective this is a crisis. From a sovereign-capability-steering perspective, it is not an argument against the strategy; it is the strategy. The reason MMLU stopped being useful is that the labs successfully optimized for the capability MMLU was designed to measure. That is the point. Plan for a 5-7 year useful life per benchmark and continuous replacement. The saturation problem is the program working as intended.
Sovereignty means jurisdictional control, not capability. A French benchmark does not put weights in France and does not stop the US government from leaning on American labs. True, and this is the only real case for any sovereign compute. But sovereign compute without sovereign capability is just expensive hardware, and sovereign capability, in the sense of being able to shape what models do in your jurisdiction, is what benchmarks plus procurement deliver. Some sovereign serving infrastructure makes sense for the national-security tail. A sovereign frontier lab does not.
Selection bias. Canonical benchmarks tend to come from a small set of prestige institutions, mostly the Hendrycks/Steinhardt/Chollet extended network and the big US labs. A Thai or Nigerian academic without that branding has a harder time getting picked up. True but not decisive. C-Eval, SEA-HELM, IndicXTREME, AraBench, and Nejumi all got meaningful adoption on significantly less than world-class branding. The prestige barrier is lower than it looks if you get the rigor, quality, maintenance, and launch right. Countries have all of these capabilities. They just do not deploy them on benchmarks because benchmarks do not make for good photo ops.
The existing ecosystem is crowded. Epoch AI, METR, Scale's SEAL, Apollo Research, MLCommons, the AISIs, and NIST-led CAISI are already doing serious eval work. There is room for sovereign entrants because the existing ecosystem is almost entirely Anglophone, US-based, and focused on frontier capability and catastrophic risk. Almost none of them have procurement authority. A sovereign benchmark program with the ability to gate market access is a different instrument from a leaderboard.
The ask
Minimum viable sovereign benchmark program for any middle power: USD 100M over five years. A dedicated institute with procurement authority. Formal partnership with one or two anchor universities. Legal mandate to work with medical boards, bar associations, exam authorities. Three flagship evaluations published within 36 months. Less than 5% of what Canada and India are currently spending on compute. The strategic impact is higher.
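The arithmetic behind the "less than 5%" claim, as a rough check; the CAD-to-USD conversion rate is my assumption, the other figures are the ones quoted in this essay:

```python
# Rough check of the "less than 5%" claim. The CAD->USD rate (~0.72) is an
# assumption; the other figures are the ones cited above.
canada_compute_usd_bn = 2.4 * 0.72   # Sovereign AI Compute Strategy, CAD 2.4B
india_compute_usd_bn = 1.25          # IndiaAI Mission, ~USD 1.25B
program_usd_bn = 0.1                 # proposed: USD 100M over five years

share = program_usd_bn / (canada_compute_usd_bn + india_compute_usd_bn)
print(f"program cost as a share of Canada + India compute spend: {share:.1%}")
# -> roughly 3-4%, comfortably under 5%
```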
Countries cannot out-spend the labs. They can tell the labs what to work on, but only if they have a legitimate voice, and a legitimate voice means benchmarks, standards, and the regulatory willingness to use them. A million-dollar benchmark that moves a ten-billion-dollar training run, backed by a procurement rule that moves market access, is the highest-leverage policy instrument in AI today. The instrument is boring, institutional, and a decade late relative to its importance. That is why it is so underpriced, and that is the opportunity.
Sovereign AI strategies that skip this and go straight to GPUs and national champions are, to put it bluntly, an expensive way to feel sovereign while outsourcing the actual selection of what AI becomes to someone else.