Our Experience Running Independent Evaluations on LLMs: What Have We Learned?

by MAlvarado
3rd Oct 2025

TL;DR

Independent evaluations are both possible and valuable. Our goal is to widen the conversation on decentralized, reproducible, context-aware evaluations as public infrastructure for AI oversight, especially in regions and languages that frontier work often overlooks.

Our recommendations (based on what actually worked for us):

  1. Treat evaluation like an experiment, not a leaderboard. Pre-decide your comparisons, report uncertainty, and document configs so others can replicate.
  2. Use repetitions when decoding stochastically. Two runs already cut most run-to-run noise; a third mainly tightens error bars.
  3. Aim for cost-aware rigor. Understanding how your setup behaves (what changes outcomes, what doesn’t) lets you get high-quality results without high costs or extra complexity—which is exactly why the “experiment” mindset matters.

If more small teams adopt this approach—local languages, transparent methods, light but reliable stats—we’ll get a healthier, more trustworthy evaluation ecosystem.

Why independent evaluations?

Most well-known benchmarks are built and interpreted by a small number of well-resourced actors—big labs, elite universities, and a few private platforms. That concentration helped the field move fast, but it also created blind spots.

What centralization gets wrong (in practice):

  • Narrow scope and linguistic bias. Benchmarks skew English-first. Capabilities that look strong in English are often treated as universally strong, while performance in other languages remains underexplored.
  • Incentives that don’t always line up with truth. When the same organizations build models and design the benchmarks, it’s easy (consciously or not) to optimize for the metric rather than for real-world reliability.
  • Opacity and weak reproducibility. Many influential datasets, leaderboards, and scoring pipelines aren’t fully open. That makes independent verification hard and slows down error-correction.
  • High barriers to entry. Running evaluations at scale typically requires compute, engineering time, and institutional support that many public-interest or regional groups don’t have.

We don’t think centralized efforts are “bad.” We think they’re insufficient on their own—especially as models show up in places where failure matters (education, healthcare, legal workflows, public services). At that point, evaluation stops being just a research nicety and becomes a governance question: who gets to define “good enough,” for whom, and under what conditions?

Why decentralized efforts matter (and work):

  • They surface local realities—languages, domains, and error modes that are invisible to English-centric tests.
  • They create checks and balances: independent pipelines make it harder to game results and easier to spot overfitting to a single benchmark.
  • They lower dependency risk by keeping data, prompts, and scoring scripts open and reproducible.
  • They invite participation from universities, meetups, nonprofits, and small labs that can contribute targeted signal—even with modest budgets.

Our own experience with AI4Math and our follow-up on repetitions convinced us that small teams can produce useful evaluations if they keep things simple and scientific: make problems that actually reflect your context; publish your configs; and treat the exercise like an experiment, not a leaderboard race. That approach lets us give cost-aware guidance without needing huge infrastructure.

This isn’t just for research groups. If you’re a practitioner deciding whether to deploy an LLM in a classroom, a clinic, or a back-office process, you need task-specific, context-specific evidence. A lightweight, experimental evaluation—designed around your real use case—can tell you far more than a global leaderboard ever will. Done right, it becomes standard operating procedure: define the task, test transparently, report uncertainty, and only then make the call.

Bottom line: Independent evaluations are both possible and valuable. They widen the lens, strengthen trust, and help the community converge on methods that travel: transparent setups, reproducible runs, and attention to the contexts where these systems actually live.

What Have We Built?

If centralized, English-first evaluations miss important contexts, the remedy is simple: build local, transparent, reproducible setups that others can copy. That’s what we did, first by creating a Spanish-native dataset, then by defining a clear comparison setup, and finally by running the whole thing as an experiment to learn how many repetitions you actually need.

  1. A Spanish-native, university-level math dataset
    We authored a 105-problem set across seven domains (Algebra, Calculus, Geometry, Probability, Number Theory, Combinatorics, Logic). Each problem has a unique final answer and a step-by-step human solution. The goal: items that are clear, auditable, and relevant to our context. See more at: https://arxiv.org/abs/2505.18978
  2. A clear, minimal evaluation scenario
    We compared models in a small, pre-defined set of configurations while holding prompts, decoding settings, and scoring rules constant. This keeps comparisons legible and easy to rerun.
  3. Experimental framing (how many runs do you really need?)
    Instead of a one-off leaderboard, we measured how results move across runs and picked the minimal repetition count that keeps findings reliable. In other words, we evaluated our evaluation and used that evidence to recommend a cost-aware default for repetitions (a minimal sketch of such a pilot follows this list). See more at: https://arxiv.org/abs/2509.24086
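
To make the "experiment, not leaderboard" framing concrete, here is a minimal sketch of what a repetition pilot under a fixed configuration might look like. Everything in it is a placeholder: the model names, the accuracy figures, and the simulated run_eval are hypothetical, not our actual pipeline. In practice run_eval would call your API client and grade answers against the dataset's unique final answers; the structure (one shared config, repeated runs, report the spread) is the point.

    import random
    import statistics

    # Hypothetical fixed configuration, shared by every model and every
    # repetition so that only stochastic decoding varies between runs.
    CONFIG = {
        "prompt_template": "Solve the problem and give only the final answer:\n{problem}",
        "temperature": 0.7,   # stochastic decoding is why repetitions matter
        "max_tokens": 1024,
        "n_repetitions": 3,   # pilot value; the spread below tells you if fewer suffice
    }

    def run_eval(model: str, n_problems: int, seed: int) -> float:
        """Stand-in for one full evaluation run: simulates per-problem success.
        A real pipeline would query the model and grade its answers."""
        rng = random.Random(f"{model}-{seed}")
        true_accuracy = {"model-a": 0.62, "model-b": 0.58}[model]  # hypothetical
        return statistics.mean(rng.random() < true_accuracy for _ in range(n_problems))

    def repetition_pilot(models: list[str], n_problems: int = 105) -> None:
        """Repeat each model under the fixed config and report run-to-run spread,
        so the repetition count is chosen from evidence rather than habit."""
        for model in models:
            scores = [run_eval(model, n_problems, seed=rep)
                      for rep in range(CONFIG["n_repetitions"])]
            print(f"{model}: mean={statistics.mean(scores):.3f}  "
                  f"run-to-run stdev={statistics.stdev(scores):.3f}  "
                  f"runs={[round(s, 3) for s in scores]}")

    if __name__ == "__main__":
        repetition_pilot(["model-a", "model-b"])

The output of a pilot like this is what told us that two repetitions already removed most of the run-to-run noise in our setup; your numbers may differ, which is exactly why the pilot is worth running before you scale.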

Our Findings

Here are the practical recommendations we now follow (and suggest to others). They’re written so you can consider implementing them directly if you’re doing something similar.

  • Calibrate repetitions before you scale. First run a small pilot to see how much results jump between runs, then set your repetition count from evidence. In our setup, single-run leaderboards were shaky; averaging two runs removed most of the rank wobble, and a third pass mainly tightened error bars a little further. That was our result; your task, models, and decoding may behave differently—so measure first, then spend.
  • Lead with uncertainty, not just a sorted table. Even when some ranks moved across runs, our pre-specified “who beats whom” decisions didn’t change. Share point estimates with uncertainty and make claims at that level; one way to do this is sketched after this list.
  • Keep the setup simple and fixed. Pick a small set of clear conditions (e.g., language × with/without step-by-step) and hold prompts, decoding, and scoring rules constant. It makes comparisons legible and easy to rerun.
  • Stress-test the right skills. In our math benchmark, Geometry, Combinatorics, and Probability stayed tough across models—good places to probe limits. Adapt this idea to the hard parts of your task.
  • When models are close, don’t overclaim. Treat near-ties as “similar for now,” and decide with task fit, cost, and deployment constraints in mind. Our results reinforce that ranks can shift while substantive comparisons remain steady.
  • Document what others need to rerun you. Save and share seeds, prompts, decoding settings, and the scoring script so your results can be checked or extended.
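
As one concrete way to "lead with uncertainty," here is a sketch of a percentile bootstrap over per-item scores, plus a paired resampling check of a pre-specified "A beats B" decision. It is illustrative only: the per-item vectors are made up, and this is not the analysis code from our papers, just one reasonable way to implement the idea.

    import random
    import statistics

    def bootstrap_ci(per_item, n_boot=2000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for accuracy from per-item 0/1 scores."""
        rng = random.Random(seed)
        n = len(per_item)
        means = sorted(statistics.mean(rng.choices(per_item, k=n)) for _ in range(n_boot))
        return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

    def prob_a_beats_b(scores_a, scores_b, n_boot=2000, seed=0):
        """Paired bootstrap: resample the same problem indices for both models
        and report how often A's accuracy exceeds B's. Values near 0.5 mean
        'similar for now'; values near 1.0 support a directional claim."""
        rng = random.Random(seed)
        indices = range(len(scores_a))
        wins = 0
        for _ in range(n_boot):
            sample = rng.choices(indices, k=len(scores_a))
            if (statistics.mean(scores_a[i] for i in sample)
                    > statistics.mean(scores_b[i] for i in sample)):
                wins += 1
        return wins / n_boot

    # Hypothetical per-item correctness over a 105-problem set (1 = solved).
    model_a = [1] * 68 + [0] * 37
    model_b = [1] * 60 + [0] * 45
    print("Model A accuracy:", statistics.mean(model_a), "95% CI:", bootstrap_ci(model_a))
    print("P(A beats B under resampling):", prob_a_beats_b(model_a, model_b))

Publishing a script like this, together with the seeds, prompts, and decoding settings it was run under, is essentially what the last bullet above ("document what others need to rerun you") amounts to.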

How Much is “Affordable”?

So far, we have shown that it’s feasible for small teams to run independent LLM evaluations. Whether you should invest in producing such evaluations should be decided case by case. That said, this kind of “experiment” for assessing LLM capabilities generalizes across many fields, so we want to share roughly how much we invested to help you factor costs into planning your own evaluations.

For the AI4Math benchmark, we spent approximately $1,750 on dataset creation (hackathon prize pool) and $290 on miscellaneous compute (mostly LLM API credits), bringing the direct out-of-pocket total to about $2,040 USD. This figure excludes other costs such as team salaries and online workspace expenses, which can vary widely by region.
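
If it helps with planning, the API portion of a budget can be roughed out from token counts and per-token prices before committing to a full run. The sketch below is purely illustrative: the model count, token counts, and prices are assumptions, not our actual usage or the breakdown behind the $290 figure.

    # Back-of-the-envelope API cost estimate for one full benchmark run.
    # Every number here is a placeholder assumption; substitute your provider's
    # current per-token prices and your own measured token counts.
    N_PROBLEMS = 105
    N_MODELS = 6                # assumed
    N_REPETITIONS = 3
    TOKENS_IN_PER_CALL = 600    # prompt + problem statement (assumed)
    TOKENS_OUT_PER_CALL = 800   # step-by-step answer (assumed)
    PRICE_IN_PER_MTOK = 3.00    # USD per million input tokens (assumed)
    PRICE_OUT_PER_MTOK = 15.00  # USD per million output tokens (assumed)

    calls = N_PROBLEMS * N_MODELS * N_REPETITIONS
    cost_usd = calls * (TOKENS_IN_PER_CALL * PRICE_IN_PER_MTOK
                        + TOKENS_OUT_PER_CALL * PRICE_OUT_PER_MTOK) / 1_000_000
    print(f"{calls} calls, estimated API cost: ${cost_usd:.2f}")

Estimates like this are worth recomputing whenever you change models or add repetitions, since repetitions multiply the call count directly.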

In our case, most team members were employed on an hourly or, at most, half-time basis, and this benchmarking work was not our only responsibility; we also ran AI Safety mentorship programs concurrently. We estimate that a comparable team of ~3 FTEs could replicate our efforts in 3–4 months. This is encouraging for small teams and supports the idea that these evaluations are not exclusive to large or highly specialized research groups.

Conclusion

Our experience shows that independent evaluations don’t need to be massive or resource-intensive to be meaningful. With a clear experimental mindset, simple but transparent setups, and context-relevant tasks, small teams can generate results that genuinely inform both local decisions and the broader conversation about model reliability.

The strength of this approach lies not in building the next universal leaderboard, but in multiplying perspectives: more languages, more domains, more contexts where failure modes matter. Each evaluation becomes a piece of shared infrastructure: open, reproducible, and trustworthy.

As models continue to shape high-stakes environments, the question isn’t just “who performs best on a global benchmark,” but “what works well enough, under what conditions, for whom?” Answering that requires decentralization. Our hope is that others—whether researchers, practitioners, or local communities—take up these recommendations, adapt them to their needs, and keep building toward a healthier, more plural evaluation ecosystem.