Independent evaluations are both possible and valuable. Our goal is to widen the conversation on decentralized, reproducible, context-aware evaluations as public infrastructure for AI oversight, especially in regions and languages that frontier work often overlooks.
Our recommendations (below) are based on what actually worked for us.
If more small teams adopt this approach—local languages, transparent methods, light but reliable stats—we’ll get a healthier, more trustworthy evaluation ecosystem.
Most well-known benchmarks are built and interpreted by a small number of well-resourced actors—big labs, elite universities, and a few private platforms. That concentration helped the field move fast, but it also created blind spots.
We don’t think centralized efforts are “bad.” We think they’re insufficient on their own—especially as models show up in places where failure matters (education, healthcare, legal workflows, public services). At that point, evaluation stops being just a research nicety and becomes a governance question: who gets to define “good enough,” for whom, and under what conditions?
Our own experience with AI4Math and our follow-up on repetitions convinced us that small teams can produce useful evaluations if they keep things simple and scientific: make problems that actually reflect your context; publish your configs; and treat the exercise like an experiment, not a leaderboard race. That approach let us give cost-aware guidance without needing huge infrastructure.
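As one concrete way to act on "publish your configs," here is a minimal sketch of a run manifest that pins everything someone would need to re-run the same comparison. Every field name, model identifier, and value below is illustrative; this is not the actual AI4Math schema.

```python
import json

# Illustrative run manifest (field names are ours, not AI4Math's actual schema).
# Pinning the dataset version, models, decoding settings, prompt, and number of
# repetitions is what lets someone else reproduce the comparison.
eval_config = {
    "dataset": {"name": "ai4math-style-benchmark", "version": "v1.0", "language": "es"},
    "models": ["model-a-2024-06", "model-b-2024-05"],   # placeholder identifiers
    "decoding": {"temperature": 0.0, "max_tokens": 1024},
    "prompt_template": "Resuelve el siguiente problema paso a paso:\n{problem}",
    "repetitions_per_question": 10,
    "scoring": "exact_match_on_final_answer",
    "random_seed": 1234,
}

# Publish this file alongside the results so the run can be repeated as-is.
with open("eval_config.json", "w", encoding="utf-8") as f:
    json.dump(eval_config, f, ensure_ascii=False, indent=2)
```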
This isn’t just for research groups. If you’re a practitioner deciding whether to deploy an LLM in a classroom, a clinic, or a back-office process, you need task-specific, context-specific evidence. A lightweight, experimental evaluation—designed around your real use case—can tell you far more than a global leaderboard ever will. Done right, it becomes standard operating procedure: define the task, test transparently, report uncertainty, and only then make the call.
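For the "report uncertainty" step, one lightweight option is a percentile bootstrap over per-question outcomes, as in the sketch below. The numbers are made up for illustration, and this is not the exact statistical analysis we ran.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy,
    given per-question 0/1 outcomes from one evaluation run."""
    rng = random.Random(seed)
    n = len(outcomes)
    resampled_means = sorted(
        sum(outcomes[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = resampled_means[int(n_boot * alpha / 2)]
    hi = resampled_means[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(outcomes) / n, (lo, hi)

# Made-up example: 100 questions, 64 answered correctly.
outcomes = [1] * 64 + [0] * 36
acc, (lo, hi) = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {acc:.2f}, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```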
Bottom line: Independent evaluations are both possible and valuable. They widen the lens, strengthen trust, and help the community converge on methods that travel: transparent setups, reproducible runs, and attention to the contexts where these systems actually live.
If centralized, English-first evaluations miss important contexts, the remedy is simple: build local, transparent, reproducible setups that others can copy. That’s what we did, first by creating a Spanish-native dataset, then by defining a clear comparison setup, and finally by running the whole thing as an experiment to learn how many repetitions you actually need.
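On the repetitions question, the sketch below is a standard sample-size heuristic rather than the exact procedure from our follow-up study: estimate the run-to-run spread from a small pilot, then back out how many repetitions you would need to pin down mean accuracy to a target precision.

```python
import math
import statistics

def repetitions_needed(pilot_accuracies, margin=0.02, z=1.96):
    """Rough rule of thumb: how many repeated runs are needed so the mean
    accuracy is estimated to within +/- `margin` at ~95% confidence, given
    the run-to-run spread observed in a small pilot."""
    spread = statistics.stdev(pilot_accuracies)
    return max(2, math.ceil((z * spread / margin) ** 2))

# Made-up pilot: five repeated runs of the same model and config.
pilot = [0.62, 0.65, 0.61, 0.64, 0.63]
print(repetitions_needed(pilot))
```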
Here are the practical recommendations we now follow (and suggest to others). They’re written so that you can adopt them directly if you’re doing something similar.
So far, we have shown that it’s feasible for small teams to run independent LLM evaluations. Whether producing one is worth the investment is a case-by-case decision, but the practice of running these “experiments” to assess LLM capabilities generalizes across many fields. We therefore want to share roughly how much we invested, so you can factor this into planning your own evaluations.
For the AI4Math benchmark, we spent approximately $1,750 on dataset creation (hackathon prize pool) and $290 on miscellaneous compute (mostly LLM API credits), bringing the direct out-of-pocket total to about $2,040 USD. This figure excludes other costs such as team salaries and online workspace expenses, which can vary widely by region.
In our case, most team members were employed on an hourly or, at most, half-time basis, and this benchmarking work was not our only responsibility; we also ran AI Safety mentorship programs concurrently. We estimate that a comparable team of ~3 FTEs could replicate our effort in 3–4 months. This is encouraging for small teams and supports the idea that these evaluations are not exclusive to large or highly specialized research groups.
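For planning purposes, the API portion of the budget is easy to rough out in advance. The sketch below is a back-of-the-envelope estimator; every token count and per-token price is a placeholder to be replaced with your provider's actual rates.

```python
def estimate_api_cost(n_questions, n_repetitions, n_models,
                      tokens_in_per_call=600, tokens_out_per_call=400,
                      usd_per_1k_in=0.003, usd_per_1k_out=0.015):
    """Rough API budget: number of calls times cost per call.
    All defaults are placeholders, not real provider pricing."""
    calls = n_questions * n_repetitions * n_models
    cost_per_call = (tokens_in_per_call / 1000) * usd_per_1k_in \
                  + (tokens_out_per_call / 1000) * usd_per_1k_out
    return calls * cost_per_call

# Example: 100 questions x 10 repetitions x 6 models under the placeholder rates.
print(f"~${estimate_api_cost(100, 10, 6):,.2f}")
```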
Our experience shows that independent evaluations don’t need to be massive or resource-intensive to be meaningful. With a clear experimental mindset, simple but transparent setups, and context-relevant tasks, small teams can generate results that genuinely inform both local decisions and the broader conversation about model reliability.
The strength of this approach lies not in building the next universal leaderboard, but in multiplying perspectives: more languages, more domains, more contexts where failure modes matter. Each evaluation becomes a piece of shared infrastructure that is open, reproducible, and trustworthy.
As models continue to shape high-stakes environments, the question isn’t just “who performs best on a global benchmark,” but “what works well enough, under what conditions, for whom?” Answering that requires decentralization. Our hope is that others—whether researchers, practitioners, or local communities—take up these recommendations, adapt them to their needs, and keep building toward a healthier, more plural evaluation ecosystem.