AI benchmarks seem to saturate quite quickly. One sentiment I've heard a lot is that AI companies optimize their training for the most popular benchmarks. In the best case, that means focusing training on the topics that are benchmarked most heavily. Even that is somewhat suboptimal: benchmarks tend to be a proxy for the real skill, and now the AI is trained for the proxy. In the worst case, the training is iterated directly against the benchmark itself, overfitting to it and inflating the scores.
And avoiding this completely is not that easy. The training dataset is essentially the whole internet. When someone publishes a benchmark, the training set includes that. And people post benchmark solutions online too; those will be in the training data as well. Filtering all of that from the training data is not easy even if you actually try.
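To make the difficulty concrete, here is a minimal sketch of the usual decontamination approach: drop any training document that shares a long-enough word n-gram with a benchmark question. The function names and the default n-gram length of 13 are illustrative, not from any particular pipeline; note that paraphrases and translations of a task sail right through this kind of filter, which is part of why filtering "is not easy even if you actually try".

```python
def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_blocklist(benchmark_questions: list, n: int = 13) -> set:
    """Collect every n-gram that appears in any benchmark question."""
    blocked = set()
    for question in benchmark_questions:
        blocked |= ngrams(question, n)
    return blocked

def is_contaminated(document: str, blocklist: set, n: int = 13) -> bool:
    """Flag a document that shares any n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(blocklist)
```

An exact-match filter like this only catches verbatim copies; reworded solutions posted on forums stay in the training data.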
There's an intuitive solution that some evaluators use: publish part of the tasks and keep the rest, a holdout set, private. If the models do better on the public set than on the holdout set, the contamination effect can be measured. This approach has some limitations, which I'll discuss below.
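The measurement itself is simple. A sketch, assuming per-task pass/fail results (the function names and the rough two-standard-errors rule of thumb are mine, not from any specific evaluation):

```python
import math

def accuracy(results: list) -> float:
    """Fraction of tasks passed, given a list of booleans."""
    return sum(results) / len(results)

def contamination_gap(public_results: list, holdout_results: list):
    """Return (gap, standard_error) between the public and holdout splits.

    A positive gap well above ~2 standard errors suggests the public
    split leaked into training; a gap within noise suggests it didn't.
    """
    pa, ha = accuracy(public_results), accuracy(holdout_results)
    se = math.sqrt(pa * (1 - pa) / len(public_results)
                   + ha * (1 - ha) / len(holdout_results))
    return pa - ha, se
```

With typical holdout sizes of a few hundred tasks, the standard error is several percentage points, so only fairly blatant contamination shows up as significant.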
Firstly, the holdout set actually needs to stay private. Sometimes the researchers just don't have enough integrity and sell out, as FrontierMath did (perhaps an overly cynical take). Also, since the frontier models are not available for self-hosting, AI companies could in principle extract the benchmark questions from their API logs, though this seems somewhat unlikely.
Most agents nowadays also have internet access, so even perfectly filtering the benchmark-related content from the training data isn't enough. You could disable internet access for benchmarking, but then you're not measuring what a state-of-the-art system can actually do. Alternatively you could block or censor that content in the browser tool, but this is complicated and error-prone. Ideally, we could have a section of the internet that's inaccessible to LLM training and LLM browser tools. Sadly, that requires everyone with access to uphold the rules, which won't happen.
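To illustrate why the browser-tool route is error-prone, here is a naive sketch of such a censor: hard-block known benchmark hosts and redact pages that quote task text verbatim. The host name and redaction rule are hypothetical. Mirrors, paraphrases, and screenshots all slip through, and every new leak site needs to be added by hand.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts the browser tool refuses to fetch.
BLOCKED_HOSTS = {"benchmark.example.org"}

def fetch_filtered(url: str, page_text: str, task_snippets: list):
    """Return the page text the agent is allowed to see, or None if blocked.

    Blocks the benchmark's own site outright and redacts verbatim
    quotes of task text from any other page.
    """
    if urlparse(url).hostname in BLOCKED_HOSTS:
        return None  # hard-block the benchmark's own site
    for snippet in task_snippets:
        if snippet in page_text:
            page_text = page_text.replace(snippet, "[redacted]")
    return page_text
```

Even this toy version shows the core weakness: it only matches exact strings, so a forum post that rewords a task, or an image of it, reaches the agent untouched.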
There's another issue with leaking the benchmark contents to the model. Even if the actual solutions don't leak, knowledge of the specific benchmark can cause the model to realize it's under evaluation. Things to worry about here include trying to game the benchmark in some way, sandbagging, and in general performing differently in and out of evaluation.
A fully private benchmark is mostly safe from these issues, but is otherwise significantly less useful. As a milder version, the evaluators could be really vague about the actual contents. It's harder to review, harder to trust, and the actual results mean less if you don't know what exactly is getting tested. Using a trusted 3rd party auditor could help with this quite a bit.
Going a bit further, the evaluators could publicly lie or at least mislead in their published task set, substituting the actual evaluation with equivalent but different contents. Then the 3rd party auditors would need to be convinced that this is a reasonable thing to do. This seems like a terrible idea to me.
One way to completely mitigate this problem is testing with tasks that don't have a known answer, like open mathematical problems. This of course has the downside that the benchmark is rather hard and gives only a binary result per task (or at least partial solutions with novel approaches are hard to score). And if the open problem is solved, there's quite a bit of pressure to release the solution too. And again, if the problems are public, and the models widely available, anyone can solve the problem using these models and publish it.
Another way is having tasks that are really close to the actual thing. These are hard to game, as working solutions should also work outside the benchmark. They also make it hard to differentiate the evaluation from the real-world task, since it might as well be either one. Often partial solutions are gradeable too. The issues are mostly that real-world benchmarks are hard to make deterministic and well defined, and they are expensive to run. This is doubly so if the task has to be executed in the physical world, as is the case with robotics. Another problem is that interacting with the real world has side effects; a computer-use benchmark might want to book flights, make a phone call, or publish a blog post.
For instance, imagine a benchmark that tries to measure how well a non-technical founder could build a SaaS company. Naturally it has to be published somewhere, and using a cloud platform would be the usual call. Ideally you'd like to have a fake platform so that the site could be published only in your simulated internal system. But there are dozens of cloud services and the AI could choose any of them. A reasonable solution is railroading the AI to use a specific service, the one you've chosen to emulate. But this in itself is something that differentiates evaluation from the real world scenario. It might be a reasonable tradeoff, but it's important to realize it is one.
I haven't seen many of these real-world benchmarks yet. I'm not sure why, as it doesn't seem prohibitively expensive to me. There's OSWorld, which contains small well-defined tasks and is relatively close to being saturated. The recently released PhAIL, which measures the ability to control an industrial robot arm in a single task, seems quite interesting too, but it doesn't assess any of the frontier LLMs, only weights-available VLAs.
It would be nice to have more extensive real-world task benchmarks.