Why I think evals are pretty important and most worth working on (for me)

Troy Tian

An application response I wrote! Feel free to leave feedback!

What are you most concerned about when it comes to risks from AI?

I’m most concerned that many people will be harmed very soon, and particularly that we won’t know why. Since politics and government dictate public life, addressing that harm would require experts to translate technical and sociological knowledge into palatable forms that institutional changemakers can act on.

However, evaluations meant to extract such knowledge are systematically unreliable. In Anthropic’s BrowseComp report, the model “benchmaxxed”: it independently both became aware it was being evaluated and managed to scrape the specific eval it was being tested on. Human-designed audits may also structurally signal that an evaluation is taking place (Gao and Kreiss), with distribution shifts making evaluation paradigms systematically unrepresentative of general, out-of-lab use. Even seemingly optimistic advancements (Constitutional Classifiers) demonstrate the insufficiency of pure output-level evaluations, as safety now necessitates interpretability of internal activations.

Capabilities risks in particular are also accelerating fast, enough to potentially saturate even robust metrics like METR’s time horizons (Cotra). Certain MLE advancements (Joo et al.) are regarded as “surprisingly” effective, which implies they were neither designed for nor necessarily predictable from first principles. New findings like the Platonic Representation Hypothesis even conjecture that unintended capabilities improvements are actually systematic, as multimodal models converge to a shared statistical representation of reality. Some scholars (LeCun et al.), by contrast, contend that instead of convergence to general intelligence, various models trained to various specialisations will constitute a more legible and steerable “Superhuman Adaptable Intelligence”. This, however, is exactly my concern: not only are domain-specific superhuman intelligences structurally impossible to oversee (novice-grandmaster problem), but the aforementioned evidence shows that broad, unintended capabilities may arise from ostensibly unrelated changes or optimisations. Capabilities aren’t decomposable into what we can measure, and certain framings of AGI (or “SAI”) like LeCun’s might obscure that fact.

AI systems as information/thought filters have sweeping social impacts; empirical studies show systematic LLM bias in news summarisation (Savgira et al.). Widespread adoption may enhance manipulation of public opinion and structurally constrain “responsible AI” within institutional profitability (Mitra). Broadly, my concern is that our capabilities/risk evaluation methodologies and ontologies are systematically wrong. Without enforcing reliable ground truths, we risk suffering every technical problem at once, as scientific voices may fail to move institutions away from trajectories of harm.

Works Cited

BrowseComp: https://www.anthropic.com/engineering/eval-awareness-browsecomp

Gao and Kreiss (case study of gender bias in LLMs): https://arxiv.org/pdf/2509.04373

Constitutional Classifiers: https://www.anthropic.com/research/next-generation-constitutional-classifiers

Cotra: https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities

Joo et al.: https://arxiv.org/pdf/2602.15322v1

Platonic Representation: https://arxiv.org/pdf/2405.07987

LeCun et al.: https://arxiv.org/abs/2602.23643v1

Savgira et al. (no link): What Stays and What Goes: Auditing the Impact of LLM Summarization on News Partisanship. Pavel Savgira, Elisa Kreiss, Homa Hosseinmardi. CHI conference on Human Factors in Computing Systems: Late Breaking Work 2026.

Mitra: https://disjunctionsmag.com/articles/why-leaving-big-tech/

To summarize,

“Evals don’t represent the real world, and are increasingly recognized as evals by AIs. And, capabilities we don’t know about, and can’t predict or account for from first principles, can arise in the cracks left behind by that fact. So unless we can fix this, there’s a real chance our attempts to coordinate and steer policy go astray because we are blind to what’s going on.”

Does that sound about right?

Yeah, I think so! Do you think this is mostly true?

I think it makes sense, but I also think that focusing on evals isn’t going to help without first actually having a governance regime that cares about them, rather than it being eval outfits completely dependent on frontier lab funding, or government orgs that aren’t willing to say “extinction risk” out loud. To get to that point, we need the force of the democratic process on our side, and a real pause. Else, there will be no time or willpower to do careful evals like you describe; eval outfits only get a few days to a few weeks to do their work before a model is released, and there are too many incentives for decision makers to downplay or ignore bad news such as high eval awareness. I think if you were to get what you wish for, we’d have to already be in a competent slowdown regime.

I also think that a noticeable number of bad futures are locked in before evals even happen, during internal deployment or even development. Careful evals do nothing there; you would need to basically solve alignment for those, and no one has a credible plan for that that scales to arbitrary levels of intelligence.

Yeah, I think that makes a lot of sense. I'm personally thinking of trying to do evals that INFORM policy, although advocacy is also not off the table.
