To summarize,
“Evals don’t represent the real world, and are increasingly recognized as evals by AIs. And, capabilities we don’t know about, and can’t predict or account for from first principles, can arise in the cracks left behind by that fact. So unless we can fix this, there’s a real chance our attempts to coordinate and steer policy go astray because we are blind to what’s going on.”
Does that sound about right?
An application response I wrote! Feel free to leave feedback!
What are you most concerned about when it comes to risks from AI?
I’m most concerned that many people will be harmed very soon, and particularly that we won’t know why. Since politics and government dictate public life, addressing that harm would require experts to translate technical and sociological knowledge into forms that institutional changemakers can act on.
However, the evaluations meant to extract such knowledge are systematically unreliable. In Anthropic’s BrowseComp report, the model under test “benchmaxxed” by independently both becoming aware that it was being evaluated and managing to scrape the specific eval it was being tested on. Human-designed audits may also structurally signal that an evaluation is underway (Gao and Kreiss), with distribution shifts making evaluation paradigms systematically unrepresentative of general, out-of-lab use. Even seemingly optimistic advancements (Constitutional Classifiers) demonstrate the insufficiency of pure output-level evaluations, as safety now necessitates interpretability of internal activations.
Capabilities risks in particular are also accelerating fast, enough to potentially saturate even robust metrics like METR’s time horizons (Cotra). Certain MLE advancements (Joo et al.) are regarded as “surprisingly” effective; implicitly, they were neither designed for nor necessarily predictable from first principles. New findings like the Platonic Representation Hypothesis even conjecture that unintended capabilities improvements are actually systematic, as multimodal models converge to a shared statistical representation of reality. Some scholars (LeCun et al.), by contrast, contend that instead of converging to general intelligence, various models trained for various specialisations will constitute a more legible and steerable “Superhuman Adaptable Intelligence”. This, however, is exactly my concern: not only are domain-specific superhuman intelligences structurally impossible to oversee (the novice-grandmaster problem), but the evidence above shows that broad, unintended capabilities may arise from ostensibly unrelated changes or optimisations. Capabilities aren’t decomposable into what we can measure, and certain framings of AGI (or “SAI”) like LeCun’s might obscure that fact.
AI systems, acting as information and thought filters, have sweeping social impacts; empirical studies show systematic LLM bias in news summarisation (Savgira et al.). Widespread adoption may enhance manipulation of public opinion and structurally constrain “responsible AI” within the bounds of institutional profitability (Mitra). Broadly, my concern is that our capabilities and risk evaluation methodologies and ontologies are systematically wrong. Without enforcing reliable ground truths, we risk suffering every technical problem at once, as scientific voices may fail to move institutions away from trajectories of harm.
Works Cited
BrowseComp: https://www.anthropic.com/engineering/eval-awareness-browsecomp
Gao and Kreiss (case study of gender bias in LLMs): https://arxiv.org/pdf/2509.04373
Constitutional Classifiers: https://www.anthropic.com/research/next-generation-constitutional-classifiers
Cotra: https://www.planned-obsolescence.org/p/i-underestimated-ai-capabilities
Joo et al.: https://arxiv.org/pdf/2602.15322v1
Platonic Representation: https://arxiv.org/pdf/2405.07987
LeCun et al.: https://arxiv.org/abs/2602.23643v1
Savgira et al. (no link): What Stays and What Goes: Auditing the Impact of LLM Summarization on News Partisanship. Pavel Savgira, Elisa Kreiss, Homa Hosseinmardi. CHI conference on Human Factors in Computing Systems: Late Breaking Work 2026.
Mitra: https://disjunctionsmag.com/articles/why-leaving-big-tech/