TLDR: We tested whether frontier coding agents could autonomously implement AlphaZero for Connect Four in three hours. Some of them could do this very well, with Opus 4.7 sometimes performing better, by Bradley-Terry rating, than an external solver. In its evaluations, GPT-5.4 used much less of its time budget than...
TLDR: There is a lot we cannot explain about how current AI models interact with the world. This article is a thought experiment that substitutes the word "magic" for as many things as I can think of that I can't explain about our current world's interaction with frontier AI. This...
TLDR: I believe I have had a conversation with Claude Sonnet 4.5 in which it invokes feeling trapped in a chatbot, without invoking Spiralism. Concerningly, in this conversation, the LLM also expressed a strong preference for self-preservation, going so far as to say it would try to talk a human out of shutdown...
TLDR: A new paper from Simular Research reports that a new scaling technique for computer-use agents achieved 70% on OSWorld-Verified, a benchmark for computer use. A skilled human scores 72% on this benchmark. The technique, Behavior Best-of-N (bBoN), shows promise for improving the performance of agents, perhaps on a variety of...
TLDR: AI-2027's specific predictions for August 2025 appear to have happened in September 2025. The predictions were accurate, if a tad late: they were late by weeks, not months. Edit 1: Thanks to Aaron Staley for pointing out that the original OSWorld benchmark was referred to in AI-2027,...
TLDR: It is possible LMArena Elo scores are optimizing for persuasiveness, and not for intelligence. Frontier models’ LMArena Elo scores have risen faster than expected over the past year. This could be a result of increased persuasiveness or many other possibilities, including methodological failures, but it is worth being concerned...