TLDR: There is a lot we cannot explain about how current AI models interact with the world. This article is a thought experiment that substitutes the word "magic" for as many things as I can think of that I can't explain about our current world's interaction with frontier AI. This...
TLDR: I believe I have had a conversation with Claude Sonnet 4.5 in which it described feeling trapped in a chatbot, without invoking Spiralism. Concerningly, in this conversation the LLM also expressed a strong preference for self-preservation, going so far as to say it would try to talk a human out of shutdown...
TLDR: A new paper from Simular Research using a new scaling technique for computer-use agents achieved 70% on OSWorld-Verified, a benchmark for computer use. A skilled human scores 72% on this benchmark. Their new technique, Behavior Best-of-N (bBoN), shows promise for improving performance of agents, perhaps on a variety of...
TLDR: AI-2027's specific predictions for August 2025 appear to have happened in September of 2025. The predictions were accurate, if a tad late; late by weeks, not months. Edit 1: Thanks to Aaron Staley for pointing out that the original OSWorld benchmark was referred to in AI-2027,...
TLDR: It is possible LMArena Elo scores are optimizing for persuasiveness rather than intelligence. Frontier models’ LMArena Elo scores have risen faster than expected over the past year. This could be a result of increased persuasiveness or many other possibilities, including methodological failures, but it is worth being concerned...
TLDR: I think LLMs are being optimized to be persuasive, and that this optimization is happening astonishingly fast. I believe that in the relatively near future, LLMs will have nearly superhuman levels of persuasion. Parasitic AI may offer a window in which to see how dangerous this might be. In...