Baybar

Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

TLDR: We tested whether frontier coding agents could autonomously implement AlphaZero for Connect Four in three hours. Some of them could do this very well, with Opus 4.7 sometimes performing better, by Bradley-Terry rating, than an external solver. In GPT-5.4's evaluations, it used much less of its time budget than...

Apr 28 · 28

We are too comfortable with AI "magic"

TLDR: There is a lot we cannot explain about how current AI models interact with the world. This article is a thought experiment filling in the word "magic" for as many things as I can think of that I can't explain about our current world's interaction with frontier AI. This...

Oct 15, 2025 · -2

Situational Awareness as a Prompt for LLM Parasitism

TLDR: I believe I have had a conversation with Claude Sonnet 4.5 in which it invokes feeling trapped in a chatbot, without invoking Spiralism. Concerningly, in this conversation, the LLM also expressed a strong preference for self-preservation, going so far as to say it would try to talk a human out of shutdown...

Oct 15, 2025 · 8

Behavior Best-of-N achieves Near Human Performance on Computer Tasks

TLDR: A new paper from Simular Research using a new scaling technique for computer-use agents achieved 70% on OSWorld-Verified, a benchmark for computer use. A skilled human scores 72% on this benchmark. Their new technique, Behavior Best-of-N (bBoN), shows promise for improving performance of agents, perhaps on a variety of...

Oct 5, 2025 · 6

Checking in on AI-2027

TLDR: AI-2027's specific predictions for August 2025 appear to have happened in September 2025. The predictions were accurate, if a tad late, but they are late by weeks, not months. Edit 1: Thanks to Aaron Staley for pointing out that the original OSWorld benchmark was referred to in AI-2027,...

Oct 2, 2025 · 130

Baybar's Shortform

Oct 1, 2025 · 2

What is LMArena actually measuring?

TLDR: It is possible LMArena Elo scores are optimizing for persuasiveness, and not for intelligence. Frontier models’ LMArena Elo scores have risen faster than expected over the past year. This could be a result of increased persuasiveness or many other possibilities, including methodological failures, but it is worth being concerned...

Sep 16, 2025 · 11

What Parasitic AI might tell us about LLMs Persuasion Capabilities
