Moltbook, advertised as a social network for AI agents, has been going viral for "emergent" behaviour, including signs of misalignment. However, it's not clear whether these behaviours are truly occurring autonomously, as people have been interpreting them. To some extent, people are realizing the posts are heavily prompted by human users. But...
We built OpenForecaster, an 8B model trained to make predictions on open-ended forecasting questions. It is competitive with much larger proprietary models in held-out testing. We train it with RL on our OpenForesight dataset, which has 52K forecasting questions created from global news using an automated recipe. This improves forecasting...
TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR’s underlying data. The topic of each sample is public, making it easy for a frontier lab to game METR horizon length measurements, sometimes inadvertently. Finally, the “horizon length” under METR’s assumptions might be...
New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. TLDR: Using MCQs for AI benchmarking is problematic--you can guess the answer without even looking at the question (in multimodal MCQ datasets, without the image!). We knew this, but there didn't seem to be any alternative. We show now that language...
Here, I track my evolving thoughts on what remains on the path to building generally-intelligent agents. Why does this matter? Three compelling reasons: 1. Top-down view: AI research papers (and product releases) move bottom-up, starting from what we have right now and incrementally improving, in the hope we eventually converge...
There has been a flurry of recent papers proposing new RL methods that claim to improve the “reasoning abilities” of language models. The most recent ones, which show improvements with random or no external rewards, have led to surprise, excitement and confusion. We analyzed 7 popular LLM RL papers (100+...
This post makes a simple point, so it will be short. I am happy to discuss more in the comments and, based on that, write a longer post later. Much prior work (e.g., [1]) has shown that exponentially more data and compute are required for each unit improvement in accuracy. A...