Over the last 200 years, society has become steadily more liberal. Serfdom and slavery were largely abolished, democracy spread rapidly, and most women and children are now free from violence. Redistribution is widely practiced, and the median person is far richer, healthier and freer than ever before. Some neighborhoods in San Francisco already...
The time it takes an AI or a human+AI team (a "cyborg") to complete a task is a key part of what we care about when we talk about capabilities. The relationship between AI capabilities and cyborg capabilities is very useful for forecasting AI timelines. We simply don’t have good...
ProgramBench is a new coding benchmark that all frontier models spectacularly fail. We’ve been on a quest for “hard benchmarks” for a while so it’s refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it’s impossible! What is ProgramBench? ProgramBench tests if a...
The low-background steel problem: Modern steel is slightly radioactive. We did a lot of atomic testing in the 1940s and 1950s, and now the atmosphere contains radioactive particles, which make their way into steel during production. This is mostly fine, but some scientific instruments require steel that...
[EDIT: LawrenceC, who works at METR, responds to this.] In this post, we’ll discuss three major problems with the METR eval and propose some solutions. Problem 1: The METR eval produces results with egregious confidence intervals, and the METR chart misleadingly hides this. Problem 2: There's a lack of sample...