Is METR Underestimating LLM Time Horizons?
TL;DR: Using METR's human-baseline data, I define an alternative LLM time-horizon measure: the longest time horizon over which an LLM exceeds human-baseline reliability (equivalently, the intersection point of the human and LLM logistic curves). This measure shows a much faster growth trend than METR's fixed-threshold trends:...
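To make the intersection-point definition concrete, here is a minimal sketch. It assumes each success curve is a logistic in log2 of task length (in minutes) with an intercept `alpha` and a slope `beta`; that parameterization, the function names, and the numbers in the usage note are illustrative assumptions, not fitted values from METR or from the post.

```python
import math

def success_prob(alpha, beta, minutes):
    """Assumed model: logistic success probability in log2(task length in minutes)."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * math.log2(minutes))))

def fixed_threshold_horizon(alpha, beta, p=0.5):
    """Task length (minutes) at which one fitted curve crosses probability p."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** ((logit_p - alpha) / beta)

def crossover_horizon(llm, human):
    """Task length (minutes) where the LLM and human curves intersect.

    Each argument is an (alpha, beta) pair. Returns None when the slopes
    are equal, because parallel curves never cross.
    """
    (a_m, b_m), (a_h, b_h) = llm, human
    if math.isclose(b_m, b_h):
        return None
    # Solve a_m + b_m * x = a_h + b_h * x for x = log2(minutes).
    return 2.0 ** ((a_h - a_m) / (b_m - b_h))

# Hypothetical parameters, chosen only for illustration.
llm_fit = (3.0, -0.6)    # higher intercept, steeper decline with task length
human_fit = (2.0, -0.2)  # lower intercept, shallower decline
```

With these made-up parameters, `fixed_threshold_horizon(*llm_fit)` gives about 32 minutes, while `crossover_horizon(llm_fit, human_fit)` gives about 5.7 minutes. Below the crossover the hypothetical LLM is more reliable than the human baseline, and above it the LLM is less reliable. Because the crossover depends on the difference in slopes as well as intercepts, a change in β alone can move it substantially.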
@Michaël Trazzi Actually, it's the opposite: the Claude progress was dominated by slope (β) improvement, and the intercept actually got a bit worse (see "Is METR Underestimating LLM Time Horizons?").