shash42

That's quite possible. I'm not sure how much that plays out with reinforcement learning training though.

Replying toHow to game the METR plot

This is cool! I think I'm updating toward the logistic fit not mattering. The question I have now is: what would it have taken on this underlying data for the log-linear trend not to hold. My guess is models not making progress for months, and staying at similar aggregate accuracy (with success rates staying roughly inversely correlated with task length).

Replying toClaude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins

Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins

The mean estimate of 50% success horizon length (headline number METR reports) went from ~1 to ~4 hours. The progress within the hour subranges is difficult to draw much information from, given the low number of data points, and distribution biases in topics. This is the precise claim of the new post I made, and linked :)

Replying toHow to game the METR plot

Thanks for checking this. Log-linear isn't that different from logistic in how it would affect the downstream prediction. Could you (someone at METR) update the public all-results file on GitHub so we can play around with this data?

I am particularly curious to know what would happen if we took the 50% horizon as the startpoint of the first bar the model drops below 50% accuracy. This increases uncertainty, but it would be interesting to see what trend comes out, and how model rankings change (is opus 4.5 a big update?).

I do expect it would still be an exponential trend, and agree with you that the underlying data distribution (specifically the topics aligning exactly with frontier lab priorities) is the more risky confounder. Although one could argue for choosing to do it this way, it just reduces chances of the horizon length being relevant outside the model's strongest areas.

Replying toHow to game the METR plot

thats an interesting point. If I kept adding points to the right, i.e. longer and longer tasks which I know the model would fail on, it would keep making the line flatter? That kind of makes me wonder, once again, if its even a good idea to try and fit a line here...

Replying toHow to game the METR plot

Thanks, I should've done that myself instead of lazily mentioning what it "looked like". R^2=0.51 is still a lot lower than the initial 0.83. Though same as before, I am not fully sure what this implies for the logistic model chosen, and downstream conclusions.

Replying toClaude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins

Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins

https://www.lesswrong.com/posts/2RwDgMXo6nh42egoC/how-to-game-the-metr-plot

Claude's performance is low on the 2-4 hour range, which mostly consists of cybersecurity tasks, potentially dual-use for safety. In general, training on cybersecurity CTFs and ML code would increase "horizon length" on the METR plot, which only has 14 samples in the relevant (1 - 4hr) range where progress happened in 2025.

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

2mo

TL;DR: In 2025, we were in the 1-4 hour range, which has only 14 samples in METR’s underlying data. The topic of each sample is public, making it easy to game METR horizon length measurements for a frontier lab, sometimes inadvertently. Finally, the “horizon length” under METR’s assumptions might be adding little information beyond benchmark accuracy. None of this is to criticize METR—in research, its hard to be perfect on the first release. But I’m tired of what is being inferred from this plot, pls stop!

14 prompts ruled AI discourse in 2025

The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms... (read 1374 more words →)

236

•••

Its a linkpost for https://arxiv.org/abs/2509.09677

I did make it a linkpost, not sure if just adding a summary isn't traditional?

Replying toIncorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

shash427mo

Thanks these are some great ideas. another thing you guys might want to look into is shifting away from mcqs towards answer matching evaluations: https://www.lesswrong.com/posts/Qss7pWyPwCaxa3CvG/new-paper-it-is-time-to-move-on-from-mcqs-for-llm

New Paper: It is time to move on from MCQs for LLM Evaluations

New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

7mo

TLDR: Using MCQs for AI benchmarking is problematic--you can guess the answer without even looking at the question (in multimodal MCQ datasets, without the image!). We knew this, but there didn't seem any alternative. We show now that language models are good enough, using small open-source ones to match generative responses to a ground-truth reference answer works much better, and turns out to be cheaper than MCQ evals!

Discriminative Shortcuts in MCQ

We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA etc.... (read 571 more words →)

An Alternative Way to Forecast AGI: Counting Down Capabilities

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

8mo

Here, I track my evolving thoughts on what remains on the path to building generally-intelligent agents. Why does this matter? Three compelling reasons:

1. Top-down view: AI research papers (and product releases) move bottom-up, starting from what we have right now and incrementally improving, in the hope we eventually converge to the end-goal. This is good, that’s how concrete progress happens. At the same time, to direct our efforts, it is important to have a top-down view of what we have achieved, and what are the remaining bottlenecks towards the end-goal. Besides, known unknowns are better than unknown unknowns.

2. Research prioritisation: I want this post to serve as a personal compass, reminding me... (read 609 more words →)

Replying toIncorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

shash429mo

Yes, that is a good takeaway!

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks

9mo

There has been a flurry of recent papers proposing new RL methods that claim to improve the “reasoning abilities” in language models. The most recent ones, which show improvements with random or no external rewards have led to surprise, excitement and confusion.

We analyzed 7 popular LLM RL papers (100+ to 3000+ likes, 50k+ to 500k+ views on X) including “Spurious Rewards”, “RL from 1 example”, and 3 papers exploring “Intrinsic Confidence Rewards”. We found that in most of these papers the improvements could be a mirage due to various accidental issues in the evaluation setups (discussed below). As such, the baseline numbers of the pre-RL models are massively underreported compared to official... (read more)

10mo

This post makes a simple point, so it will be short. I am happy to discuss more in the comments, and based on this write a longer post later. Much prior work (eg: [1]) has shown that exponential data and compute is required for each unit improvement in accuracy. A popular argument this leads to:

Scaling compute and data is economically not viable above a threshold

This argument has a key issue. Log-linear gains in accuracy (or loss) are shown on "one-step" benchmarks such as direct QA. However, economic benefits primarily arise from tasks which require long thinking, output and action horizons. For this, I propose tracking k-step success rate, i.e. the probability that... (read 256 more words →)

shash42's Shortform