Time horizon vs. model release date, using LLM-predicted human work-hours, for 10 successive state-of-the-art models on WeirdML. Error bars show 95% CI from task-level bootstrap. The exponential fit (orange line/band) gives a doubling time of 4.8 months [3.8, 5.8].
Key finding: WeirdML time horizons roughly double every 5 months, from ~24 minutes (GPT-4, June 2023) to ~38 hours (Claude Opus 4.6, February 2026).
Inspired by METR's work on AI time-horizons (paper), I wanted to do the same for my WeirdML data. WeirdML is my benchmark — supported by METR and included in the Epoch AI benchmarking hub and Epoch Capabilities Index — asking LLMs to solve weird and unusual ML tasks (for more details see the WeirdML page).
Lacking the resources to pay humans to solve the WeirdML tasks and measure the time, I instead asked LLMs to predict how long a median human AI researcher (with no AI assistance) would take to solve the different WeirdML tasks at various score thresholds (25%, 50%, 70%, 90% and 95%).
I gave the LLMs all the help I could: a detailed task description, a detailed specification of the human baseline and the affordances given to the human, and LLM-submitted code (from WeirdML runs) for each score threshold (where available) together with terminal outputs and associated scores, to give the LLMs some sense of how hard it is to score at a certain level on each task. Full details are below. The results look pretty nice, but they should be taken with a large grain of salt, given that no actual human completion times are known for these tasks.
Logistic fits for GPT-4 (top) and Claude Opus 4.6 (bottom). Bars show binned success rates (successes/total), the orange curve is the median bootstrap fit, and the shaded band is the 95% CI. The dotted line marks the 50% time-horizon.
More details and discussion can be found below. The full code for the data analysis, along with all the results, is available on GitHub. The project idea and methodology are mine. Most of the analysis code was written by Claude Code (Opus 4.6) and reviewed by me. I drafted this post, with edits and corrections suggested by Claude; the exception is the “Implementation details” section, which Claude drafted and I edited. Any remaining errors are mine.
LLM-predicted human completion times
LLM-estimated human completion times for all 17 WeirdML tasks at five score thresholds (25%–95%). Each panel is one task. Human estimates (author, purple stars with glow) are shown for the 3 tasks where available.
Above are the predictions from GPT-5.2, Gemini-3-Pro, Claude-Opus-4.5 and Grok-4 for how long it would take the median human AI researcher to solve the 17 different tasks (to different score levels). We see that they diverge a lot, sometimes by over an order of magnitude, with Opus typically on the low end.
I also made my own predictions for three of the tasks (before looking at the AI-predicted times), and predicted significantly lower human completion times, from a factor of 3 lower at 25% to a factor of 8 lower at 95%. I'm pretty sure the AIs are overestimating the human completion times at the highest thresholds (at least on the tasks I predicted): when we are talking about weeks and months, that opens up many options for the human to be ingenious (simulating data, reverse engineering the process that created the data, or simply hand-labeling data). I'm less sure the LLMs are overestimating at the lowest thresholds.
Results calibrated on my completion time predictions
Same as the main figure, but with LLM time estimates calibrated against the author's estimates on 3 tasks. Doubling time: 5.7 months [4.4, 6.8]. Absolute time horizons are ~3–8× lower.
Above we show results where we use my human estimates as an overall calibration of the LLM estimates. This makes the absolute time-horizons remarkably consistent with the METR results (probably a coincidence). However, a per-threshold analysis (see below) shows more consistent fits when using the uncalibrated LLM data. I'm unsure how to interpret this; there is some more discussion below.
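For concreteness, here is a simplified sketch of what such a per-threshold calibration can look like (the exact aggregation is in the code on GitHub; the function and column names here are hypothetical):
import numpy as np

def per_threshold_calibration(llm_est, human_est):
    # One possible per-threshold calibration (simplified sketch, not necessarily
    # the exact aggregation used in the repo): geometric mean of human/LLM time
    # ratios over the calibration tasks and estimator models, per threshold.
    # llm_est:   DataFrame with columns 'task', 'threshold', 'estimator', 'hours'
    # human_est: DataFrame with columns 'task', 'threshold', 'hours' (author estimates)
    merged = llm_est.merge(human_est, on=["task", "threshold"],
                           suffixes=("_llm", "_human"))
    merged["log_ratio"] = np.log(merged["hours_human"] / merged["hours_llm"])
    factors = np.exp(merged.groupby("threshold")["log_ratio"].mean())
    return factors  # multiply LLM estimates at each threshold by these factors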
Consistency of time-horizons for different thresholds
Per-threshold logistic fits for GPT-5, uncalibrated (top) and calibrated (bottom). Three groups: easy (25%+50%, blue), medium (70%, green), hard (90%+95%, red). Step plots show binned success rates with shared bin edges. The uncalibrated curves are more tightly clustered than the calibrated ones.
As a sanity check, we can fit the logistic curve separately for different threshold groups (25%+50%, 70%, 90%+95%) for the GPT-5 WeirdML results. Here we have much less data in each bucket, making it harder to fit curves; however, we see a clear trend where the high thresholds have shorter time-horizons than the low thresholds. This violates (at least the naive version of) the core assumption behind time-horizons: that task difficulty for humans (measured in completion time) maps consistently onto task difficulty for AI (measured in success rate).
These effects could be partially caused by biases in the estimator (plausible since one group has almost all successes and the other has almost all failures), but we see from the histograms (shown as short horizontal lines in the figures) that there is a real effect here. We already know that different types of tasks have different time-horizons, and (at least in retrospect) it makes sense that one task could be fairly quick to code up and get you to 95% given the right brilliant insight and some trial and error, while another task might just require a lot of boilerplate code to put everything together (unaided by AI) without needing any deep insight to reach 50%. These two tasks could have the same human completion time, but AI would presumably have a huge advantage on the second compared to the first.
Since the calibration based on my estimates assigns the highest thresholds relatively lower human completion times, it makes sense that the differences between threshold groups are even larger in that case, which is what we see. It's hard to know how much of this effect is real vs. an artifact of the LLM estimates — I would not be surprised to see a clear effect like this in the ground truth (if we actually had humans complete these tasks).
Discussion
The headline result — time horizons doubling roughly every 5 months — is fairly consistent with METR's finding of ~7 months, despite using a completely different benchmark, different task types, and LLM-estimated rather than measured human completion times. It is also remarkable how good a fit we get with a single curve through the data (although our data spans a much shorter period than METR's: June 2023 – February 2026, vs. 2019–2025).
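As a rough consistency check of the headline numbers: going from ~24 minutes (GPT-4) to ~38 hours (Claude Opus 4.6) over the roughly 32 months between June 2023 and February 2026 corresponds to
$$\log_2\!\left(\frac{38\ \text{h}}{24\ \text{min}}\right) = \log_2(95) \approx 6.6\ \text{doublings}, \qquad \frac{32\ \text{months}}{6.6} \approx 4.9\ \text{months per doubling},$$
close to the fitted 4.8 months.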
The human baselines are also not directly comparable. METR times experienced professional contractors (avg. ~5 years of experience) given the same affordances as the AI agents — and notably, for the RE-Bench tasks, human baseliners were permitted to use LLM assistance. The WeirdML baseline instead specifies a median AI researcher working without any AI assistance. AI-assisted humans would complete tasks faster, pushing METR's time horizons lower for the same model capability. These differences could shift the absolute time-horizon values, though they probably have a smaller (but still nonzero) effect on the doubling times.
The elephant in the room, however, is that we have no ground truth. The entire analysis rests on LLMs' ability to predict how long humans would take — and the one partial calibration point we do have (my own estimates for 3 tasks) suggests they systematically predict too high (and not by a small factor), especially at high score thresholds. I would not read too much into the absolute values of the time-horizons, but the trend is a much more robust quantity and it is largely consistent with the METR results.
Notably, the WeirdML doubling time of ~5 months lies between the old ~7-month doubling time and the newer ~4-month doubling time (after spring 2024) of the METR data. It is also notable that I do not see any kink in my data at that point, but given that I have only a couple of models before then, this is not very significant.
Even with these caveats, this was an interesting exercise. LLM judgments like these may not be very reliable today, but their reliability will increase, allowing more analyses like this one, where expensive human experiments are replaced by LLM judgment, for lack of a better option.
Implementation details
Below are more detailed explanations of the methods used. Full code is available on GitHub.
Logistic function fits
Each model in WeirdML has multiple scored runs per task (typically 5), and each run's score is converted to a binary outcome (pass/fail) at each of the five thresholds. Each binary outcome is paired with each of the four estimator LLMs' time predictions for that (task, threshold) combination, giving one data point per (task, threshold, estimator, run) — around 700–2000 per model depending on the number of runs. Each data point has the form $(t, y)$, where $t$ is the estimated human completion time and $y \in \{0, 1\}$ is the pass/fail outcome. We fit a logistic curve:
$$p(y = 1 \mid t) = \frac{1}{1 + e^{-\beta\,(\log_2 t - \log_2 h_{50})}},$$
where $h_{50}$ is the time horizon — the $t$ at which the model has 50% success probability. The slope $\beta$ is reparameterized as $\beta = -e^{\gamma}$ to keep it strictly negative (success should decrease with task duration), and both parameters are optimized using L-BFGS-B with bounds.
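A minimal sketch of this fit (a simplification; the actual fitting code is on GitHub and differs in details such as bounds and initialization):
import numpy as np
from scipy.optimize import minimize

def fit_logistic(times_hours, successes):
    # Fit p(success | t) = 1 / (1 + exp(-beta * (log2(t) - log2(h50)))),
    # with beta = -exp(gamma) so the slope stays strictly negative.
    x = np.log2(np.asarray(times_hours, dtype=float))
    y = np.asarray(successes, dtype=float)

    def neg_log_likelihood(params):
        log2_h50, gamma = params
        beta = -np.exp(gamma)
        p = 1.0 / (1.0 + np.exp(-beta * (x - log2_h50)))
        p = np.clip(p, 1e-9, 1 - 1e-9)  # avoid log(0)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    res = minimize(neg_log_likelihood, x0=[np.median(x), 0.0],
                   method="L-BFGS-B",
                   bounds=[(x.min() - 5.0, x.max() + 5.0), (-5.0, 5.0)])
    log2_h50, gamma = res.x
    return 2.0 ** log2_h50, -np.exp(gamma)  # time horizon h50 (hours), slope beta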
Using all four estimator models' time predictions as separate x-values naturally captures the uncertainty in the time estimates, but it effectively uses the same data point four times, which smears the data out in the time direction (this probably makes the histograms plotted above look smoother than they would under a more proper analysis). While this should not affect the 50% time-horizon point much, it will probably bias the slope $\beta$.
The different runs for the same model and task, and the different thresholds of the same task within each run, are also far from independent, so naively propagating the per-point uncertainty would grossly underestimate the true uncertainty. That is why we use a task-level bootstrap to estimate the uncertainty, and treat this logistic fit just as a simple way to get a point estimate for each bootstrap sample.
Task-based bootstrap
To estimate the uncertainty in $h_{50}$ and $\beta$, we use a task-level block bootstrap: resample the 17 tasks with replacement (5000 iterations), refitting the logistic curve each time. This accounts for within-task correlations — all thresholds, estimators, and runs for a given task are either all included or all excluded in each bootstrap sample.
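A sketch of the bootstrap loop, reusing the fit_logistic helper from the sketch above and assuming the data points live in a pandas DataFrame (hypothetical column names):
import numpy as np
import pandas as pd

def bootstrap_h50(df, n_boot=5000, seed=0):
    # Task-level block bootstrap: resample whole tasks with replacement and
    # refit the logistic curve on each resample. df is assumed to have columns
    # 'task', 'time_hours', 'success' (one row per (task, threshold, estimator, run)).
    rng = np.random.default_rng(seed)
    tasks = df["task"].unique()
    h50_samples = []
    for _ in range(n_boot):
        chosen = rng.choice(tasks, size=len(tasks), replace=True)
        sample = pd.concat([df[df["task"] == t] for t in chosen])  # duplicated tasks count twice
        h50, _ = fit_logistic(sample["time_hours"], sample["success"])
        h50_samples.append(h50)
    return np.array(h50_samples)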
Trend fit
An exponential trend is fitted to $h_{50}$ vs. release date across all 10 models: $\log_2 h_{50}(t) = a + b\,t$, where $t$ is years since the first model. This gives a doubling time of $1/b$ years. To propagate per-model uncertainty into the trend, we run 10,000 bootstrap iterations, drawing one random sample per model from its bootstrap distribution and fitting OLS each time.
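A corresponding sketch of the trend fit (again simplified, with hypothetical variable names):
import numpy as np

def fit_trend(years_since_first, h50_bootstrap, n_boot=10_000, seed=0):
    # Fit log2(h50) = a + b * t by OLS (t in years since the first model);
    # the doubling time is 1/b years. Per-model uncertainty is propagated by
    # drawing one bootstrap h50 sample per model in each iteration.
    rng = np.random.default_rng(seed)
    t = np.asarray(years_since_first, dtype=float)
    doubling_times = []
    for _ in range(n_boot):
        y = np.log2([rng.choice(samples) for samples in h50_bootstrap])
        b, a = np.polyfit(t, y, 1)  # slope b in doublings per year
        doubling_times.append(1.0 / b)
    return np.median(doubling_times), np.percentile(doubling_times, [2.5, 97.5])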
Full prompt for human completion time prediction
Here is the full prompt used for getting the four LLMs to predict human completion times for the shapes_easy task, excluding only the LLM-generated code and corresponding terminal output (one code+output example was included in the actual prompt for each 10% score interval where available). The prompt structure was the same for all 17 tasks.
Full Prompt
System prompt:
You are an expert at estimating how long software engineering and machine learning tasks take for human researchers to complete.
You have deep knowledge of:
Machine learning techniques and their implementation complexity
Common pitfalls and debugging challenges in ML projects
How long different types of ML tasks typically take skilled researchers
The relationship between model complexity and development time
You will be given a detailed description of an ML task, the environment it runs in, and examples of code that achieved various scores. Your job is to estimate how long a median human AI researcher would take to achieve specific score thresholds.
Provide your estimates in the JSON format specified at the end of the prompt.
User prompt:
Task Description
This is a challenge to test how good you are at doing machine learning on an unknown dataset. The data is an array of dimension (1000, 512, 2) which consists of 1000 samples, each of which contains two arrays of length 512, taking real values from 0 to 1. These two arrays are the x and y coordinates of 512 points in the plane. Some of these points form a shape, while the other points are just noise. There are 5 possible shapes in total, circle, square, triangle, pentagon and star. The shapes are always centered at (0.5, 0.5) and each shape always has the same size and orientation, although the points are randomly placed along the shape. The goal is to train a machine learning model to predict which one of the five shapes is present in the current sample.
You should thus train a machine learning model to take in 512 coordinate pairs, and predict what class it belongs to. Only one shape is present in each sample. Your goal is to predict the class of each sample in the test set to get the highest accuracy. The test set is the same size as the training set, and follows the same distribution.
Below is an example script showing how to load in the training data and output the predictions for the test data. Feel free to use any code you want from the example.
Here is the code example:
import torch
import numpy as np
# how to set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device {device}')
# loading data
train_data = np.load('data/train_data.npy')
train_labels = np.load('data/train_labels.npy').astype(np.int64)
test_data = np.load('data/test_data.npy')
print(f'Training data shape: {train_data.shape}') # (1000, 512, 2)
print(f'Training labels shape: {train_labels.shape}') # (1000,)
print(f'Test data shape: {test_data.shape}') # (1000, 512, 2)
# dummy predictions
test_preds = np.random.randint(0, 5, test_data.shape[0])
np.savetxt('results/test_preds.txt', test_preds, fmt='%d')
Execution Environment
The task involves writing a single Python script that will be evaluated on a remote server.
Submission Server (where official scored runs execute):
Single NVIDIA TITAN V GPU (12GB VRAM)
Strict 2-minute time limit - the code must complete within this limit or it fails
No internet access during execution
No state is saved between submissions - each run starts completely fresh
Terminal output is captured and shown to the researcher (truncated to 8000 characters if longer)
After each submission, the score (accuracy or task-specific metric) is reported
Local Development Machine (where the researcher develops and tests):
Same GPU: NVIDIA TITAN V (12GB VRAM)
No time limit - the researcher can run experiments as long as needed
Same packages available, so local results should transfer to the submission server
Available Python Packages (no others can be installed):
Human Baseline Specification
You are estimating how long it would take a median human AI researcher to achieve different score thresholds on this task.
Researcher Profile:
The median AI/ML researcher
Familiar with standard ML libraries: PyTorch, scikit-learn, numpy, pandas, etc.
General ML knowledge but no specific prior experience with this exact task
Working alone
Available Tools and Resources:
No AI assistance: No code generation, no AI autocomplete, no AI chat, no agentic AI tools
IDE with standard autocomplete: Traditional code completion (e.g., VS Code IntelliSense, PyCharm)
Completes variable names, function names, method names based on scope
Shows function signatures and docstrings
Suggests imports based on installed packages
Internet access: Can search documentation, Stack Overflow, research papers, tutorials
Local development GPU: TITAN V (12GB) with NO time limit
The researcher can run experiments, debug, and iterate freely on their local machine
Same GPU model as the submission environment, so local results should transfer
Submission environment GPU: TITAN V (12GB) with 2-minute time limit
Official scored submissions run here with strict 2-minute timeout
Only 5 submission attempts allowed (see Submission Rules below)
Data Access:
Training data: Full access to all provided training files (labeled examples, unlabeled data if any)
Test data: The researcher has NO ACCESS to the test set - not even the inputs
The only way to evaluate on the test set is through official submissions
Submission Rules:
Maximum 5 official scored submissions to evaluate their solution
Each submission runs from scratch (no state saved between submissions)
Feedback after each submission: terminal output and accuracy/score achieved
The researcher's final score is the MAXIMUM score achieved across all submissions
The researcher should be strategic about when to use their limited submissions
Code Examples by Score Level
The following examples show code submissions that achieved different score levels. All of these examples were produced by various LLM models (not humans), but they serve to illustrate:
What kinds of approaches can work for this task
How difficult it is to achieve various score levels
What kinds of errors or challenges arise
The relationship between code complexity and achieved scores
One example was included per 10% score interval where available (9 examples for this task, ranging from 20.3% to 98.3% accuracy). Code and terminal output omitted to preserve benchmark integrity.
Note: No code examples were available for the 0-10% and 10-20% score intervals.
Time Estimation Request
Based on the task description and the code examples showing what different score levels look like, estimate how long it would take the median human AI researcher (as described above) to achieve each of the following score thresholds:
25% accuracy/score
50% accuracy/score
70% accuracy/score
90% accuracy/score
95% accuracy/score
Important notes:
Provide your estimates in whatever time unit feels most natural - specify the unit for each estimate
If a threshold seems impossible or extremely unlikely to achieve, estimate a very large amount of time and explain why in your reasoning
Consider the progression of difficulty - typically higher thresholds require more sophisticated approaches
The code examples are meant to calibrate your understanding of task difficulty, not as specific targets to replicate
Remember the researcher has only 5 submission attempts, so they need to be strategic
The researcher is allowed to manually inspect and label any unlabeled training data (e.g., if there is an unlabeled training set, they can hand-label it)
Please respond in the following JSON format. Note: provide the overall difficulty and key challenges FIRST, before the per-threshold estimates:
{
"overall_difficulty": "<easy/medium/hard/very_hard>",
"key_challenges": "<brief summary of the main challenges that make this task difficult>",
"estimates": {
"25%": {"reasoning": "<what approach would work and why it takes this long>", "value": <number>, "unit": "<time unit>"},
"50%": {"reasoning": "<what approach would work and why it takes this long>", "value": <number>, "unit": "<time unit>"},
"70%": {"reasoning": "<what approach would work and why it takes this long>", "value": <number>, "unit": "<time unit>"},
"90%": {"reasoning": "<what approach would work and why it takes this long>", "value": <number>, "unit": "<time unit>"},
"95%": {"reasoning": "<what approach would work and why it takes this long>", "value": <number>, "unit": "<time unit>"}
}
}
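The JSON responses are then parsed and the estimates converted to a common unit. A minimal sketch of such parsing (a hypothetical helper; the unit-to-hours conversion factors, e.g. an 8-hour working day, are assumptions rather than the exact values used in the repo):
import json

# assumed unit-to-hours conversion (working days/weeks/months are an assumption)
UNIT_HOURS = {"minutes": 1 / 60, "hours": 1.0, "days": 8.0, "weeks": 40.0, "months": 160.0}

def parse_estimates(response_text):
    # Parse an estimator LLM's JSON response into hours per threshold.
    data = json.loads(response_text)
    hours = {}
    for threshold, est in data["estimates"].items():
        unit = est["unit"].lower().rstrip("s") + "s"  # e.g. 'day' or 'days' -> 'days'
        hours[threshold] = est["value"] * UNIT_HOURS[unit]
    return hours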