This doesn’t contradict common sense if you remember that Claude Opus 4.5 has a 50%-time horizon of around 4 hours 49 minutes (95% confidence interval: 1 hour 49 minutes to 20 hours 25 minutes).
Just think about it: the interval runs from 1 hour 49 minutes all the way up to 20 hours 25 minutes, a spread of more than an order of magnitude. There simply isn’t enough data for reliable conclusions.
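To put a number on that spread, here is a one-line sanity check using only the two bounds quoted above:

```python
# Ratio between the upper and lower bounds of the 95% CI
# (1 h 49 min to 20 h 25 min, converted to minutes).
lower = 1 * 60 + 49   # 109 minutes
upper = 20 * 60 + 25  # 1225 minutes
print(f"The CI spans a factor of {upper / lower:.1f}x")  # ~11.2x
```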
If I want to know what level of task complexity I can hand an LLM and still expect a correct answer, horizon length looks like a good measure.
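For context on where such a number comes from: a 50% time horizon is typically estimated by fitting a logistic model of success probability against the log of the human completion time, then solving for the task length at which predicted success is 50%. Below is a minimal sketch of that procedure in the spirit of METR’s methodology; the task outcomes are invented purely for illustration.

```python
# Minimal sketch: estimate a 50% time horizon from per-task outcomes.
# Assumption: P(success) is logistic in log(human completion time),
# the usual model behind time-horizon estimates. The data below is
# made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480, 960, 1920])
success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log(minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, success)

# P(success) = 0.5 where coef * log(t) + intercept = 0,
# i.e. the horizon is exp(-intercept / coef).
horizon = np.exp(-clf.intercept_[0] / clf.coef_[0, 0])
print(f"Estimated 50% time horizon: {horizon:.0f} minutes")
```

Note that nothing in this fit yields a guarantee: by construction, the horizon is the task length at which the model fails half the time.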
We also need to challenge the assumption that a task that is hard for humans is equally hard for AI, and vice versa; that is, the assumption that human difficulty and AI difficulty are strongly correlated.
Consider a simple analogy. Suppose we want to determine the time horizon of a calculator on multiplication tasks. Let one number have n digits and the other have m digits (in base 10). For a human doing long multiplication, the effort is O(n·m) digit operations.
A calculator, however, will solve every such task correctly as long as the result does not overflow, because its success does not track computational complexity in the human sense, only the number of digits in the result.
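Here is the analogy as a toy model, assuming long multiplication costs a human roughly n·m digit operations and a hypothetical 10-digit calculator that is exact unless the result overflows its display:

```python
# Toy model: human effort vs. calculator success on multiplication.
# Assumptions: human cost ~ n*m digit operations; the calculator is
# exact unless the product exceeds a (hypothetical) 10-digit display.
import random

DISPLAY_DIGITS = 10

def human_effort(n: int, m: int) -> int:
    """Rough cost of long multiplication for n-digit by m-digit numbers."""
    return n * m

def calculator_succeeds(a: int, b: int) -> bool:
    """The calculator fails only when the product overflows the display."""
    return len(str(a * b)) <= DISPLAY_DIGITS

def random_with_digits(d: int) -> int:
    return random.randint(10 ** (d - 1), 10 ** d - 1)

# Hard for a human (effort 25), always fine for the calculator:
# a 5x5-digit product has at most 10 digits.
a, b = random_with_digits(5), random_with_digits(5)
print(human_effort(5, 5), calculator_succeeds(a, b))   # 25 True

# Easier for a human (effort 15), always fatal for the calculator:
# a 1x15-digit product has at least 15 digits.
a, b = random_with_digits(1), random_with_digits(15)
print(human_effort(1, 15), calculator_succeeds(a, b))  # 15 False
```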
This illustrates that human task difficulty and AI task difficulty can be fundamentally misaligned.