Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks

shash42

This post makes a simple point, so it will be short. I am happy to discuss more in the comments and, based on that, write a longer post later. Much prior work (e.g. [1]) has shown that exponentially more data and compute are required for each unit improvement in accuracy. This leads to a popular argument:

Scaling compute and data is economically not viable above a threshold

This argument has a key issue. Log-linear gains in accuracy (or loss) are shown on "one-step" benchmarks such as direct QA. However, economic benefits primarily arise from tasks that require long thinking, output, and action horizons. For this, I propose tracking the k-step success rate, i.e. the probability that a k-step solution achieves its goal. Most benchmarks previously analyzed for log-linear scaling have at best a few (small constant) steps. For simplicity, let us assume this constant is 1; this does not affect the rest of the argument (based on asymptotic calculations in k) beyond a constant factor.

We can make two (simplifying) assumptions to connect log-linear scaling in 1-step accuracy to k-step success rate. (1) Each step has an independent probability of succeeding, related to the 1-step accuracy by a constant. (2) All steps need to be correct for the full k-step solution to succeed. This is true for many useful tasks, such as solving math problems or an agent executing a task on the web. While recovering from failures is possible, the steps that follow a failure might violate Assumption (1), so it's best to leave recovery out for now. The main mathematical argument is as follows:

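Let us define the 1-step accuracy as $x$, so that under the assumptions above the k-step success rate is $x^{k}$. A small increase in 1-step accuracy from $x$ to $x + \delta$ (for example one percentage point, $\delta = 0.01$) improves the k-step success rate by $(x + \delta)^{k} - x^{k}$. The absolute improvement is ~ $k x^{k - 1} \delta$, by taking the derivative of $f (x) = x^{k}$. The relative improvement is ~ $\frac{k \delta}{x}$ by the binomial approximation, valid while $k \delta / x$ is small; exactly, the success rate is multiplied by $(1 + \delta / x)^{k}$. Thus the same gain in 1-step accuracy compounds with the horizon length: the multiplicative improvement in k-step success rate grows exponentially in the task horizon length $k$.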
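To make the compounding concrete, here is a minimal Python sketch under the two assumptions above (independent steps, all of which must succeed). The function names and the example values (1-step accuracy 0.90, a gain of 0.01) are illustrative choices, not numbers from any benchmark.

```python
def k_step_success(x: float, k: int) -> float:
    """Probability that all k independent steps succeed, each with probability x."""
    return x ** k


def improvement(x: float, delta: float, k: int) -> tuple[float, float]:
    """Return (absolute gain, multiplicative ratio) in k-step success
    when the 1-step accuracy rises from x to x + delta."""
    before = k_step_success(x, k)
    after = k_step_success(x + delta, k)
    return after - before, after / before


if __name__ == "__main__":
    x, delta = 0.90, 0.01  # illustrative values only
    for k in (1, 10, 50, 100):
        abs_gain, ratio = improvement(x, delta, k)
        # The ratio equals (1 + delta / x) ** k, so it grows exponentially in k.
        print(f"k={k:4d}  success={k_step_success(x, k):.4g}  "
              f"abs_gain={abs_gain:.4g}  ratio={ratio:.4g}")
```

The absolute gain eventually shrinks for very large k because the base success rate $x^{k}$ itself becomes tiny, while the ratio keeps growing.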

As inference-compute starts outweighing pretraining compute and AI is used for longer-horizon tasks like automated coding, jobs and research, log-linear scaling of x with pre-training compute will be worth it as the k-step success rate will improve much faster for large k.

Further, we can use similar analysis to understand recent results that the length of tasks that AI can do is increasing exponentially with time. Thanks to @nostalgebraist for providing this math in the comment below.

log-linear scaling of x with pre-training compute will be worth it as the k-step success rate will improve near-linearly

I don't follow. The k-step success is polynomial in x, not exponential (it's $x^{k}$ , not $k^{x}$ ).

Although if we fix some cutoff $c$ for the k-step success probability, and then look at the value of k for which $x^{k} = c$ , then we get $k = \log (c) / \log (x) \sim 1 / \log (x)$ . This is super-linear in x over the interval from 0 to 1, so linearly growing improvements in x cause this "highest feasible k" to grow faster-than-linearly. (Is this what you meant? Note that this is similar to how METR computes time horizons.)
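A small Python sketch of this "highest feasible k", assuming independent steps with per-step success probability x. The function name `max_horizon` and the default cutoff of 0.5 are illustrative assumptions, not taken from METR's methodology.

```python
import math


def max_horizon(x: float, c: float = 0.5) -> float:
    """Value of k solving x**k == c, i.e. k = log(c) / log(x).

    Requires 0 < x < 1 and 0 < c < 1; both logs are then negative,
    so the result is positive.
    """
    if not (0.0 < x < 1.0 and 0.0 < c < 1.0):
        raise ValueError("x and c must lie strictly between 0 and 1")
    return math.log(c) / math.log(x)


if __name__ == "__main__":
    # Equal steps in x give increasingly large steps in the feasible horizon.
    for x in (0.70, 0.80, 0.90, 0.99):
        print(f"x={x:.2f}  highest feasible k ~ {max_horizon(x):.1f}")
```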

This might explain recent results that the length of tasks that AI can do is increasing linearly with time.

METR found that horizon lengths are growing exponentially in time, not linearly.

(One-step success probabilities have been growing at least linearly with time, I would think – due to super-linear growth in inputs like dataset size, etc. – so we should expect horizon lengths to grow super-linearly due to what I said in the previous paragraph.)

(N.B. I expect it will be easier to conduct this kind of analysis in terms of $\mathrm{logit} (x)$ instead of $x$ .)
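One way to see why the logit might be convenient, as a sketch under the same assumptions (independent steps, fixed cutoff $c$) and writing $z = \mathrm{logit} (x)$, a symbol introduced here only for illustration:

$$x = \frac{1}{1 + e^{-z}}, \qquad k = \frac{\log c}{\log x} = \frac{\log (1 / c)}{\log (1 + e^{-z})} \approx \log (1 / c) \, e^{z} \quad \text{for large } z.$$

Under these assumptions, linear growth in $\mathrm{logit} (x)$ would correspond to roughly exponential growth in the highest feasible k.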

Thanks! I fixed the last paragraph accordingly. I indeed wanted to say faster-than-linearly for the highest feasible k.