An interesting pattern I've now noticed across many different domains is that if we try to do an attribution of improvements in outcomes or performance in a domain to the two categories "we're using more resources now than in the past" and "we're making more efficient use of resources now than in the past", there is usually an even split in how much improvement can be attributed to each category.

Some examples:

In computer vision, Erdil and Besiroglu (2022) (my own paper) estimates that roughly 40% of performance improvements in computer vision from 2012 to 2022 were due to better algorithms, and 60% due to the scaling of compute and data.

In computer chess, a similar pattern seems to hold: roughly half of the progress in chess engine performance from Deep Blue to 2015 came from the scaling of compute, and half from better algorithms. Stockfish 8 running on 1997-era consumer hardware could achieve an Elo rating of ~ 3000, compared to ~ 2500 for the software of 1997; and Stockfish 8 on 2015 hardware could go up to ~ 3400.
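The attribution arithmetic behind this claim can be made explicit. A quick sketch using the Elo figures quoted above (the figures themselves are the rough estimates from the text, not precise measurements):

```python
# Attributing 1997 -> 2015 chess progress to software vs. hardware,
# using the approximate Elo ratings quoted above.
elo_old_sw_old_hw = 2500   # ~1997 software on ~1997 hardware
elo_new_sw_old_hw = 3000   # Stockfish 8 on 1997-era hardware
elo_new_sw_new_hw = 3400   # Stockfish 8 on 2015 hardware

software_gain = elo_new_sw_old_hw - elo_old_sw_old_hw  # 500 Elo
hardware_gain = elo_new_sw_new_hw - elo_new_sw_old_hw  # 400 Elo
total = software_gain + hardware_gain

# Shares of total progress: ~56% software, ~44% hardware
print(software_gain / total, hardware_gain / total)
```

So on these numbers the split is about 56/44, close to even.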

In rapidly growing economies, accounting for growth in output per worker by dividing it into capital per worker (resource scaling) and total factor productivity, or TFP (efficiency scaling, roughly speaking), often gives an even split: see Bosworth and Collins (2008) for data on China and India specifically.

More pessimistic estimates of the growth performance of China compared to official data put this split at 75% to 25% (see this post for details) but the two effects are still at least comparable.

## A toy model

A speculative explanation is the following: if we imagine that performance in some domain is measured by a multiplicative index $P$ which can be decomposed as the product of individual contributing factors $F_1, F_2, \ldots, F_n$, so that $P \propto \prod_{i=1}^n F_i$, in general we'll have

$$g_P = \frac{1}{P} \frac{dP}{dt} = \sum_{i=1}^n \frac{1}{F_i} \frac{dF_i}{dt} = \sum_{i=1}^n g_{F_i}$$

thanks to the product rule. Here $g_X$ denotes the growth rate of the variable $X$.
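This additivity is easy to verify numerically. A minimal sketch with two factors growing at arbitrary illustrative rates:

```python
import numpy as np

# Two factors growing exponentially at rates g1 and g2; the multiplicative
# index P = F1 * F2 should then grow at rate g1 + g2.
g1, g2 = 0.03, 0.05
t = np.linspace(0.0, 10.0, 1001)
P = np.exp(g1 * t) * np.exp(g2 * t)

# Numerical growth rate g_P = d(log P)/dt, which is constant here
g_P = np.gradient(np.log(P), t)
print(g_P.mean())  # ~0.08 = g1 + g2
```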

I now want to use a law of motion from Jones (1995) for the $F_i$: we assume they evolve over time according to

$$g_{F_i} = \frac{1}{F_i} \frac{dF_i}{dt} = \theta_i F_i^{-\beta_i} I_i^{\lambda_i}$$

where $\theta_i, \beta_i, \lambda_i > 0$ are parameters and $I_i$ is a measure of "investment input" into factor $i$. This general specification can capture diminishing returns to investment as we make progress or scale up resources (through $\beta_i$), and returns to scale from spending more resources on investment at a given time (through $\lambda_i$).

Substituting this into the growth expression for P gives

$$g_P = \frac{1}{P} \frac{dP}{dt} = \sum_{i=1}^n \theta_i F_i^{-\beta_i} I_i^{\lambda_i}$$

Now, suppose we have a fixed budget $I$ at any given time to allocate across all investments $I_i$, and our total budget $I$ grows over time at a rate $g$. To maximize the rate of progress at a given time, the marginal returns to investment across all factors should be equal, i.e. we should have

$$\frac{\partial}{\partial I_i} \left( \frac{1}{F_i} \frac{dF_i}{dt} \right) = \frac{\partial}{\partial I_j} \left( \frac{1}{F_j} \frac{dF_j}{dt} \right)$$

for all pairs $i, j$. Substitution gives

$$\theta_i \lambda_i F_i^{-\beta_i} I_i^{\lambda_i - 1} = \theta_j \lambda_j F_j^{-\beta_j} I_j^{\lambda_j - 1}$$

and upon simplification, we recover

$$\frac{\lambda_i g_{F_i}}{I_i} = \frac{\lambda_j g_{F_j}}{I_j}$$

In an equilibrium where all quantities grow exponentially, the ratios $I_i / I_j$ must therefore remain constant, i.e. all of the $I_i$ must also grow at the aggregate rate of input growth $g$. Then, since a constant $g_{F_i}$ requires $-\beta_i g_{F_i} + \lambda_i g = 0$, the Jones law of motion implies $g_{F_i} = g \lambda_i / \beta_i$ for each factor $i$, from which we get the important conclusion

$$g_{F_i} \propto \frac{\lambda_i}{\beta_i} = r_i$$

that must hold in an exponential growth equilibrium. The parameter $r_i$ is often called the returns to investment, so this relation says that distinct factors account for growth in $P$ in proportion to their returns to investment parameter.
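As a sanity check on the equilibrium analysis, here is a small numerical sketch (all parameter values are illustrative choices, not estimates from any dataset) that simulates the Jones law of motion for two factors, splits a growing budget to equalize marginal returns at each step, and compares the realized long-run growth rates to the predicted $g \lambda_i / \beta_i$:

```python
import numpy as np

# Two factors follow dF_i/dt = theta_i * F_i^(1 - beta_i) * I_i^lambda_i.
# The total budget I(t) = exp(g*t) is split so marginal returns
# theta_i * lambda_i * F_i^(-beta_i) * I_i^(lambda_i - 1) are equalized.
theta = np.array([0.05, 0.05])
beta = np.array([1.0, 0.5])
lam = np.array([0.5, 0.5])   # equal lambdas give a closed-form split
g = 0.02                     # aggregate budget growth rate
dt, T = 0.1, 2000.0

F = np.array([1.0, 1.0])
log_F_checkpoint = None
for n in range(int(T / dt)):
    t = n * dt
    I_total = np.exp(g * t)
    # With lam1 = lam2 = 1/2, equal marginal returns imply
    # I1/I2 = (theta1 * F1^(-beta1) / (theta2 * F2^(-beta2)))^2.
    ratio = (theta[0] * F[0] ** -beta[0] / (theta[1] * F[1] ** -beta[1])) ** 2
    I1 = I_total * ratio / (1 + ratio)
    I = np.array([I1, I_total - I1])
    F = F + theta * F ** (1 - beta) * I ** lam * dt  # Euler step
    if abs(t - (T - 500.0)) < dt / 2:
        log_F_checkpoint = np.log(F)  # snapshot at t = 1500

# Realized growth rates over the final 500 time units vs. prediction
growth = (np.log(F) - log_F_checkpoint) / 500.0
predicted = g * lam / beta   # [0.01, 0.02]
print(growth, predicted)
```

The simulated growth rates converge to the predicted $g \lambda_i / \beta_i$ regardless of the initial conditions, as the equilibrium analysis suggests.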

## How do we interpret the data in light of the toy model?

If we simplify the setup and make it about two factors, one measuring resource use and the other measuring efficiency, then the fact that the two factors account for comparable fractions of overall progress should mean that their associated return parameters $r_i$ are close. This prediction is validated in some domains:

A linearly accumulating factor that follows $dF/dt \propto I$ has $r = 1$, and if this factor is raised to a power $\alpha$, its returns become $r = \alpha$.
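To see why, write the linear accumulation rule in the form of the Jones law above:

```latex
\frac{dF}{dt} \propto I
\quad\implies\quad
g_F = \frac{1}{F}\frac{dF}{dt} \propto F^{-1} I^{1},
```

so $\beta = 1$, $\lambda = 1$, and $r = \lambda / \beta = 1$. If instead we track $G = F^\alpha$, then $g_G = \alpha g_F \propto \alpha F^{-1} I = \alpha G^{-1/\alpha} I$, so $\beta = 1/\alpha$, $\lambda = 1$, and $r = \alpha$.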

Since the capital share of the US economy is around 30%, meaning roughly that total output scales with the capital stock raised to a power of 0.3, and capital is thought to accumulate linearly, the parity hypothesis predicts $r = 0.3$ for US total factor productivity. Bloom et al. (2020) validates this prediction almost exactly.

When we focus on niche problems, we must be careful: some efficiency gains (e.g. Moore's law) are driven by worldwide investment that's not optimized for our niche application. Such gains are better modeled as an exogenous progress term, and it's then better to compare the endogenous terms with each other instead.

In machine learning, compute scaling has been much faster than Moore's law, so we expect our model to be a good approximation in this case. If we assume spending on compute accumulates linearly, so that $r_{\text{compute}} = 1$, the parity results suggest we should also expect $r_{\text{software}} \approx 1$. I have some unpublished estimates that support something like this, but a public reference is Tom Davidson's takeoff model, where he estimates $r \approx 2.5$ in computer vision. Not quite 1, but not too far off either.

I think the toy model generally suggests that $r_{\text{efficiency}}$ tends to be $O(1)$. We have no good a priori reason to expect this, but in fact, it seems to be the case. From one perspective, this is really another way of phrasing the parity hypothesis, but I think it's one that might be useful in thinking about the phenomenon.

Of course, there are other possible explanations for this "coincidence", but none of them seem completely satisfactory, which is rather disappointing.
