Two scenarios:

- I take a vision or language model which was cutting edge in 2000, and run it with a similar amount of compute/data to what's typically used today.
- I take a modern vision or language model, calculate how much money it costs to train, estimate the amount of compute I could have bought for that much money in 2000, then train it with that much compute

In both cases, assume that the number of parameters is scaled to available compute as needed (if possible), and we generally adjust the code to reflect scalability requirements (while keeping the algorithm itself the same).

Which of the two would perform better?

CLARIFICATION: my goal here is to compare the relative importance of insights vs compute. "More compute is actually really important" is itself an insight, which is why the modern-algorithm scenario talks about compute cost, rather than amount of compute actually used in 2000. Likewise, for the 2000-algorithm scenario, it's important that the model only leverage insights which were already known in 2000.

LW linkpost here.