Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Forecasting ML Benchmarks in 2023


Large models: the largest version of Minerva is very large (540B parameters) and does 7% better than the 62B parameter model. It seems like it would be relatively expensive to continue improving performance solely by scaling up (but see below on undertraining).

This seems to be a silver lining - the comments on my posts about 10T Chinchilla models estimate these will not be economically feasible for at least 10 years. One thing I would like more insight into is if all the AI ASIC startups can be safely ignored. Hardware companies usually exaggerate to the point of mendacity, and so far their record has been a string of startups that NVIDIA has completely out-competed. But I would like to know how much probability I should have on some AI ASIC startup releasing something that makes training a 10-100T Chinchilla model feasible in less than 5 years.

Some of my "timelines" intuitions are coming from my perception of the rate of advancement in the last five years, but I am wondering if we've already eaten the easy scaling gains and things might be slightly slower for the next few years.

Hardware has a few dark horses: photonics, spiking neural nets on analogue hardware, and quantum computing are all paradigms that look great in theory and continue to fail in practice... but we know that these are the sorts of things which fail right up until they succeed, so who knows?

You should also consider that with the experience curves of tech & algorithms, and potential for much larger budgets, estimates like $20b (present-day costs) are not that absurd. (Consider that Elon Musk is currently on track to light a >$20b pile of cash on fire because he made an ill-considered impulse purchase online a few months ago. Not to be outdone, MBS continues to set $500b on fire by not building some city in the desert; he also announced $1b/year for longevity research, which should at least do a little bit better in terms of results.)

The big open question to me is, how much information is actually out there? I’ve heard a lot of speculation that text is probably information-densest, followed by photos, followed by videos. I haven’t heard anything about video games; I could see the argument being made that games are denser than text (since games frequently require navigating a dynamically adversarial environment). But I also don’t know that I’d expect the ~millions of existing games to be that independent of each other. (Being man-made, they’re a much less natural distribution than images, and they’ve all been generated for short-term human intelligence to solve.)

We also run into the data redundancy question: all the video on the internet contains one set of information, and all the text another set, but these sets have huge overlap. (This is why multimodal models with pretrained language backends are so weirdly data-efficient.) How much “extra” novelty exists in all the auxiliary sources of info?

My attempt at putting numbers on the total data out there, for those curious:

* 64,000 Weibo posts per minute x ~500k minutes per year x 10 years = ~3T tokens (implicitly assuming ~10 tokens per post). I’d guess there are at least 10 social media sites this size, but this is super-sensitive data sharded across competing actors, so unless it’s a CCP-led consortium I think upper-bounding this at 10T tokens seems reasonable.

* Let’s say that all of social media is about a tenth of the text contributed to the internet. Then Google’s scrape is ~300T, assuming the internet of the last decade is substantially larger than that of the preceding decades.

* 500 hours of video uploaded to YouTube every minute x ~500k minutes per year x 10 years x 3600 seconds per hour x (my guess of) 10 tokens per second = ~90T tokens on YouTube.

* O(100 million) books published ever x100,000 tokens per book = 10T tokens, roughly.

Let’s assume Chinchilla scaling laws still provide the correct total quantity of data needed. Chinchilla scaling laws suggest ~200T tokens for a 10T-param model, so this does indeed seem like it’s in-range for Google (due either to their scrape or to YouTube) and maybe a CCP-led consortium.
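As a sanity check on the arithmetic above, here is a minimal sketch that recomputes the estimates. The 10 tokens per Weibo post and the 20 tokens per parameter are not stated in the comment; they are assumptions backed out of the quoted totals (~3T tokens, and ~200T tokens for 10T parameters).

```python
# Back-of-envelope token counts using the figures quoted above.
MINUTES_PER_YEAR = 500_000   # rounded down from ~525,600, as in the text
YEARS = 10

# Weibo: assumption of ~10 tokens per post, implied by the ~3T total.
TOKENS_PER_POST = 10
weibo = 64_000 * MINUTES_PER_YEAR * YEARS * TOKENS_PER_POST

# YouTube: 500 hours uploaded per minute, guessed 10 tokens per second of video.
youtube = 500 * MINUTES_PER_YEAR * YEARS * 3600 * 10

# Books: ~100 million books at ~100,000 tokens each.
books = 100_000_000 * 100_000

# Chinchilla rule of thumb: assumption of ~20 tokens per parameter (200T / 10T).
TOKENS_PER_PARAM = 20
needed = 10e12 * TOKENS_PER_PARAM

rows = [("Weibo", weibo), ("YouTube", youtube), ("books", books),
        ("needed for 10T params", needed)]
for name, tokens in rows:
    print(f"{name:>22}: {tokens / 1e12:6.1f}T tokens")
```

Under these assumptions no single source other than a Google-scale scrape reaches the ~200T requirement on its own, which is the point the comment is making.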

(There is also obviously the possibility of collecting new data, or of building lots of simulators.)

(Unclear whether anyone else similarly scraped the internet, or whether enough of it is still intact for scrapers to go in afterward.)

180m books now

That's still just 20T tokens.

academic papers/theses are a few mill a year too

10M papers per year x 10,000 tokens per paper x 30 years = 3T tokens.

You raise the possibility that data quality might be important and that maybe "papers/theses" are higher quality than Chinchilla scaling laws identified on The Pile; I don't really have a good intuition here.

I spent a little while trying to find upload numbers for the other video platforms, to no avail. Per Wikipedia, Twitch is the 3rd largest worldwide video platform (though this doesn't count apps, esp. TikTok/Instagram). Twitch has an average of 100,000 streams going on at any given time x 3e8 tokens per video-year (x maybe 5 years) = 100T tokens, similar to YouTube. So this does convince me that there are probably a few more entities with this much video data.

I agree that if you put enough of these together, there are probably ~10 actors that can scrape together >200T tokens. This is important for coordination; it means the number of organizations that can play at this level will be heavily bottlenecked, potentially for years (until a bunch more data can be generated, which won't be free). It seems to me that these ~10 actors are large highly-legible entities that are already well-known to the US or Chinese governments. This could be a meaningful lever for mitigating the race-dynamic fear that "even if we don't do it, someone else will", reducing everything to a 2 party US-China negotiation.

The big problem is the cold war mentality is back, and both sides will compete a lot more rather than cooperate. Combine this with a bit of an arms race by China and the US, and the chances for cooperation on existential risk are remote.

This is a separate discussion, but it is important to point out that the literal Cold War had the opposing powers cooperate on existential risk reduction. Granted that before that, two cities were burned to ash and we played apocalypse chicken in Cuba.

Two more points:

- The specific upper bound *does* matter if we’re worried about superintelligence. If easy-to-get data instead capped out at 10 quadrillion tokens, it’d be easy to blow past 10T-param models; if we conveniently threshold around human-level params, we might be more likely to be dealing with “fast parallel Von Neumanns” than a basilisk, at least initially.
- Just to register a prediction: I would be very surprised if photos have anywhere near as much information content as text/video, given their relative lack of long-term causal structure.

In short, while concerted effort could plausibly give us human intelligence, it is likely not to go superhuman and FOOM.

I wouldn’t go that far; using these systems to do recursive self-improvement via different learning paradigms (e.g. by designing simulators) could still get FOOM; it just seems less likely to me to happen by accident in the ordinary course of SSL training.

93% in 2025 FEELS high, but ... Metaculus was already low for 2023: median 83%, but GPT-4 scores 86.4%. If you plot the gap to 100% on MMLU against SOTA training compute in FLOPs (e.g. RoBERTa with 1.32*10^21 FLOPs scores 27.9%, so a 72.1% gap; GPT-3.5 @ 3.14*10^23 = 30% gap; GPT-4 @ ~1.742*10^25 = 13.6% gap), it should take roughly 41x the training compute of GPT-4 to achieve 93.1% ... so it totally checks.

(My estimate for GPT-4 compute is based on the 1 trillion parameter leak, the approximate number of V100 GPUs they have (they didn't have A100s, let alone H100s, in hand during the GPT-4 training interval), the possible range of training intervals, scaling laws, the training time=8TP/nx law, etc., and I ran some squiggles. It should be taken with a grain of salt, but the final number doesn't change in very meaningful ways for any reasonable assumptions; e.g., it might take 20x or 80x, but it's not going to take 500x the training compute to get to 93%.)
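One way to reproduce this kind of extrapolation is a straight-line fit in log-log space through the three quoted points. The comment does not say which fitting method was used, so the ordinary least-squares fit below is an assumption, and it should not be expected to return exactly 41x; it is a sketch of the technique, not the commenter's calculation.

```python
import math

# (training compute in FLOPs, gap to 100% on MMLU in points), as quoted above.
points = [
    (1.32e21, 72.1),    # RoBERTa
    (3.14e23, 30.0),    # GPT-3.5
    (1.742e25, 13.6),   # GPT-4 (the commenter's own compute estimate)
]

xs = [math.log10(compute) for compute, _ in points]
ys = [math.log10(gap) for _, gap in points]
n = len(points)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Ordinary least squares for log10(gap) = intercept + slope * log10(compute).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Solve the fitted line for the compute at which the gap is 100 - 93.1 points.
target_gap = 100.0 - 93.1
log_compute_needed = (math.log10(target_gap) - intercept) / slope
multiplier = 10 ** log_compute_needed / points[-1][0]

print(f"fitted slope: {slope:.3f}")
print(f"compute multiplier relative to GPT-4: {multiplier:.0f}x")
```

With only three points the multiplier is sensitive to the fitting choice (for example, fitting all three points versus extending the line through the last two), which is consistent with the comment's own 20x to 80x uncertainty range.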

Will we ever get out of the GLUT era of ML? They say jobs like physicians and lawyers can get replaced by ML. Seems like that's because of the GLUT nature of those jobs, rather than about ML itself.

I guess a better question is at what point the model is no longer considered a GLUT. Is intelligence ultimately a form of GLUT? Let's say you figured out the gravitational constant. Well, good for you. The process of deduction is basically looking for patterns in a lookup table, right? You do a bunch of experiments that try to cover all edge cases, then you draw some generality out of your process. Without the generalization, you'd have to always go back to the lookup table, but now you have a nice little equation that you can just plug the values into. No more GLUT, woohoo! Is this intelligence?

The generalization (coming up with algorithms, proofs, equations, etc) is just part of a process that lets you use knowledge in a simplified form as you build on it. The knowledge we build becomes the intermediate lookup table, replacing the older, less efficient way of obtaining the same information. You can use the fact that force = mass * acceleration, or some constant equivalent of that equation, without actually knowing the equation yourself; you are just less efficient at using it than in its simplest form. The people who generalized this knowledge aren't special or anything. It's just that they happened to be the ones doing it instead of someone else. Other people might be doing something else. The work is so rewarding; that's why we encourage people to pursue it, and you don't even need much monetary or other incentive for people who are capable of this type of work to want to do it. Monetary incentives tend to favor the needs of the population, and most people don't really have any immediate need for this type of work. They usually enjoy the distant derivatives of this knowledge that they can benefit from.

So to build an AI, if you hard code the math symbols and their operations, is that any different from the type of model representation a model would learn unsupervised in an open environment? Such models don't even have to operate on a world model; they can be strictly confined to the math space. Seems like that's what MATH learners are doing. These machines probably will never derive equations like F=m*a, but they will have some internal representation of it in their models if they are to do well on the MATH dataset or the like.

If we build a neural net that itself has built a computer/neural net, would that be a 3 layer deep recursion?

Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post.

Previously, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask) datasets. I argued that most people, including myself, significantly underestimated the rate of progress, and encouraged ML researchers to make forecasts for the next year in order to become more calibrated.

In that spirit, I’ll offer my own forecasts for state-of-the-art performance on MATH and MMLU. Following the corresponding Metaculus questions, I’ll forecast accuracy as of June 30, 2023. My forecasts are based on a one-hour exercise I performed with my research group, where we brainstormed considerations, looked up relevant information, formed initial forecasts, discussed, and then made updated forecasts. It was fairly easy to devote one group meeting to this, and I’d encourage other research groups to do the same.

Below, I’ll describe my reasoning for the MATH and MMLU forecasts in turn. I’ll review relevant background info, describe the key considerations we brainstormed, analyze those considerations, and then give my bottom-line forecast.

## MATH

## Background

Metaculus does a good job of describing the MATH dataset and corresponding forecasting question:

It’s perhaps a bit sketchy for me to be both making and resolving the forecast, but I expect in most cases the answer will be unambiguous.

## Key Considerations

Below I list key considerations generated during our brainstorming:

## Analyzing Key Considerations

## Why did Minerva do well? How much low-hanging fruit is there?

Minerva incorporated several changes that improved performance relative to previous attempts:

Other low-hanging fruit:

I’d guess that corresponds to about a 4% improvement in the case of the MATH dataset (since making the model 8.7x bigger was a 7% improvement).

Overall summary: the lowest-hanging fruit towards further improvement would be (in order):

Aggregating these, it feels easy to imagine a >14% improvement, fairly plausible to get >21%, and >28% doesn’t seem out of the question. Concretely, conditional on Google or some other large organization deciding to try to further improve MATH performance, my prediction of how much they would improve it in the next year would be:

(This prediction is specifically using the "how much low-hanging fruit" frame. I'll also consider other perspectives, like trend lines, and average with these other perspectives when making a final forecast.)

## What kinds of errors is Minerva making? Do they seem easy or hard to fix?

As noted above, the 62B parameter model has best-of-256 performance (filtered for correct reasoning) of at least 68%. My guess is that the true best-of-256 performance is in the low-to-mid 70s for 62B. Since Minerva-540B is 7% better than Minerva-62B, the model is at least capable of generating the correct answer around 80% of the time.

We can also look at errors by type of error. For instance, we estimated that calculation errors accounted for around 30% of the remaining errors (or around 15% absolute performance). These are probably fairly easy to fix.

In the other direction, the remaining MATH questions are harder than the ones that Minerva solves currently. I couldn’t find results grouped by difficulty, but Figure 4 of the Minerva paper shows lower accuracy for harder subtopics such as Intermediate Algebra.

## How much additional data could be generated for training?

We estimated that using all of arXiv would only generate about 10B words of mathematical content, compared to the 20B tokens used in Minerva. At a conversion rate of 2 tokens/word, this suggests that Minerva is already using up most relevant content on arXiv. I’d similarly guess that Minerva makes use of most math-focused web pages currently on the internet (it looks for everything with MathJax). I’d guess it’s possible to find more (e.g. math textbooks) as well as to synthetically generate mathematical exposition, and probably also to clean the existing data better. But overall I’d guess there aren’t huge remaining gains here.

## Could other methods improve mathematical reasoning?

For math specifically, it’s possible to use calculators and verifiers, which aren’t used by Minerva but could further improve performance. Table 9 of the PaLM paper shows that giving PaLM a calculator led to a 4% increase in performance on GSM8K (much smaller than the gains from chain-of-thought prompting).

In the same table, we see that GPT-3 gets a 20% gain using a task-specific verifier. Given that the MATH problems are fairly diverse compared to GSM8K, I doubt it will be easy to write an effective verifier for that domain, and it’s unclear whether researchers will seriously try in the next year. The calculator seems more straightforward and I’d give a ~50% chance that someone tries it (conditional on there being at least one industry lab paper that focuses on math in the next year).

## Historical Rate of Progress on MATH

This is a roughly 2.9% accuracy gain per month (but almost certainly will be slower in future). Taking this extrapolation literally would give 85.1% for 06/30/2023.

## Historical Rate of Progress on Other Datasets

The Dynabench paper plots historical progress on a number of ML datasets, normalized by baseline and ceiling performance (see Figure 1, reproduced below).

We seem to often see immediate huge gains, while the next ones are somewhat slower.

Here’s another benchmark for reference. It got 67% -> 86% within 1-2 months, then took 4 months to break 90%.

Overall, it seems clear we should expect some sort of slow-down. In some cases, the slow-down was huge. I think progress should not slow down that much in this case since there’s still lots of low-hanging fruit. Maybe progress is 60% as fast as before? So that would give us 71% on 06/30/2023.
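The two trend-line numbers (85.1% and 71%) can be tied together with a short calculation. This is a sketch under two assumptions that are not stated in the text: that the extrapolation window is 12 months, and that the starting accuracy is whatever value makes the literal extrapolation come out to 85.1%.

```python
MONTHLY_GAIN = 2.9        # points per month, the historical rate quoted above
LITERAL_FORECAST = 85.1   # literal extrapolation to 06/30/2023
MONTHS = 12               # assumption: length of the forecast window

# Starting accuracy implied by the two quoted numbers (an inference, not a quote).
baseline = LITERAL_FORECAST - MONTHLY_GAIN * MONTHS


def extrapolate(pace: float) -> float:
    """Forecast accuracy if progress continues at `pace` times the historical rate."""
    return baseline + pace * MONTHLY_GAIN * MONTHS


print(f"implied baseline: {baseline:.1f}%")
print(f"same pace:        {extrapolate(1.0):.1f}%")
print(f"60% of the pace:  {extrapolate(0.6):.1f}%")
```

The 60% scenario lands at roughly 71%, matching the figure above; changing `pace` shows how sensitive the forecast is to the assumed slow-down.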

## How much will people work on improving MATH performance?

Two sources of progress:

How many language papers have been released historically?

(This only counts language models that achieved broad state-of-the-art performance. E.g. I'm ignoring OPT, BLOOM, GPT-J, etc.)

By this count, there have been 6 papers since the beginning of 2019, for a base rate of around 1.7 / year. If we model this as a Poisson process, it predicts that we will see 0 new papers with probability 18%, 1 with probability 31%, 2 with probability 26%, and >2 with probability 25%.

What about math-specific work? Harder to measure what “counts” (lots of math papers but how many are large-scale / pushing state-of-the-art?). Intuitively I’d expect more like 1.1 such papers per year. So around 33% chance of zero, 37% chance of 1, 20% chance of 2, 10% chance of >2.

An important special case is if there are no developments on either the language models or the math-specific front. Under the above model these have probabilities 18% and 33%, and are probably positively correlated. Additionally, it's possible that language model papers might not bother to evaluate on MATH or might not use all the ideas in the Minerva paper (and thus fail to hit SOTA). Combining these considerations, I’d forecast around a 12% chance that there is no significant progress on MATH on any front.
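The Poisson probabilities above can be checked directly. The sketch below also brackets the chance of no progress on either front: the product of the two zero-paper probabilities if the events were independent, and the smaller of the two if they were perfectly correlated. Using those two cases as bounds is an illustrative framing added here, not part of the original forecast.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events in one year at the given yearly rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def summarize(rate: float) -> dict:
    """Probabilities of 0, 1, 2, and more than 2 papers in a year."""
    p0, p1, p2 = (poisson_pmf(k, rate) for k in range(3))
    return {"0": p0, "1": p1, "2": p2, ">2": 1.0 - (p0 + p1 + p2)}


language_models = summarize(1.7)   # general language model papers
math_specific = summarize(1.1)     # math-specific papers

# Bounds on P(no papers on either front) under two extreme dependence assumptions.
independent = language_models["0"] * math_specific["0"]
fully_correlated = min(language_models["0"], math_specific["0"])

for label, dist in [("language models", language_models), ("math-specific", math_specific)]:
    print(label, {k: f"{v:.0%}" for k, v in dist.items()})
print(f"no progress on either front: {independent:.0%} to {fully_correlated:.0%}")
```

The 12% forecast sits between the two bounds, which fits the reasoning that the two events are positively but not perfectly correlated.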

## Bottom-Line Forecast

From the above lines of reasoning, we have a few different angles on the problem:

If I intuitively combine these, I produce the following forecast:

The Metaculus community is at 74 median, upper 75% of 83. So I’ll adjust up slightly more. New forecast adjusted towards community prediction:

Rough approximation of this distribution on Metaculus (red is me, green is the community prediction):

Interestingly, Hypermind forecasts a much smaller median of 64.1%.

## MMLU Forecast

## Background

Again borrowing from Metaculus:

## Key Considerations

At a high level, these are fairly similar to those of the MATH dataset. Since more people have worked on MMLU and there’s been steadier progress, we rely more on base rates and less on detailed considerations of how one could improve it further.

[Same as previous consideration for MATH]

## Analyzing Key Considerations

## Historical Rate of Progress on MMLU

Below is a time series of MMLU results, taken from the MMLU leaderboard (note MMLU was published in Jan. 2021). I've bolded few-shot/zero-shot results.

Models in the time series: Chinchilla (70B, few-shot), Gopher (280B, few-shot), UnifiedQA, GPT-3 (175B, few-shot), GPT-2.

If we restrict to few-shot results, we see:

It's not clear which time horizon is best to use here. I came up with an approximate base rate of 1.2 pts / month.

Other notes:

## Historical Rate of Progress on Other Datasets

We analyzed this already in the previous section on MATH. It seems like there's usually an initial period of rapid progress, followed by a slow-down. However, MMLU has had enough attempts that I’d say it’s past the “huge initial gains” stage. Therefore, I don’t expect as much of a level-off compared to MATH, even though there is less obvious low-hanging fruit; maybe we'll get progress 75% as fast as before. This would suggest +10.8 points over the next year.

## Combining Chinchilla and Minerva

The current SOTA of 67.5 comes from Chinchilla. But Minerva does much better than Chinchilla on the MMLU-STEM subset of MMLU. Here’s a rough calculation of how much taking max(Chinchilla, Minerva) would improve things:

So adding in Minerva would add (75% - 54.9%) * 19/57 = 6.7% points of accuracy.
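The calculation above can be written out as a short sketch. Reading 54.9% as Chinchilla's MMLU-STEM accuracy and 75% as Minerva's is an inference from the max(Chinchilla, Minerva) framing, and the 19/57 factor treats each of the 57 MMLU tasks as equally weighted, with 19 of them counted as STEM.

```python
CHINCHILLA_OVERALL = 67.5   # current SOTA on MMLU, from Chinchilla
CHINCHILLA_STEM = 54.9      # inferred: Chinchilla accuracy on MMLU-STEM
MINERVA_STEM = 75.0         # inferred: Minerva accuracy on MMLU-STEM
STEM_TASKS, TOTAL_TASKS = 19, 57

# Swap in Minerva's accuracy on the STEM tasks, leaving the other tasks unchanged.
gain = (MINERVA_STEM - CHINCHILLA_STEM) * STEM_TASKS / TOTAL_TASKS
combined = CHINCHILLA_OVERALL + gain

print(f"gain from Minerva on STEM: {gain:.1f} points")
print(f"combined MMLU accuracy:    {combined:.1f}%")
```

The combined figure of about 74.2% is the starting point used in the bottom-line MMLU forecast.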

Will this happen? It's not obvious, since PaLM is owned by Google and Chinchilla is owned by DeepMind. At least one org would need to train a new model. I think there’s a good chance this happens, but not certain (~65% probability).

## Other Low-Hanging Fruit

Result of a quick brainstorm:

In addition, the STEM-specific improvements (e.g. Minerva) will continue to improve MMLU-STEM. Based on the MATH forecast above, on median I expect about half as much improvement over the next year as we saw from the Minerva paper, or around another 3% improvement on MMLU overall (since Minerva gave a 6.7% improvement).

We thought it was possible but unlikely that there are significant advances in general knowledge retrieval in the next year that also get used by MMLU (~20% probability).

## How much will people work on improving MMLU performance?

Unlike MATH, there is nothing “special” that makes MMLU stand out from other language modeling benchmarks. So I’d guess most gains will come from general-purpose improvements to language models, plus a bit of STEM-specific improvement if people focus on quantitative reasoning.

## Bottom-Line Forecast

In some sense, MMLU performance is already “at” 74.2% because of the Minerva result. Additional low-hanging fruit would push us up another 5 points to 79.2%. Alternately, simply extrapolating historical progress would suggest 10.8 points of improvement, or 85%. Putting these together, I’d be inclined towards a median of 83%.

If we instead say that progress doesn’t slow down at all, we’d get 89%.
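The three scenarios in this section follow from the same few inputs. The sketch below recomputes them, assuming a 12-month horizon and taking 74.2% (Chinchilla plus Minerva on STEM) as the starting point for each.

```python
BASE = 74.2        # Chinchilla plus Minerva on the STEM subset
BASE_RATE = 1.2    # points per month, historical few-shot base rate
MONTHS = 12        # assumption: one-year forecast horizon

scenarios = {
    "low-hanging fruit only (+5)": BASE + 5.0,
    "75% of historical pace": BASE + 0.75 * BASE_RATE * MONTHS,
    "no slow-down": BASE + BASE_RATE * MONTHS,
}

for name, accuracy in scenarios.items():
    print(f"{name:>28}: {accuracy:.1f}%")
```

The 83% median falls between the first two scenarios, weighted toward the trend-line estimate.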

As before, I’d give an 18% chance of no new SOTA language model papers, in which case MMLU performance likely stays between 67.5% and 74.2%. This also means we should adjust the previous numbers down a bit.

Overall forecast:

This seems pretty similar to the Metaculus community prediction, so I won’t do any further adjustment.

Interestingly, the Hypermind median is only at 72.5% right now. Given the ability to combine Minerva + Chinchilla, this intuitively seems too low to me.

## Looking Ahead

My personal forecasts ended up being pretty similar to the Metaculus community forecasts, aside from me expecting slightly slower MATH progress (but only by about a percentage point). So, we can ask what Metaculus expects for 2024 and 2025 as well, as an approximation to what I "would" believe if I thought about it more.

MATH forecast (community prediction in green, top row of each cell):

MMLU forecast (community prediction in green):

So, on median Metaculus expects MATH to be at 83% in 2024 and at 88% in 2025. It expects MMLU to be at 88% in 2024 and at 93% (!) in 2025. The last one is particularly interesting: since MMLU tests domain-specific subject knowledge across many areas, it is predicting that a single model will be able to match domain-specific expert performance across a wide variety of written subject exams.

Do you agree with these forecasts? Disagree? I strongly encourage you to leave your own forecasts on Metaculus: here for MATH, and here for MMLU.