In February 2023, I wrote a post titled “GPT-4 Predictions”, which attempted to predict the properties and capabilities of OpenAI’s GPT-4 model using scaling laws and knowledge of past models such as GPT-3. Now that GPT-4 has been released, I’d like to evaluate those predictions.
Unfortunately, since the GPT-4 technical report has limited information on GPT-4’s training process and model properties, I can’t evaluate all the predictions. Nevertheless, I believe I can evaluate enough of them right now to yield useful insights.
GPT-4 release date
OpenAI released GPT-4 on 14 March 2023.
I mentioned in the post that Metaculus predicted a 50% chance of GPT-4 being released by May 2023, so I expected the model to arrive around the middle of the year. The model was therefore released earlier than I expected.
GPT-4 model properties
I predicted that GPT-4 would be a dense, text-only, transformer language model like GPT-3 trained using more compute and data with a similar number of parameters and a longer context window. OpenAI hasn’t yet published information such as the number of parameters in the model so I can’t evaluate these predictions yet.
My most obviously incorrect prediction was predicting that GPT-4 would be a text-only language model like GPT-3. Instead, GPT-4 is a multimodal model that accepts both text and images as inputs though it only outputs text.
Apart from that, I think my predictions about the model were mostly correct: GPT-4 is a pre-trained transformer language model trained using next-word prediction like its predecessors.
Number of GPUs used during training
Some people such as LawrenceC and gwern have noted in the post’s comments that GPT-4 was probably trained on 15,000 GPUs or more. Assuming this is true, my prediction that GPT-4 would be trained on 2,000 to 15,000 GPUs was an underestimate, and I may have underpredicted GPT-4’s total training compute by about a factor of 2.
The OpenAI GPT-4 video states that GPT-4 finished training in August 2022. Given that GPT-3.5 finished training in early 2022, this suggests that GPT-4 was trained for about 4-7 months. I originally predicted a training time of 1-6 months, which seems like an underprediction in retrospect.
GPT-4 MMLU performance
Fortunately, both my post and the GPT-4 technical report reference the MMLU benchmark. In the post, I predicted that GPT-4 would set a new record on MMLU: specifically, that it could achieve 79.4% accuracy given my prediction of the model’s loss, which is better than the previous record of 75.2% set by a fine-tuned version of PaLM.
GPT-4 in fact achieved 86.4% on the MMLU benchmark which is a new record and higher than I predicted. My prediction vs GPT-4’s actual accuracy on the MMLU benchmark is shown in the following graph. Note that since I don’t know GPT-4’s actual loss I used its predicted loss in the graph.
The percent error between my prediction and the true value is 8.1%, which seems like a fairly accurate prediction.
GPT-4 writing ability
Based on GPT-3’s improvement trend from the GPT-3 paper, I also predicted that human evaluators would only be able to distinguish model-generated text from human text about 50% of the time. In other words, I predicted that GPT-4’s text would be indistinguishable from human-written text.
From my personal experience, GPT-4-generated text seems indistinguishable from human-written text though there doesn't seem to be any quantitative evaluation of this metric for GPT-4 yet.
GPT-4 context length
Given that GPT-3 and GPT-3.5 had context lengths of 2048 and 4096 tokens respectively, my guess was that GPT-4 would have a context length of 8192 tokens.
According to the OpenAI API, one of the GPT-4 models does indeed have a context length of 8192 tokens. However, another GPT-4 model has a context length of 32,768 tokens, so my prediction was partially correct but underestimated the increase in context length.
Prediction framework
My predictions of GPT-4’s performance were based on the following assumptions:
- Model loss can be accurately estimated using scaling laws, given inputs such as the number of parameters in the model, the amount of training compute, and the amount of training data.
- There is a power law relationship between increases in these inputs and decreases in loss.
- Decreases in model loss are linearly correlated with improved performance as measured by benchmarks such as MMLU.
- GPT-4 includes no significant algorithmic advances that would significantly increase the model’s compute efficiency, data efficiency, or performance.
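The assumptions above can be sketched as a toy calculation. The power-law and linear constants below are purely illustrative assumptions chosen for this example, not the coefficients fitted in the original post:

```python
def predicted_loss(compute_flops, a=32.8, b=-0.048):
    # Assumptions 1-2: loss follows a power law in training compute.
    # a and b are illustrative constants, not real fitted values.
    return a * compute_flops ** b

def predicted_mmlu(loss, k=10.3):
    # Assumption 3: benchmark accuracy improves linearly as loss falls.
    return 100 - k * loss

loss = predicted_loss(2e25)  # hypothetical training-compute figure (FLOPs)
print(round(loss, 3))                  # → 2.002
print(round(predicted_mmlu(loss), 1))  # → 79.4
```

With constants tuned this way the toy pipeline reproduces the post's 79.4% MMLU prediction, but a real forecast would fit a, b, and k to measurements from smaller models.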
The prediction framework is summarized in this diagram:
Despite these simplifying assumptions, which could have limited the accuracy of the prediction model, I believe I was able to predict GPT-4's loss and some of its capabilities fairly accurately, given scaling laws derived from the behavior of smaller models and knowledge of the capabilities, model properties, and training process of GPT-3 and similar models.
Similarly, the GPT-4 technical report includes details on how OpenAI used smaller models to predict GPT-4’s performance:
"A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance."
Given that OpenAI has full access to all information about GPT-3 and GPT-4, their predictions were probably more accurate than mine.
Limitations of the framework
- I think the biggest limitation of the framework is its neglect of algorithmic advances such as the introduction of image inputs to the GPT-4 model. Not taking algorithmic advances into account could also explain why I underestimated GPT-4's performance improvement on the MMLU benchmark.
- Although the average capabilities of language models tend to scale smoothly given more resources, specific capabilities can improve abruptly because of emergence. A model that predicts linear improvement in a specific capability in the short term may therefore only be capturing a short tangent of a more complex non-linear curve. This suggests that predicting specific capabilities in the long term is significantly more difficult.
GPT-4 was released earlier than I expected and consequently, I published the “GPT-4 Predictions” post just a month before the release of GPT-4 which possibly limited its utility. Given that the post was mostly based on data from 2020 and 2021 on models such as GPT-3, I think I could have made the predictions much earlier without a significant loss of accuracy. For example, if I had written the post in early 2021 it would have been published 2 years before the release of GPT-4.
I focused on benchmarks such as MMLU but I can now see from the GPT-4 technical report that human tests such as the SAT are also useful for evaluating language models.
I didn’t make any predictions on the safety improvements of GPT-4 over GPT-3 and such predictions could have been insightful.
My predictions seem to be evidence that it's possible to use scaling laws and other predictable quantitative methods to predict the general performance of language models at least in the short term.
Given the increased effect of algorithmic advances on ML capabilities in the long term and the inherent unpredictability of scientific progress, I expect accurately predicting the capabilities of ML models in the long term (>5 years) to be much more challenging.
As far as I know, the GPT-4 Technical Report also evaluates GPT-3.5 on the MMLU benchmark for the first time (source).
This Anthropic paper notes that GPT-3's MMLU performance improves very slowly when the model is below 10B parameters and then more quickly above that threshold which is a non-linear relationship.
There is evidence showing that algorithmic progress increases predictably over time.
Given that we are at the top end of the logistic success curve (getting closer and closer to 100% rather than farther and farther from 0%), I think a more correct/fair/accurate way to assess this would be to look at the failure rate you predicted vs. the failure rate that actually happened. So, you predicted GPT-4 would get approximately 20% of MMLU wrong, whereas it actually got 13.6% wrong. So basically you predicted it would make 50% more errors than it did.
I still think you deserve some credit for making this prediction, but I wouldn't call it 'fairly accurate' and I definitely don't think "8.7% off!" is the right way to think about the diff.
At 86.4%, GPT-4's accuracy is now approaching 100% but GPT-3's accuracy, which was my prior, was only 43.9%. Obviously one would expect GPT-4's accuracy to be higher than GPT-3's since it wouldn't make sense for OpenAI to release a worse model but it wasn't clear ex-ante that GPT-4's accuracy would be near 100%.
I predicted that GPT-4's accuracy would fall short of 100% accuracy by 20.6% when the true value was 13.6%. Using this approach, the error would be (20.6 − 13.6) / 13.6 ≈ 0.51.
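This failure-rate version of the calculation can be sketched as:

```python
# Shortfalls from 100% MMLU accuracy, in percentage points.
predicted_shortfall, actual_shortfall = 20.6, 13.6

# Relative error on the failure rate rather than the success rate.
relative_error = (predicted_shortfall - actual_shortfall) / actual_shortfall
print(round(relative_error, 2))  # → 0.51
```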
Strictly speaking, the formula for percent error according to Wikipedia is the relative error expressed as a percentage: percent error = |actual − predicted| / |actual| × 100.
I think this is the correct formula to use because what I'm trying to measure is the deviation of the true value from the regression line (predicted value).
Using the formula, the percent error is (86.4 − 79.4) / 86.4 × 100 ≈ 8.1%.
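As a quick check of this arithmetic, using the values from the thread:

```python
predicted, actual = 79.4, 86.4  # predicted vs. actual MMLU accuracy (%)

# Percent error: relative error expressed as a percentage.
percent_error = abs(actual - predicted) / abs(actual) * 100
print(round(percent_error, 1))  # → 8.1
```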
I updated the post to use the term 'percent error' with a link to the Wikipedia page and a value of 8.1%.
Suppose you predicted 91% but the actual value was 99%. The percent error may only be about 8% but the likelihood of a wrong answer is 1/100 instead of your predicted 9/100, which is a huge difference.
You may be interested in the links in this post: https://www.lesswrong.com/posts/6Ltniokkr3qt7bzWw/log-odds-or-logits
In this case, the percent error is 8.1% and the absolute error is 8%. If one student gets 91% on a test and another gets 99% they both get an A so the difference doesn't seem large to me.
The article linked seems to be missing. Can you explain your point in more detail?
OK. Let's make it even more extreme. Suppose you take a commercial flight. The likelihood of dying in a crash is on the order of 1 in 10 million. From a percent error or absolute error perspective, 99.99999% isn't that different from 99% but that is the difference between one plane crash per year globally and a couple of dozen plane crashes per hour on average. These are wildly different in terms of acceptable safety.
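The gap described above is easier to see on a log-odds scale (a sketch; the probabilities are the illustrative ones from this thread, not real aviation statistics):

```python
import math

def log_odds(p):
    # logit(p) = log(p / (1 - p)): maps probabilities near 0 and 1 onto
    # an unbounded scale where small linear differences become visible.
    return math.log(p / (1 - p))

p_predicted = 0.99    # "99% safe" on a linear scale
p_actual = 0.9999999  # roughly a 1-in-10-million failure rate

# Close on a linear scale, far apart in log-odds.
print(round(log_odds(p_predicted), 1))  # → 4.6
print(round(log_odds(p_actual), 1))     # → 16.1
```

The two probabilities differ by less than 1% in absolute terms but by about 11.5 logits, reflecting the factor-of-100,000 difference in failure rates.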
There's a backup link in the comments: https://www.thejach.com/public/log-probability.pdf