In February 2023, I wrote a post titled “GPT-4 Predictions”, which attempted to predict the properties and capabilities of OpenAI’s GPT-4 model using scaling laws and knowledge of past models such as GPT-3. Now that GPT-4 has been released, I’d like to evaluate those predictions.
Unfortunately, since the GPT-4 technical report has limited information on GPT-4’s training process and model properties, I can’t evaluate all the predictions. Nevertheless, I believe I can evaluate enough of them right now to yield useful insights.
GPT-4 release date
OpenAI released GPT-4 on 14 March 2023.
I mentioned in the post that Metaculus predicted a 50% chance of GPT-4 being released by May 2023, so I expected the model to arrive around the middle of the year. The model was therefore released earlier than I expected.
GPT-4 model properties
I predicted that GPT-4 would be a dense, text-only, transformer language model like GPT-3 trained using more compute and data with a similar number of parameters and a longer context window. OpenAI hasn’t yet published information such as the number of parameters in the model so I can’t evaluate these predictions yet.
My most obviously incorrect prediction was predicting that GPT-4 would be a text-only language model like GPT-3. Instead, GPT-4 is a multimodal model that accepts both text and images as inputs though it only outputs text.
Apart from that, I think my predictions about the model were mostly correct: GPT-4 is a pre-trained transformer language model trained using next-word prediction like its predecessors.
Number of GPUs used during training
Some people such as LawrenceC and gwern have noted in the post’s comments that GPT-4 was probably trained on 15,000 GPUs or more. Assuming this is true, my prediction that GPT-4 would be trained on 2,000 to 15,000 GPUs was an underestimate, and I may have underpredicted GPT-4’s total training compute by about a factor of 2.
The OpenAI GPT-4 video states that GPT-4 finished training in August 2022. Given that GPT-3.5 finished training in early 2022, this suggests that GPT-4 was trained for about 4-7 months. I originally predicted a training time of 1-6 months, which seems like an underprediction in retrospect.
GPT-4 MMLU performance
Fortunately, both my post and the GPT-4 technical report reference the MMLU benchmark. In the post, I predicted that GPT-4 would set a new record on MMLU: specifically, that it could achieve 79.4% accuracy given my prediction of the model’s loss, which is better than the previous record of 75.2% set by a fine-tuned version of PaLM.
GPT-4 in fact achieved 86.4% on the MMLU benchmark which is a new record and higher than I predicted. My prediction vs GPT-4’s actual accuracy on the MMLU benchmark is shown in the following graph. Note that since I don’t know GPT-4’s actual loss I used its predicted loss in the graph.
The percent error between my prediction and the true value is 8.1%, which seems like a fairly accurate prediction.
GPT-4 writing ability
Based on GPT-3’s improvement trend from the GPT-3 paper, I also predicted that human evaluators would only be able to distinguish model-generated text from human text about 50% of the time. In other words, I predicted that GPT-4’s text would be indistinguishable from human-written text.
From my personal experience, GPT-4-generated text seems indistinguishable from human-written text though there doesn't seem to be any quantitative evaluation of this metric for GPT-4 yet.
GPT-4 context length
Given that GPT-3 and GPT-3.5 had context lengths of 2048 and 4096 tokens respectively, my guess was that GPT-4 would have a context length of 8192 tokens.
According to the OpenAI API, one of the GPT-4 models does indeed have a context length of 8192 tokens. However, another GPT-4 model has a context length of 32,768 tokens, so my prediction was partially correct but underestimated the increase in context length.
Prediction framework
My predictions of GPT-4’s performance were based on the following assumptions:
- Model loss can be accurately estimated using scaling laws, given inputs such as the number of parameters in the model, the amount of training compute, and the amount of training data.
- There is a power law relationship between increases in these inputs and decreases in loss.
- Decreases in model loss are linearly correlated with improved performance as measured by benchmarks such as MMLU.
- GPT-4 includes no significant algorithmic advances that would significantly increase the model’s compute efficiency, data efficiency, or performance.
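The assumptions above can be sketched as a toy calculation. The power-law and linear constants below are purely illustrative assumptions chosen for this example, not the coefficients fitted in the original post:

```python
def predicted_loss(compute_flops, a=32.8, b=-0.048):
    # Assumptions 1-2: loss follows a power law in training compute.
    # a and b are illustrative constants, not real fitted values.
    return a * compute_flops ** b

def predicted_mmlu(loss, k=10.3):
    # Assumption 3: benchmark accuracy improves linearly as loss falls.
    return 100 - k * loss

loss = predicted_loss(2e25)  # hypothetical training-compute figure (FLOPs)
print(round(loss, 3))                  # → 2.002
print(round(predicted_mmlu(loss), 1))  # → 79.4
```

With constants tuned this way the toy pipeline reproduces the post's 79.4% MMLU prediction, but a real forecast would fit a, b, and k to measurements from smaller models.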
The prediction framework is summarized in this diagram:
Despite these simplifying assumptions, which could have limited the accuracy of the prediction model, I believe I was able to predict GPT-4's loss and some of its capabilities fairly accurately, given scaling laws derived from the behavior of smaller models and knowledge of the capabilities, model properties, and training process of GPT-3 and similar models.
Similarly, the GPT-4 technical report includes details on how OpenAI used smaller models to predict GPT-4’s performance:
"A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance."
Given that OpenAI has full access to all information about GPT-3 and GPT-4, their predictions were probably more accurate than mine.
Limitations of the framework
- I think the biggest limitation of the framework is its neglect of algorithmic advances such as the introduction of image inputs to the GPT-4 model. Not taking algorithmic advances into account could also explain why I underestimated GPT-4's performance improvement on the MMLU benchmark.
- Although the average capabilities of language models tend to scale smoothly given more resources, specific capabilities can improve abruptly because of emergence. A model that predicts linear improvement in a specific capability in the short term may therefore only be capturing a short tangent of a more complex non-linear curve. This suggests that predicting specific capabilities in the long term is significantly more difficult.
GPT-4 was released earlier than I expected and consequently, I published the “GPT-4 Predictions” post just a month before the release of GPT-4 which possibly limited its utility. Given that the post was mostly based on data from 2020 and 2021 on models such as GPT-3, I think I could have made the predictions much earlier without a significant loss of accuracy. For example, if I had written the post in early 2021 it would have been published 2 years before the release of GPT-4.
I focused on benchmarks such as MMLU but I can now see from the GPT-4 technical report that human tests such as the SAT are also useful for evaluating language models.
I didn’t make any predictions on the safety improvements of GPT-4 over GPT-3 and such predictions could have been insightful.
My predictions seem to be evidence that it's possible to use scaling laws and other predictable quantitative methods to predict the general performance of language models at least in the short term.
Given the increased effect of algorithmic advances on ML capabilities in the long term and the inherent unpredictability of scientific progress, I expect accurately predicting the capabilities of ML models in the long term (>5 years) to be much more challenging.
As far as I know, the GPT-4 Technical Report also evaluates GPT-3.5 on the MMLU benchmark for the first time (source).
This Anthropic paper notes that GPT-3's MMLU performance improves very slowly when the model is below 10B parameters and then more quickly above that threshold which is a non-linear relationship.
There is evidence showing that algorithmic progress increases predictably over time.
Given that we are at the top end of the logistic success curve (getting closer and closer to 100% rather than farther and farther from 0%), I think a more correct/fair/accurate way to assess this would be to look at the failure rate you predicted vs. the failure rate that actually happened. So, you predicted GPT-4 would get approximately 20% of MMLU wrong, whereas it actually got 13.6% wrong. So basically you predicted it would make 50% more errors than it did.
I still think you deserve some credit for making this prediction, but I wouldn't call it 'fairly accurate' and I definitely don't think "8.7% off!" is the right way to think about the diff.
At 86.4%, GPT-4's accuracy is now approaching 100% but GPT-3's accuracy, which was my prior, was only 43.9%. Obviously one would expect GPT-4's accuracy to be higher than GPT-3's since it wouldn't make sense for OpenAI to release a worse model but it wasn't clear ex-ante that GPT-4's accuracy would be near 100%.
I predicted that GPT-4's accuracy would fall short of 100% accuracy by 20.6% when the true value was 13.6%. Using this approach, the error would be (20.6 − 13.6) / 13.6 ≈ 0.51.
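This failure-rate version of the calculation can be sketched as:

```python
# Shortfalls from 100% MMLU accuracy, in percentage points.
predicted_shortfall, actual_shortfall = 20.6, 13.6

# Relative error on the failure rate rather than the success rate.
relative_error = (predicted_shortfall - actual_shortfall) / actual_shortfall
print(round(relative_error, 2))  # → 0.51
```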
Strictly speaking, the formula for percent error according to Wikipedia is the relative error expressed as a percentage: percent error = |actual − predicted| / |actual| × 100.
I think this is the correct formula to use because what I'm trying to measure is the deviation of the true value from the regression line (predicted value).
Using the formula, the percent error is (86.4 − 79.4) / 86.4 × 100 ≈ 8.1%.
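As a quick check of this arithmetic, using the values from the thread:

```python
predicted, actual = 79.4, 86.4  # predicted vs. actual MMLU accuracy (%)

# Percent error: relative error expressed as a percentage.
percent_error = abs(actual - predicted) / abs(actual) * 100
print(round(percent_error, 1))  # → 8.1
```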
I updated the post to use the term 'percent error' with a link to the Wikipedia page and a value of 8.1%.
Suppose you predicted 91% but the actual value was 99%. The percent error may only be about 8% but the likelihood of a wrong answer is 1/100 instead of your predicted 9/100, which is a huge difference.
You may be interested in the links in this post: https://www.lesswrong.com/posts/6Ltniokkr3qt7bzWw/log-odds-or-logits
In this case, the percent error is 8.1% and the absolute error is 8%. If one student gets 91% on a test and another gets 99% they both get an A so the difference doesn't seem large to me.
The article linked seems to be missing. Can you explain your point in more detail?
OK. Let's make it even more extreme. Suppose you take a commercial flight. The likelihood of dying in a crash is on the order of 1 in 10 million. From a percent error or absolute error perspective, 99.99999% isn't that different from 99% but that is the difference between one plane crash per year globally and a couple of dozen plane crashes per hour on average. These are wildly different in terms of acceptable safety.
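The gap described above is easier to see on a log-odds scale (a sketch; the probabilities are the illustrative ones from this thread, not real aviation statistics):

```python
import math

def log_odds(p):
    # logit(p) = log(p / (1 - p)): maps probabilities near 0 and 1 onto
    # an unbounded scale where small linear differences become visible.
    return math.log(p / (1 - p))

p_predicted = 0.99    # "99% safe" on a linear scale
p_actual = 0.9999999  # roughly a 1-in-10-million failure rate

# Close on a linear scale, far apart in log-odds.
print(round(log_odds(p_predicted), 1))  # → 4.6
print(round(log_odds(p_actual), 1))     # → 16.1
```

The two probabilities differ by less than 1% in absolute terms but by about 11.5 logits, reflecting the factor-of-100,000 difference in failure rates.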
There's a backup link in the comments: https://www.thejach.com/public/log-probability.pdf