Predictions for GPT-N

by hippke, 29th Jul 2020



Regarding GPT-3, there is some discussion about whether growing the model would transform it into an Oracle AI. I looked into the benchmark results (Appendix H in the paper) to see if we can predict something useful from the actual measurements.

Method: The OpenAI team ran a suite of 63 different benchmarks (including sub-types), each for zero/one/few shot. In each scenario, there are 8 model sizes. I looked at how results scale with model size. With only 8 measurements, predictions carry a large uncertainty. Formally, one would test the trend function using Bayesian model selection between a linear and (e.g.) a polynomial model. I did this for a few benchmarks and then eye-balled the rest. So, please take the following as an indication only.
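
To make the trend comparison described above concrete, here is a minimal sketch. It is not the procedure actually used for this post: it swaps full Bayesian model selection for the Bayesian Information Criterion (BIC) as a cheap approximation, and it assumes that "linear" means linear in the logarithm of the parameter count. The model sizes are the eight GPT-3 sizes from the paper; the scores are made-up placeholders, not benchmark results, and the function names are hypothetical.

```python
import numpy as np

# The eight GPT-3 model sizes (parameter counts) from the paper.
MODEL_SIZES = np.array([125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9])


def bic_for_polyfit(x, y, degree):
    """BIC of a least-squares polynomial fit, assuming Gaussian residuals."""
    n = len(x)
    k = degree + 1  # number of fitted coefficients
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    rss = max(rss, 1e-12)  # floor to avoid log(0) on a perfect fit
    return n * np.log(rss / n) + k * np.log(n)


def compare_trends(sizes, scores):
    """Return the preferred trend (lower BIC) and both BIC values."""
    x = np.log10(sizes)  # assumption: trends are judged against log10(parameters)
    bic = {
        "linear": bic_for_polyfit(x, scores, 1),
        "quadratic": bic_for_polyfit(x, scores, 2),
    }
    return min(bic, key=bic.get), bic


# Placeholder scores, NOT taken from the paper: a deliberately curved trend.
log_sizes = np.log10(MODEL_SIZES)
example_scores = 5.0 * (log_sizes - 8.0) ** 2 + 10.0

preferred, bic = compare_trends(MODEL_SIZES, example_scores)
print(preferred, bic)
```

With only 8 points per benchmark, the BIC difference between the two fits is often small, which is one way to see why the classifications below are indications rather than firm results.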

Disclaimer: The smallest GPT-3 model has 125 million parameters, the largest 175 billion. That is a span of about 3 orders of magnitude. Extrapolating this over many more orders of magnitude is dangerous. Thus, take these numbers only as an indication.

Results: For the following tests, I find an asymptotic trend. Scaling the model will apparently not yield fantastic results for:

  • HellaSwag, LAMBADA, PIQA, CoQA, OpenBookQA, QuAC, RACE, CB, ReCoRD, WiC
  • Translations - but unclear level description.

In the following tests, it is unclear if the trend is asymptotic or better than that:

  • SAT: Could be linear, could be asymptotic. If linear, it will achieve 100% at parameters.
  • StoryCloze, Winograd, Winogrande, SQuADv2, DROP, COPA.

These tests show linear scaling:

  • TriviaQA ( parameter estimate to achieve 100%)
  • BoolQ ()
  • MultiRC ()
  • ARC ()
  • SuperGLUE ()
  • WSC ()
  • WebQs ()
  • Cycled ()
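
For the linear cases, a parameter estimate for reaching 100% can be obtained by fitting a straight line and solving for where it crosses the target. The sketch below assumes the score is linear in the logarithm of the parameter count; the function name is hypothetical and the scores are made-up placeholders, not values from the paper.

```python
import numpy as np


def params_for_target(sizes, scores, target=100.0):
    """Fit score = slope * log10(N) + intercept and solve for score == target."""
    x = np.log10(np.asarray(sizes, dtype=float))
    y = np.asarray(scores, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    if slope <= 0:
        raise ValueError("score does not increase with model size")
    return 10.0 ** ((target - intercept) / slope)


# The eight GPT-3 model sizes (parameter counts) from the paper.
MODEL_SIZES = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9]

# Placeholder scores, NOT benchmark results: exactly linear in log10(N).
placeholder_scores = [10.0 * np.log10(n) - 60.0 for n in MODEL_SIZES]

needed = params_for_target(MODEL_SIZES, placeholder_scores)
print(f"Extrapolated parameter count for 100%: {needed:.2e}")
```

Because the fit covers only about 3 orders of magnitude, an extrapolated crossing point that lies many orders of magnitude further out inherits a large uncertainty from small errors in the slope.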

Some tests scale neither linearly nor asymptotically:

  • Symbol: Near exponential ()
  • Arithmetic: Exponential; one-digit composite may achieve 100% at
  • Reversed: Near exponential ()
  • Anagrams: Polynomial ()
  • ANLI: stepped, unclear
  • RTE: stepped, unclear

Summary: About half of the tested skills will likely not scale much with larger models. The other half will (e.g., TriviaQA, SuperGLUE, arithmetic, anagrams). Going to e.g., parameters - would that make an Oracle AI? Probably it's not sufficient, but I'm interested in hearing your opinion!