GPT-4 Predictions

[-]wassname3y70

There may also be some architecture advances, although I'm unsure why we haven't seen these in recent LLMs. In Sam Altman's AC10 meetup Q&A he did say that GPT-4 would use a different loss function. What effect would that have? I have no idea.

You can see some examples in this Jan 2023 overview of transformer advances by Lilian Weng and The Transformer Family v2

[-]gwern3y40

he did say that GPT-4 would use a different loss function, what effect would that have? I have no idea.

One possibility is shifting the power law. See UL2 which combines the various denoising losses in what turns out to be a very good way: "U-PaLM: Transcending Scaling Laws with 0.1% Extra Compute", Tay et al 2022 - halving PaLM training requirements w/UL2 losses. I don't know if OA discovered UL2 first, but it's not all that exotic or subtle and is certainly something that many people ask themselves when they learn about the difference between bidirectional and unidirectional models: "why not train on both/all the losses?"

[-]LawrenceC3y*70

Thanks for writing this!

I think the crux of your estimate of compute usage is the following line:

total_gpus = 2000 to 15000

In May 2020 (!) Microsoft announced that they had built a supercomputer with 10,000 GPUs for OpenAI, which is often suggested to be the machine GPT-3 was trained on: https://news.microsoft.com/source/features/ai/openai-azure-supercomputer/

So it's very possible (albeit unlikely) that the number of total GPUs used for GPT-4 training could be higher than 15000!

Corrections/nitpicks:

GPT-3.5 finished training in early 2022, was released in November 2022, and demonstrated better quality answers than GPT-3. In December 2022, OpenAI released ChatGPT which is based on GPT-3.5 and fine-tuned for conversation.

code-davinci-002 and text-davinci-002 were first released in mid March 2022, soon after the InstructGPT paper, not November 2022. Source:
https://openai.com/blog/gpt-3-edit-insert/ (See also this reddit thread talking about text-davinci-002.)

The capability can be used with the latest versions of GPT-3 and Codex, text-davinci-003 and code-davinci-002.

Also, a nitpick: ChatGPT was released November 30th, 2022: https://openai.com/blog/chatgpt/

[-]gwern3y80

So it's very possible (albeit unlikely) that the number of total GPUs used for GPT-4 training could be higher than 15000!

OAers have noted that the cluster has, of course, been expanded heavily since the original 10k (albeit not what it is now). Morgan Stanley is saying that GPT-5 is being trained right now on 25,000 GPUs, up heavily from the original 10k, and implying that 'most' of the GPT-5 GPUs were used for GPT-4 which finished 'some time ago'; the mean of 10 & 25 is 17.5, so >15k seems entirely possible, especially if those GPUs weren't just installed.

[-]Stephen McAleese3y20

Thanks for the comment! I updated the paragraph to:

The GPT-3.5 models finished training and were released in 2022, and demonstrated better quality answers than GPT-3. In late 2022, OpenAI released ChatGPT which is based on GPT-3.5 and fine-tuned for conversation.

[-]cubefox3y10

The March blog post mentions text-davinci-003, but you only say text-davinci-002 was released in March. The latter seems more plausible, since it matches with the newsletter OpenAI sent out at the end of November: "New GPT-3 model: text-davinci-003".

Starting today, you can access text-davinci-003 through our API and playground at the same price as our other Davinci base language models ($0.0200 / 1k tokens).

So I think the "March" blog post has probably been edited and isn't decisive evidence that code-davinci-002 (the GPT 3.5 base model) actually came out in March.

[-]sanxiyn3y60

my rough guess is that GPT-4 will have twice the context length: 8192 tokens.

There is a Twitter rumor, supposedly based on a document leaked from OpenAI, which implies GPT-4 will have the context length of at least 32K(!).

[-]nostalgebraist3y50

GPT-4 will have twice the context length: 8192 tokens

code-davinci-002 already has a context window of 8000 tokens. Or at least, that is the max request length for it in the API.

[-]AlexMennen3y51

In the table of parameters, compute, and tokens, compute/(parameters*tokens) is always 6, except in one case where it's 0.6, one case where it's 60, and one case where it's 2.75. Are you sure this is right?

[-]Stephen McAleese3y10

Thanks for spotting this.

I noticed that I originally used the formula when it should really be $C \approx 6 D N$ because this is the way it's written in the OpenAI paper Scaling Laws for Neural Language Models (2020). I updated the equation.

The amount of compute used during training is proportional to the number of parameters and the amount of training data: $C \propto D N \to C \approx k D N \to C \approx 6 D N$.

Where there is a conflict between this formula and the table, I think the table should be used because it's based on empirical results whereas the $C \approx 6 D N$ formula is more like a rule of thumb.
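A minimal sketch of the consistency check discussed above, assuming the rows of the Parameters/Compute/Tokens table as they appear in this copy of the post. The helper names `implied_k` and `flag_rows` and the 5% tolerance are illustrative choices, not anything from the post.

```python
# Rows copied from the post's table: (parameters N, compute C in FLOPs, tokens D).
ROWS = [
    (400e6, 1.99e19, 8.30e9),
    (1e9, 1.34e20, 22.43e9),
    (10e9, 1.67e22, 278.2e9),
    (67e9, 9.91e23, 2.43e12),
    (175e9, 3.22e24, 6.67e12),
    (280e9, 1.90e25, 11.03e12),
    (520e9, 7.07e25, 22.63e12),
    (1e12, 2.84e26, 47.27e12),
    (10e12, 3.87e28, 644.57e12),
]


def implied_k(n_params, compute, n_tokens):
    """Return k in C = k * D * N for one table row."""
    return compute / (n_params * n_tokens)


def flag_rows(rows, expected_k=6.0, rel_tol=0.05):
    """Return the rows whose implied k is more than rel_tol away from expected_k."""
    return [
        row for row in rows
        if abs(implied_k(*row) - expected_k) / expected_k > rel_tol
    ]


if __name__ == "__main__":
    for n, c, d in ROWS:
        print(f"N={n:.3g}  D={d:.3g}  C/(N*D)={implied_k(n, c, d):.2f}")
    print("Rows outside tolerance:", flag_rows(ROWS))
```

With these rows, only the 175B entry falls outside the tolerance, with an implied constant of roughly 2.76 rather than 6.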

[-]AlexMennen3y71

My point wasn't that the equation didn't hold perfectly, but that the discrepancies are very suspicious. Two of the three discrepancies were off by exactly 1 order of magnitude, making me fairly confident that they are the result of a typo. (Not sure what's going on with the other discrepancy.)

[-]Stephen McAleese3y50

You were right. I forgot the 1B parameter model row so the table was shifted by an order of magnitude. I updated the table so it should be correct now. Thanks for spotting the mistake.

[-]Mo Putera3y30

Curious, what do you think now that GPT-4 is out?

[-]gwern3y30

GPT-4 will be a text-only model like GPT-3.

That was a very long time ago, and the current GPT-4, you agree, would've been trained well after. So we can't put too much weight on it.

The current reporting is pushing very heavily on multimodality, including the announcement by Microsoft Germany that it will be released next week & explicitly stating GPT-4 will do video, in addition to earlier reporting about images (as well as Altman musing about revenge porn).

[-]cubefox3y30

Several people have argued that Sydney/Bing Chat performs better on reasoning tasks than ChatGPT/GPT-3.5, apart from its questionable dialogue fine-tuning. It may therefore be GPT-4. Have you looked into this? How does it affect your analysis?

It seems to me that Sydney is not the big leap you seem to predict for GPT-4. Then again, Sydney may use a smaller model (like Curie or Babbage for GPT-3) to save on inference cost, while you seem to be talking about the largest davinci model only.

[-]Stephen McAleese3y40

I've seen some of the screenshots of Bing Chat. It seems impressive and possibly more capable than ChatGPT but I'm not sure. Here's what Microsoft has said about Bing Chat:

"We’re excited to announce the new Bing is running on a new, next-generation OpenAI large language model that is more powerful than ChatGPT and customized specifically for search. It takes key learnings and advancements from ChatGPT and GPT-3.5 – and it is even faster, more accurate and more capable."

If the model is more powerful than GPT-3.5 then maybe it's GPT-4, but "more powerful" is too vague a phrase to draw any clear conclusions from. I don't think I have enough information at this point to make strong claims about it, so I think we'll have to wait and see.

[-]cubefox3y31

This is a great analysis!

[-]Lukas Finnveden3y*20

GPT-2 1.5B 15B 2.5794

Where does the "15B" for GPT-2's data come from, here? Epoch's dataset's guess is that it was trained on 3B tokens for 100 epochs: https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4/edit#gid=0

[-]Stephen McAleese3y*30

I used the estimate from a document named What's In My AI? which estimates that the GPT-2 training dataset contains 15B tokens.

A quick way to estimate the total number of training tokens is to multiply the training dataset size in bytes by the number of tokens per byte, which is typically about 0.25 according to the Pile paper. So 40 GB × 0.25 tokens per byte ≈ 10 billion tokens.
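As a rough sketch of that estimate: the function name is illustrative, 0.25 tokens per byte is only the typical figure quoted above, and treating 1 GB as 1e9 bytes is an assumption about the unit.

```python
def estimate_tokens(dataset_bytes, tokens_per_byte=0.25):
    """Rough token count from raw dataset size.

    tokens_per_byte = 0.25 is a typical value, not a constant; it depends
    on the tokenizer and on the text itself.
    """
    return dataset_bytes * tokens_per_byte


if __name__ == "__main__":
    # 40 GB of text, taking 1 GB = 1e9 bytes (assumption).
    print(f"{estimate_tokens(40e9):.2e} tokens")
```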

[-]Adam_Barker3y20

Here's an equation for the MMLU vs Loss plot:

An MMLU score of 100% corresponds to a loss of 1.8304. Using the scaling laws, listed here, this can be reached using:

The GPT-4 dataset (4T tokens) and a model 11x the size of Megatron-Turing NLG (6 trillion parameters). Compute time: 111 days on Eos.
GPT-4's 175B params with 18.5 trillion training tokens (4.6x the size of GPT-4's dataset). Compute time: 16 days on Eos, but getting that many tokens may be a problem.
Megatron-Turing NLG's 530B parameters, and 8.5 trillion tokens (2.1x the size of GPT-4's dataset). Compute time: 23 days on Eos. This is a much more reachable dataset.

The nx compute speed of Eos used for GPT-4 was 18.4 ExaFLOP/s.
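The three configurations above can be sanity-checked with a short sketch. It assumes the scaling law in question is the Chinchilla parametric fit $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with the constants published by Hoffmann et al. (2022); the comment does not state this, but the fit reproduces the Loss column of the post's table. It also takes the GPT-4 dataset to be the 4T tokens listed in that table. The function and variable names are illustrative.

```python
# Chinchilla parametric loss fit (Hoffmann et al., 2022).
# Assumption: this is the scaling law the comment refers to.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Predicted loss for a dense model with n_params parameters and n_tokens training tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# (label, parameters, tokens) for the three configurations in the comment.
CONFIGS = [
    ("11x Megatron-Turing NLG, 4T tokens", 11 * 530e9, 4e12),
    ("175B params, 18.5T tokens", 175e9, 18.5e12),
    ("530B params, 8.5T tokens", 530e9, 8.5e12),
]

if __name__ == "__main__":
    for label, n, d in CONFIGS:
        print(f"{label}: loss = {chinchilla_loss(n, d):.4f}")
```

Under these assumptions, all three configurations come out within about 0.001 of the 1.8304 target.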

[-]Lucas L3y10

Futuur predicts a 32% chance that GPT-4 will be released by May 2023 and a ~74% chance that it will be released by the end of 2023 (probability generated in real money).

[-]eo rojas3y10

Thanks for this great overview of what's going on with large language models. I want to indulge in some limited-knowledge speculation; I just want to put that out there first. I'm going to predict that larger language models are not going to be very much more useful than the current state. I'm wondering if you know or can talk about what people are doing to improve the models other than making them bigger? I read a Wolfram blog post that talked about using the Wolfram expert systems in conjunction with large language models. Have you looked at that? The approach of combining LLMs with expert systems seems to me to be a valuable path toward creating more useful systems, i.e., results from an LLM could be evaluated by expert systems and fed back into the text that's generated. Do you see this as the path to creating even more intelligent systems?

[-]Capybasilisk3y20

There are currently attempts to train LLMs to use external APIs as tools:

https://cognitiveai.org/wp-content/uploads/2022/10/wang2022-behavior-cloned-transformers-are-neurosymbolic-reasoners-arxiv.pdf

https://arxiv.org/abs/2302.04761

[-]BrooksT3y10

Thanks for the excellent post! I don’t think you mentioned anything about technical improvements in training, such as efficiency of parallelization. Do you know if there are interesting things going on there, or does that shake out as part of the GPU generational improvements?

[-]Stephen McAleese3y10

Thanks for bringing this up. I don't think I mentioned any algorithmic improvements apart from RETRO so these predictions are probably somewhat conservative.

[-]asasilogic3y10

In the last sentence of the last paragraph, you meant to write "...advancement from GPT-3 to GPT-4...", yes?

[-]Stephen McAleese3y10

Thanks for spotting the typo! I updated the post.

^{^}

GPT-3 uses the exact same architecture as GPT-2 except that it uses “alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer”, which reduces the attention cost for a context of length $n$ from $O(n^{2})$ to $O(n \sqrt{n})$.

^{^}

Nvidia claims that Selene can reach 2.8 EFLOP/s, but that is Selene's theoretical performance using structural sparsity. Without sparsity the performance is half that, and it halves again given that typical GPU utilization is only about 50%: 2800 PFLOP/s / 4 = 700 PFLOP/s.

^{^}

When training dense models.

^{^}

Chinchilla's Wild Implications by nostalgebraist explains the formula well.

^{^}

Massive multitask language understanding.

^{^}

I wrote this post myself but imagine for a moment if it had been generated by a language model without any editing.

Parameters	Compute	Tokens
400M	1.99e19	8.30B
1B	1.34e20	22.43B
10B	1.67e22	278.2B
67B	9.91e23	2.43T
175B	3.22e24	6.67T
280B	1.90e25	11.03T
520B	7.07e25	22.63T
1T	2.84e26	47.27T
10T	3.87e28	644.57T

Model name	N	D	Loss
GPT	117M	1B	3.6644
GPT-2	1.5B	15B	2.5794
GPT-3	175B	300B	2.0023
Megatron-Turing NLG	530B	270B	1.9906
Gopher	280B	300B	1.9933
Chinchilla	70B	1.5T	1.9335
GPT-4	175B	4T	1.8727

Model	Loss	MMLU 5-shot accuracy
Random	N/A	25.0%
GPT-J (6B)	2.13363	27.3%
Average human rater	N/A	34.5%
GPT-3	2.00223	43.9%
Gopher	1.99326	60.0%
Chinchilla	1.93352	67.6%
PaLM (540B)	1.92387	69.3%
GPT-4 (predicted)	1.87274	78.9%
Average human expert	N/A	89.8%
