This post is one part of the sequence Understanding the diffusion of large language models. As context for this post, I strongly recommend reading at least the 5-minute summary of the sequence.
Until the release of OPT-175B in May 2022, incremental research was the prevailing diffusion mechanism for gaining direct access to the weights of a GPT-3-like model. Since the release of OPT-175B, the prevailing mechanism has been the combination of replication and open publication. What follows is my reasoning and further thoughts on the mechanisms of diffusion of GPT-3-like models:
Caveat: I expect the above conclusions to change when it comes to the diffusion of future state-of-the-art language models, due to:
In contrast to the above changes, I expect the diffusion of models with similar performance to GPT-3 (rather than greater performance) will accelerate in the future.
Below I discuss the most important factors for diffusion that I identified in the course of my research and that fell within my scope. These are the factors that, in various cases, made developing GPT-3-like models easier or more likely by the largest margin.[17] I don’t consider the core resources for developing GPT-3-like models as “factors” themselves—those resources (mainly compute and talent) are discussed in the previous post. Overall, I’m 80% confident that each of these factors is important enough for a longtermist researcher to spend at least one month full-time thinking about how to beneficially affect it.[18]
I think that the difficulty of accessing enough compute has been the largest hindering factor to the diffusion of GPT-3-like models. This was the case up until the release of OPT-175B in May 2022, after which GPT-3-like models became much more accessible.[19] My claim is based on the following evidence:
I think that the difficulty of acquiring the necessary machine learning and engineering expertise was the second largest hindering factor to the diffusion of GPT-3-like models. To clarify, this claim is specifically about having the expertise to overcome the challenges of training large language models. This claim is not about the expertise to independently discover algorithmic insights, though I believe that is a lesser hindering factor. The claim is based on the following evidence:
So far, I think the most important factor for lower-resourced actors to approach GPT-3-like capabilities has been the sponsorship of compute by separate parties. This accelerating factor is the flip side of challenges of acquiring compute as a hindering factor—sponsorship allows these actors to leap over the obstacle of acquiring compute.
The first key example is that CoreWeave provided compute to EleutherAI for free to develop and train GPT-NeoX-20B. According to Sid Black, one of the main contributors to developing GPT-NeoX-20B, EleutherAI spent nothing out of pocket on compute for the GPT-NeoX project. Prior to this, EleutherAI was using the TensorFlow Research Cloud (TFRC) scheme that provided free access to TPUs, but this was not sufficient to train GPT-3.[26] The incentive for CoreWeave was to have their hardware tested as they were starting up their cloud computing operation, and to gain insight into what is required to use their hardware for training large language models.[27] The incentive for TFRC prior to this seemed to be testing their TPU hardware and advertising the advantages of that hardware.[28]
The second key example of compute sponsorship from my case studies is that BigScience was provided €3M from French research agencies CNRS and GENCI to train the BLOOM model on the Jean Zay supercomputer (BigScience, 2022).
Sponsorship can enable actors to use models closer to the cutting edge than they could otherwise access, to do research on such models, and to increase the number of people with access to these models (e.g., as happened with BLOOM open-sourcing its weights). But does the sponsorship of resources like compute ultimately matter for who develops transformative AI (TAI)? I think the sponsorship of resources is less likely to matter than diffusion among AI developers who can already afford to pay for the resources themselves, because the actors receiving sponsorship tend to be lower-resourced to begin with, and therefore less likely to keep up with or surpass the state of the art. However, I think sponsorship is a factor worth bearing in mind when thinking about which actors could plausibly become contenders to develop TAI in the future, and when thinking about how to beneficially shape diffusion.[29]
To see this, consider that the sponsorship of compute could give smaller actors the necessary momentum to become more significant actors. As with the BigScience case, there could also be a big role for governments and associated funding agencies to play in sponsoring massive amounts of resources for AI developers. This is already the case in China. The Beijing Academy of Artificial Intelligence, Zhejiang Lab, and Peng Cheng Lab are Chinese government-sponsored entities that have provided support for funding and compute to recent AI research projects in China (Ding & Xiao, forthcoming). For instance, Peng Cheng Lab was involved in PanGu-alpha.
Open-source tools that are specifically designed for large-scale model training were a notable accelerating factor in the cases I studied. There are two things to clarify about this:
The Megatron-LM codebase was first published in September 2019. It started as the code implementing NVIDIA’s 8-billion parameter language model, Megatron, which was introduced in Shoeybi et al. (2019).[30] Megatron was heavily based on the 1.5-billion-parameter GPT-2, the predecessor of GPT-3.[31] The Megatron-LM codebase was later used in Narayanan et al. (2021),[32] which as the title suggests, offers useful insights on efficient large-scale language model training.
Shevlane (2022) claims that the Megatron code release “made it very easy for anyone to train GPT-2-like models if they had access to enough GPUs; Aaron [a Brown University graduate student who replicated GPT-2] told [the author] that with the Megatron code and enough money, a high school student could do it.”[33] By the same logic, I make a similar claim for the current Megatron-LM codebase (after the “efficient large-scale training” paper was published) with respect to GPT-3. The Megatron-LM codebase has formed a significant part of the overall codebase for OPT-175B, Jurassic-1-Jumbo, GPT-NeoX-20B, BLOOM, and Megatron-Turing NLG—though the latter is not really relevant to diffusion, since NVIDIA was directly involved.[34] The fact that Meta AI and AI21 Labs both used Megatron-LM code suggests that the benefit of open-source tools is not limited to small actors that tend to have less engineering talent, such as academic labs or independent collectives; larger AI developers benefit as well.
It’s difficult to quantify how much the Megatron-LM code helps, and it certainly does not remove most of the compute cost. The code merely helps with implementation. But given the prevalence of the Megatron-LM code in my case studies, I expect that it significantly reduces the talent barrier to start a GPT-3-like model development project. It probably also saves time and money by improving efficiency. Sid Black of EleutherAI told me that Megatron-LM and another tool called DeepSpeed were frustrating and time-consuming to use and extend. Despite that, he said that Megatron-LM is “really fast” and he was glad to have these tools available when developing GPT-NeoX-20B.
A similar tool which is often used alongside Megatron-LM is Microsoft’s DeepSpeed. According to the GitHub repo, “DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.” DeepSpeed, or a “forked” version of it on GitHub, was used in all the case studies where Megatron-LM was used except OPT-175B (as far as I could tell).
Similar specialized open-source software is used by other AI developers. In the Chinese sphere, there is MindSpore, which was used to train PanGu-alpha. Google’s PaLM used T5X and JAX, while DeepMind’s Gopher and Chinchilla used JAX and Haiku—though these are less specialized for language model training than Megatron-LM is.
Although it is difficult to measure and track the effects of the hype surrounding an AI research result, I believe that hype is an important accelerating factor in the diffusion of GPT-3-like models, and will probably play a key role in the diffusion of future state-of-the-art machine learning models. What I mean by hype is a combination of (a) the amount of attention that something gets, and (b) the belief that the thing is promising in some way, e.g., that it’s worth replicating, or reveals a research direction worth pursuing. My point about the importance of hype here is related to my previous takeaway about the importance of attention to information.
First, GPT-3 was surprising: I estimate that GPT-3 was published 11 months earlier than expected based on training compute trends at the time (90% CI: 5 to 17 months).[35] Second, the insight which GPT-3 demonstrated was significant. Shevlane (2020, pp. 15-16) explains this point: “The idea [of the release strategy of GPT-2 and GPT-3] was that the models themselves were the hardest thing for bad actors to recreate, given the high compute costs required to produce the models. This was assuming that the papers, in contrast, did not contain truly novel insights. However, this focus on models has been questioned, with some risk-conscious AI researchers arguing that the GPT-3 paper was actually the risky thing. The paper, alongside other papers that OpenAI published in 2020, demonstrated to many onlookers the benefits of scale: if you throw a large amount of compute and data at a model with a very high number of parameters, you can get very impressive capabilities. Some people viewed this as dangerous in that it accelerates the field’s progress towards advanced AI, thus giving the world less time to prepare” (my emphasis).
A massive increase in hype around GPT-3 occurred not when the GPT-3 paper (Brown et al., 2020) was first published, but after people started demonstrating capabilities with the OpenAI API on Twitter.
I’m very uncertain whether this hype strongly influenced the subsequent R&D decisions of specific leading AI developers. My best guess is that GPT-3 sped up both DeepMind’s and Google’s work on scaling up language models by six months (90% CI: 1–18 months). But I have not been able to determine whether this acceleration was driven by insider knowledge of GPT-3 before publication, the publication itself, the hype generated after publication, or some combination of those factors. In addition to the surprise and hype around GPT-3 argued above, I have the following evidence for this claim:
Here I introduce the concept of a diffusion cascade: the acceleration of diffusion that results from the diffusion of artifacts relevant to producing a given closed-source model. The concept applies when a closed-source model is initially accessible to only one actor, and no other actor fully understands how to produce it and/or has all the resources needed to do so.[40] The incremental progress and open-sourcing by other actors in the meantime fills in the gaps in knowledge and resources, and thereby accelerates diffusion. Even if the latest capability advance is initially only reachable by leading AI developers, those developers can make diffusion to other actors happen more easily and sooner than otherwise.
Below I list some specific drivers of diffusion cascades, and empirical examples of those drivers being involved in diffusion cascades. I also indicate the current relative importance of each driver on a subjective 1–5 scale (5 is most important), based on a combination of independent reasoning and the empirical examples. Here, importance means how much the driver has empirically accelerated diffusion.[41]
The obvious way to slow down a diffusion cascade, and diffusion in general, is to have greater secrecy. In the absence of coordination, the best that one actor can do on this front is to try to keep knowledge of a project or model completely secret, not even revealing the model’s existence.
My impression is that it is not uncommon to keep models secret temporarily (i.e., delaying publication past the minimum time needed to produce a publication).
One thing to note here is that while a model may remain secret to the general public until it is published, I suspect that information does sometimes leak, especially among peers in AI development at different labs.[47] Rumors can also circulate, even to the public, though it’s unclear when this is intentional and when it is unintentional. For example, Hao (2020) seems to refer to the text-to-image model DALL-E (or similar preliminary work) 11 months before DALL-E was announced (Ramesh et al., 2021).[48]
Besides delaying publication, actors could limit diffusion cascades (if that is their goal) through more comprehensive secrecy around information and resources, even if the existence of the model and research results about it are publicized. Given the various information sources and artifacts that can drive a diffusion cascade, it would be more effective to secure not just the model, but also, e.g., the specialized software tools used to train it, the datasets, and the details of the training infrastructure and parallelism strategies. For example, the developers of GPT-3 did not explain or open-source the software tooling used to train the GPT-3 model. This seems to have left a gap that Narayanan et al. (2021) had to spend time filling (i.e., with the Megatron-LM codebase).
I used three methods to estimate when experts would have expected GPT-3 (or a rough equivalent) to be released, as of immediately before GPT-3 was actually publicized. Estimating this provides evidence about the extent to which multiple discovery was involved in the diffusion of GPT-3-like models, and about the counterfactual impact of publicizing GPT-3. The estimates are detailed in the following subsections.
First I analyze how unexpected GPT-3 was in terms of the average trend in training compute for models over time. My analysis is based on this interactive plot of compute trends by Epoch. Below are the initial steps I took and the results I obtained from different plots:
I used the weighted average as the central estimate, and the filtered standard deviation to get 90% confidence bounds. Thus my first estimate for the expected arrival time of GPT-3 is June 2022 (90% CI: August 2021 to April 2023). A major limitation of this estimate is that I am using a prediction of the average milestone system rather than a prediction of the most expensive system. Including the “Large Scale” trends in my aggregate prediction compensates for this somewhat (because the “Large Scale” data contains the most expensive systems), but the resulting predictions are probably still later than experts actually expected. Due to this limitation, I only put 30% weight on this estimate.
One way to improve on the first estimate is to look at when the trend predicts GPT-3’s training compute minus some amount of deviation based on the variance in the data. Due to time constraints I have not computed a confidence interval for the trendline. However, visually inspecting the Language category data over the whole “Deep Learning era” in this plot, data points about one order of magnitude above the trendline are common. For example, Meena (January 28, 2020) used 1.1E+23 FLOP while the trend was at about 1E+22 FLOP, and Seq2Seq LSTM (September 10, 2014) used 7.3E+18 FLOP while the trend was at about 4E+17 FLOP. The biggest outlier is GNMT (September 26, 2016) at 6.9E+21 FLOP when the trend was only at about 2E+19 FLOP; however, I think this is too large an outlier to have significantly shifted people’s best-guess expectations about when GPT-3’s amount of training compute would be used.
Based on this rough inspection, I will just look at when the trendline predicts one order of magnitude lower than the true value, i.e., when it predicts 3E+22 FLOP rather than 3E+23 FLOP. This appears to occur in late July 2020, only 2 months after GPT-3 was actually published.
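This extrapolation can be sketched by fitting a straight line in log space through the two approximate trend readings quoted above and solving for the crossing date. Note that both anchor values are rough readings off Epoch’s interactive plot, and Epoch’s actual trendline is fit to many more points, so the dates below are only indicative:

```python
from datetime import date, timedelta
import math

# Two approximate readings of Epoch's Language-category trendline quoted in
# the text (values read off the interactive plot, so only rough):
t0, c0 = date(2014, 9, 10), 4e17   # trend level around Seq2Seq LSTM
t1, c1 = date(2020, 1, 28), 1e22   # trend level around Meena

# Fit an exponential trend (a straight line in log10 space) through the
# two readings.
days = (t1 - t0).days
ooms_per_day = math.log10(c1 / c0) / days

def crossing_date(target_flop):
    """Date at which the fitted trend reaches target_flop."""
    days_ahead = math.log10(target_flop / c1) / ooms_per_day
    return t1 + timedelta(days=round(days_ahead))

# One order of magnitude below GPT-3's ~3E+23 FLOP, as in the text:
print(crossing_date(3e22))   # lands in mid-to-late 2020
# When the trend reaches GPT-3's actual training compute:
print(crossing_date(3e23))
```

The first crossing lands in mid-to-late 2020, broadly consistent with the late-July reading in the text given the imprecision of the anchor points.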
Based on this, I chose 2 months as my central estimate for the time that GPT-3 was expected (in terms of training compute), relative to when it was actually published. Like the first estimate, I used the filtered standard deviation of 10 months to get confidence bounds. Thus my second estimate for the expected arrival time of GPT-3 is July 2020 (90% CI: December 2019 to May 2021). Although this estimate is less rigorous than the first estimate, I think it is closer to the quantity I’m actually trying to estimate, so I put 50% weight on it.
Finally, I have some evidence about the expected timing of GPT-3 from one researcher who has trained large language models at an AI safety lab. They told me: “I think GPT-3 probably pushed other labs in this direction about a year earlier than they otherwise would have. It’s a bit hard to know for sure. There were certainly other groups training larger and larger LMs each few months and they were doing better and better, but it wasn’t obviously clear to everyone that scale was the main ingredient there.” This isn’t a direct claim about when GPT-3 was expected to arrive, but it suggests that if GPT-3 had been published one year later, that would have been more in line with the field’s expectations. As with the other estimates, I put a confidence interval of +/- 10 months around this 12-month estimate. So my third estimate is May 2021 (90% CI: July 2020 to March 2022). Since this is based on an offhand comment from one expert, I only put 20% weight on it.
I put my three estimates together in a weighted average using this Guesstimate model and obtained an overall estimated delay of 11 months (90% CI: 5 to 17 months), or an estimated date of April 2021 (90% CI: October 2020 to October 2022). Note that the confidence interval does not account for correlation between the individual estimates (in particular between the first and second estimates, which use the same data and trend), so it should probably be wider to reflect my true confidence.
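The aggregation above can be reproduced numerically. The following is my own reconstruction rather than the actual Guesstimate model: it treats each estimate as a normal distribution with the +/- 10-month 90% interval used in the text, and takes the weighted average of samples:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# A 90% CI on a normal distribution spans the mean +/- 1.645 standard
# deviations; each estimate in the text uses a +/- 10-month interval.
sd = 10 / 1.645

# Delays in months relative to GPT-3's actual publication (May 2020).
e1 = rng.normal(25, sd, N)  # compute-trend estimate: June 2022, weight 0.3
e2 = rng.normal(2, sd, N)   # adjusted-trend estimate: July 2020, weight 0.5
e3 = rng.normal(12, sd, N)  # expert-comment estimate: May 2021, weight 0.2

# Weighted average of the three distributions, sample by sample.
delay = 0.3 * e1 + 0.5 * e2 + 0.2 * e3

print(f"mean delay: {delay.mean():.1f} months")   # close to the 11-month figure
print(f"90% CI: {np.percentile(delay, 5):.0f} to {np.percentile(delay, 95):.0f} months")
```

Under these assumptions the mean comes out near 11 months with a 90% interval of roughly 5 to 17 months, matching the figures in the text. Note this reconstruction treats the three estimates as independent, which is the same limitation flagged above.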
What this overall estimate implies is that GPT-3 arrived significantly earlier than expected. I think that the most likely reason for this unexpected event is OpenAI simply being willing and able to invest in a larger amount of compute. The “willing” part is probably the key factor in OpenAI getting to this amount of compute before other leading language model developers just prior to GPT-3’s release, especially Google.
This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
To be clear, only 7 of these 9 GPT-3-like models are in my 9 full case studies; 2 models in my case studies do not meet my definition of GPT-3-like.
Note that this is not a fair comparison with talent holistically. Talent can be the key bottleneck even when salaries are only a small fraction of project costs, due to the time and financial cost of producing enough people with the requisite skills. Further analysis of the holistic talent cost seems worthwhile in future work.
Sponsorship of compute resources could involve an actor doing any of the following things: (a) giving another actor ownership of compute hardware, (b) giving another actor access to compute hardware, (c) giving another actor money that can only be used on compute, or (d) giving another actor money with the intention that it is used for compute. Only cases (b) and (c) occurred in my case studies.
E.g., Beijing Academy of Artificial Intelligence (BAAI) and Peng Cheng Laboratory (PCL) were involved in the GLM-130B and ERNIE 3.0 Titan models respectively. See my survey of models covered previously for details.
I won’t make the effort to detail all these insights, but note that the Gopher paper (Rae et al., 2021) is titled "Scaling Language Models: Methods, Analysis & Insights from Training Gopher”.
I assessed which models are GPT-3-like in a previous post. The nine GPT-3-like models are Gopher, Hyperclova, Jurassic-1-Jumbo, Megatron-Turing NLG, LaMDA-PT, Yuan 1.0, ERNIE 3.0 Titan, Chinchilla, and PaLM.
In a previous post, I estimated that 1000 (90% CI: 200–3000) people could be eligible to access the model weights of OPT-175B, and that all of these people could be granted access in the first year following OPT-175B’s release. I don’t know how many people have actually been permitted to access OPT-175B so far (i.e., who have requested and been granted permission); it is very likely lower than the number who could be eligible. But as of November 2022, I think that number is more than 80% likely to exceed 73, which is the total “core team size” across the models for which I estimated core team size (see this cell of the diffusion database).
See Wiblin and Harris (2022). Rob Wiblin: “Are there any historical case studies of information leaks in ML? Are there any cases where an ML model has been stolen in the past?” Nova DasSarma: “That’s a great question. I don’t think I can think of one offhand actually. If they have been stolen, then it’s one of those things where they’ve kept hush-hush about it.”
Paraphrasing from personal correspondence: Ben Cottier: “Do you know any examples of hackers accessing ML-related artifacts like datasets, trained models, etc.?” Jeffrey Ladish: “Ram Shankar Siva Kumar from AI Red Team at Microsoft—they used phishing to steal a model etc. That's the only example I know of.” I found Field (2022) related to what Jeffrey Ladish was referring to. This isn’t a “real world case of ML model theft” in that it was a red-teaming exercise and didn’t actually result in diffusion to unauthorized parties.
This estimated delay is explained in the section on publicity.
I think doing this in four months would probably be feasible, based on my estimates of training wall-clock time and total project duration (i.e., time until having the trained model; this excludes time for writing and publishing a paper) in the diffusion database. The case with the most confident estimates is OPT-175B, with a total project duration of 78 days, including 33 days of training time. However, there were four months from OPT-175B completing training to the paper being published in May 2022. So my estimate of one month to evaluate the model and publish is probably too short.
Geoffrey Irving (Safety Researcher at DeepMind) told me that “[People who worked on Gopher] had already started LLM scaleup for the purpose of using them for communication and recursion-based alignment schemes soon after I joined [DeepMind, from OpenAI, in October 2019], but GPT-3 did add an organizational push.”
See Shevlane (2022). A senior member of OpenAI (who is specified on p.27 of the PDF) told the author: “GPT-3 existed for a long time before the paper came out. We delayed the paper. [...] But it’s months, it doesn't really count. And you're sitting there, fucking white-knuckling it, because it's really costly if someone releases their paper, and you have fucked this up somehow. So you're under pressure” (p.66 of the PDF).
This is just a rough estimate, and expecting a result to be published by a certain date does not guarantee that no other equivalent model would have been published otherwise. Nonetheless, it is evidence in the direction of “multiple discovery was not involved in any cases of GPT-3-like model diffusion”.
Full correspondence is available here upon request.
My thinking on this is generally informed by Ladish and [lennart] (2022).
I focus on development rather than access to GPT-3-like models here because I think development is more important. See a previous post for my reasoning on this.
In my case studies there is a close relationship between the factors for diffusion and the resources that drive capabilities (i.e., money, compute, data, and talent). I think this is because replication and incremental research were the main mechanisms of diffusion for two years: the actors involved had to actually develop models independently in order for the models to diffuse, because there weren’t any open-source models for a while. But if the main diffusion mechanism happened to be espionage, then an accelerating factor might be poor information security at an organization. So the factors for diffusion and the resources that drive capabilities can in principle be quite separate.
This is because OPT-175B allows more people to get direct access to its model weights, and producing trained model weights seems to be the most compute-intensive aspect of AI development/deployment.
See the “Training cost (2022 USD)” column of the diffusion database, noting which models are classified as GPT-3-like in the “GPT-3-like model?” column. Some GPT-3-like models in the database do not have cost estimates, but seem very likely to fall within the $1–10M cost range given their training compute (see the “Training compute (FLOPs)” column).
Note that this is not a fair comparison with talent holistically. Talent can be the key bottleneck even when salaries are only a small fraction of project costs, due to the time and financial cost of producing enough people with the requisite skills. Further analysis of the holistic talent cost seems worthwhile in future work.
See the Abstract of Zeng et al. (2021).
My conversation notes with Sid Black are available upon request.
Black indicated this rough 40–50% confidence after seeing a draft of this text (which included my skepticism about Black’s claim). Black originally told me (paraphrasing from conversation) that “We did kinda become bottlenecked by compute—if CoreWeave had offered more GPUs, we probably could have [replicated GPT-3].” I interpreted the word “probably” to be more than 50% confidence.
See this section for PanGu-alpha and this section for BLOOM in an appendix.
See Shevlane (2022, p. 73): “The greatest bottleneck has been getting access to enough compute. Initially Eleuther was still using Google’s TFRC scheme. This was not sufficient…”
Shevlane (2022, p. 73): “[CoreWeave] planned to buy more NVIDIA GPUs and rent them out to people training large models. Connor told me: ‘So, the deal was: we test the hardware, we figure out what do you need to train these kinds of models . . . because they don't have in-house capacity ML engineering talent. And then they buy [the hardware]. We get to train our model on it and release it for free. And everyone's happy.’”
Shevlane (2022, p. 40): “I asked Aaron [one of the Brown University graduate students that did a project replicating GPT-2] what value Google’s TFRC team would have seen in the project: ‘To test the systems, and just like...They just want to get more papers out there on it that can only be done on TPUs, because if you’re a company and you want to iterate on that for your own personal thing then you have to pay them to use TPUs. That’s basically it—that’s basically the value in general.’”
Sponsorship may also be important in the sense that it increases the number of people working on larger-scale AI projects, which may increase the number and expertise of AI engineers and researchers, which may then get hired by the leading AI labs.
On p.2 of the paper it says “We open source our code along with the training and evaluation pipelines at https://github.com/megatron-lm”. That link is broken, but version 4 of the paper (Shoeybi, 2020) changes the link to https://github.com/nvidia/megatron-lm, so I assume that these links correspond to the same codebase which has been updated over time.
On p.3 of the Megatron paper it says “Our work focuses on architectures similar to GPT-2.”
The paper’s Abstract page on arXiv says “Our code is open sourced at this https URL,” which links to the Megatron-LM GitHub repository.
See p.41 of the PDF.
See the “Specialised software tools used for development” column in the diffusion database.
See this appendix for my reasoning.
See Shevlane (2022, Ch 2 p. 3 or p. 66): “In addition to delaying the paper, another strategy was to write the paper in a way that avoids attention-grabbing. The paper was written so as to avoid ‘hype’ and include discussion of the model’s weaknesses.”
Another interesting aspect of the search trend is the regional breakdown. China had the highest fraction of total searches; South Korea ranked 2nd at 34% of China’s level, and the US ranked 17th at 11% of China’s level. However, note that many small countries rank highly because the metric is the fraction of total searches within the given region.
Shevlane (2020, p. 67).
Full correspondence is available upon request. Irving was not clear what exactly is meant by “GPT-3” in that claim—whether it was insider knowledge of GPT-3 before GPT-3 was published, or the publication of the paper, or the huge publicity after publication, or some combination of those events.
Or to produce a close enough replica of that model—the exact weight values of a trained model will always differ between independent training runs.
Note that I haven’t tried to predict how important each type of artifact will be in future diffusion cascades; I leave that to potential future research.
From my limited understanding of the Transformer architecture and how it tends to be scaled up, it is conceivable that learned weights from a smaller model could be copied into a larger model, with the extra weights starting from freshly initialized values. But even if this is possible, I don’t think it would be as effective as training the full-size model from scratch, given that I have not heard of this method being used effectively.
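As a purely hypothetical illustration of what such weight copying could look like for a single layer (the dimensions and initialization here are made up for the example; no lab in my case studies is known to have done this):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: a "trained" weight matrix from a smaller model, and a
# freshly initialized matrix for a wider layer in a larger model.
d_small, d_large = 768, 1024
trained = rng.normal(0.0, 0.02, size=(d_small, d_small))
fresh = rng.normal(0.0, 0.02, size=(d_large, d_large))

# Copy the learned block into the corner of the larger matrix; the extra
# rows and columns keep their random initial values and would need to be
# trained from scratch.
fresh[:d_small, :d_small] = trained
```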
This claim is based on the fact that all nine of the large language models I studied in depth detail their model architecture and associated hyperparameters—see this column in the diffusion database.
Shevlane (2022, Ch. 2 p.3, or p.66): “Proponents of AGI risk will sometimes criticise OpenAI for contributing too much to advances in AI capabilities [...] It appears that these kinds of considerations did inform the way that GPT-3 was shared. [an OpenAI staff member] told me: ‘GPT-3 existed for a long time before the paper came out. We delayed the paper. That was one of the things we could do for AGI stuff. But it’s months, it doesn't really count.’”
My best guess is that the GPT-3 175B model finished training in October 2019, seven months before publication in May 2020—my reasoning is in the note of this cell of the diffusion database. I guess that the evaluation and paper-writing process took about three months in total, based on my intuition of how long different steps take. I think this is longer than most AI research papers, but the paper is long and seems to have required unusually high effort. That implies a four-month delay in publication.
The Model Card in Appendix B of the paper (p.49) states the "Model Date" is December 2020, and according to the paper that introduces Model Cards this means "When was the model developed?" I interpret “developed” as the date that the model finished training—this interpretation is partly based on another detail from the Gopher paper (Rae et al., 2021): "We trained Gopher for 920 hours in November and December 2020 in Google’s Georgia datacentre." (Appendix F, p.103)
This is based on at least two AI developers at leading AI labs agreeing with me in informal conversation that this does sometimes occur, but I do not have any record of those conversations.
The article states “One of the biggest secrets is the project OpenAI is working on next. Sources described it to me as the culmination of its previous four years of research: an AI system trained on images, text, and other data using massive computational resources.”