This post is one part of the sequence Understanding the diffusion of large language models. As context for this post, I strongly recommend reading at least the 5-minute summary of the sequence.
In this post, I:
Table 1: Degree to which algorithmic details and hyperparameter values were disclosed for models in my case studies. See the diffusion database.
Table 2: Degree to which training code, training datasets, and trained models were released for models in my case studies. See the diffusion database.
I’m highly uncertain about the effects of these individual publication decisions, but these are my best guesses:
Table 3: Summary of rationale for release strategies for different models. Click through to the diffusion database for sources and reasoning (which is the basis for my claims in the text that follows, unless stated otherwise). For context, it’s helpful to review the information on the release strategies themselves in the previous section (or just view all the columns side-by-side in the database).
In the cases where models were not released, the apparent incentives included protecting commercial IP, as well as concerns about the misuse and harm that diffusion could enable. On the misuse side, it is notable that Google, DeepMind, and OpenAI (three of the leading AI developers, if not the three leaders) have all based publication decisions partly on misuse concerns.
A key question is Google’s publication strategy going forward. I’m more uncertain about this than I am about DeepMind’s and OpenAI’s strategies. To my knowledge, DeepMind has a comparatively strong history of closed publication (there are notable exceptions, such as AlphaFold, but that was openly published years after the system was first publicized). OpenAI was historically more open, but has made a clear turn toward closedness in recent years with GPT-2, GPT-3, and DALL-E. Meanwhile, the teams working on AI at Google seem to have been more mixed in recent years:
When I asked Iulia Turc (former Software Engineer at Google Research) about the rationale for not releasing model weights, data, code, etc. for the language models they knew about at Google, they told me: "Ethical concerns were the single most important reason (that's why LaMDA hasn't been released yet). I know this can sometimes sound like PR, but I do think that everyone in my management chain was genuinely worried [about] generative models falling into the wrong hands. Google is very skeptical of including generative models even in its own internal products."
Even as actors like DeepMind and OpenAI adopt a more closed publication strategy for GPT-3-like models, there is a reactive push for greater openness.
In between the concerns about misuse and the push for openness are middling cases like Jurassic-1-Jumbo. AI21 Labs did not openly publish the Jurassic-1-Jumbo code or trained model weights. However, their stated rationale for making a model API available was along the lines of accelerating AI development and democratizing access to AI technology. Commercial interest could mostly explain the API release, but it may not fully explain why the API is free to use at an entry level for anyone who signs up.
A general point to emphasize here is that publication strategies vary a lot, and that’s because publication incentives vary a lot. There are several different incentives that cumulatively influence publication decisions:
As an example of how these incentives lead to specific publication decisions for specific actors, consider NVIDIA’s Megatron model (which was not part of my case studies). NVIDIA openly published the code for the Megatron model but did not release the trained model weights. As noted in Shevlane (2022, p. 42), releasing the code is in NVIDIA’s commercial interest: the code helps and encourages people to train language models on NVIDIA hardware. In contrast, releasing the model does not seem to benefit NVIDIA, and may even have harmed their commercial interests by saving some people the trouble of training their own model.
For another example, AI21 Labs’ business is in providing trained AI models as a service. So it seems best for AI21 Labs not to openly publish their models, but it’s in their interest to attract customers by making their model API free to use at the entry level, and to discover improvements and new applications by seeing how the API is used.
This research is a project of Rethink Priorities. It was written by Ben Cottier. Thanks to Alexis Carlier, Amanda El-Dakhakhni, Ashwin Acharya, Ben Snodin, Bill Anderson-Samways, Erich Grunewald, Jack Clark, Jaime Sevilla, Jenny Xiao, Lennart Heim, Lewis Ho, Lucy Lim, Luke Muehlhauser, Markus Anderljung, Max Räuker, Micah Musser, Michael Aird, Miles Brundage, Oliver Guest, Onni Arne, Patrick Levermore, Peter Wildeford, Remco Zwetsloot, Renan Araújo, Shaun Ee, Tamay Besiroglu, and Toby Shevlane for helpful feedback. If you like our work, please consider subscribing to our newsletter. You can explore our completed public work here.
A machine learning model can be thought of as a mathematical function. The parts of the function that are learned during training are the weights (also known as parameters). The parts of the function that remain fixed constitute the model architecture. For example, in Transformer models, the self-attention operation is part of the model architecture, but the weights used in that operation are not. See Bloem (2019) for more information about the Transformer architecture.
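To make the distinction concrete, here is a minimal sketch of single-head self-attention in PyTorch (the dimensions and names are my own illustrative choices, not taken from any model discussed here): the fixed sequence of operations is the architecture, while the projection layers hold the learned weights.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int = 64):
        super().__init__()
        # Learned weights: these tensors change during training.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5  # fixed: part of the architecture

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The sequence of operations here is fixed (architecture);
        # only the projection weights above are learned.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v
```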
Hyperparameters are parameters that control how a model learns. For example, the learning rate hyperparameter controls the magnitude of changes in weight values at each training step, and affects the stability and speed of training. There are hyperparameters associated with the training algorithm (such as the learning rate), and hyperparameters associated with the model architecture (such as the number of layers).
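As a toy illustration (the values below are arbitrary placeholders, not from any model discussed here), a basic gradient-descent update shows how the learning rate scales each change to the weights:

```python
learning_rate = 3e-4  # training hyperparameter
num_layers = 12       # architecture hyperparameter

def sgd_step(weights, grads, lr=learning_rate):
    # Larger lr -> bigger weight changes -> faster but less stable training.
    return [w - lr * g for w, g in zip(weights, grads)]
```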
See Black et al. (2022, p. 3). Note that GPT-NeoX-20B does not meet my definition of a GPT-3-like model, but I think the point about saving cost is still applicable to larger models.
The datasets for OPT-175B and BLOOM draw from other datasets which are publicly available, but the final full dataset has not been made available for either model.
See Shevlane (2022). From the Abstract: “Structured access [is where] instead of openly disseminating AI systems, developers facilitate controlled, arm's length interactions with their AI systems.”
Credit to Jeffrey Ding and Jenny Xiao for finding this case.
See Gao et al. (2020, p. 27)—at the end of the “Deduplication” section it says “In the end, we ran in-memory LSH on a machine with enough RAM for Common Crawl, taking several days.”
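As a rough illustration of the technique mentioned in that footnote (not the Pile’s actual pipeline), here is a minimal MinHash-plus-banding locality-sensitive hashing sketch in pure Python; all parameter values and function names are my own illustrative choices.

```python
import hashlib
from collections import defaultdict

NUM_PERM, BANDS = 128, 32  # 32 bands x 4 rows per band
ROWS = NUM_PERM // BANDS

def shingles(text, n=5):
    """Break a document into overlapping n-word shingles."""
    words = text.split()
    if len(words) < n:
        return {text}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set):
    # One cheap salted hash per signature slot stands in for a
    # random permutation.
    return [
        min(int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for i in range(NUM_PERM)
    ]

def lsh_buckets(signatures):
    # Documents sharing any band of their signature land in the same
    # bucket and become candidate near-duplicates.
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    return buckets
```

Pairs that collide in at least one bucket are then compared directly; with 32 bands of 4 rows, documents above roughly 0.4 Jaccard similarity are likely to collide. Running this in memory over Common Crawl is what demands the large-RAM machine the footnote describes.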
See Biderman et al. (2022, p. 6)—“This dataset was created by individuals working on their own time without funding.”
See Biderman et al. (2022, p. 2): “The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021).”
This is just a claim about research quantity rather than quality or usefulness.
See Raffel et al. (2019): “To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.”
As an aside, Shevlane (2022, p. 42) noted that AI21 Labs previously “trained a GPT-2-like model and bowed to OpenAI’s [GPT-2] release schedule, even though they did not agree that the risks of GPT-2 warranted this. This was out of ‘respect for our colleagues who have thought hard about these issues.’” AI21 Labs has since collaborated with OpenAI on "Best Practices for Deploying Language Models."
The choice of top three is a somewhat arbitrary number for the sake of concreteness—I’m not sure what the best threshold is, but three seems like a reasonable number to hedge between one actor getting far ahead vs. many actors keeping up with each other.
See e.g. https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search
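For illustration, a minimal grid search might look like the sketch below; the hyperparameter grid and the train_and_evaluate function are hypothetical placeholders.

```python
from itertools import product

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64, 128],
}

def grid_search(train_and_evaluate):
    # Exhaustively try every combination; each one costs a full
    # training run, which is why grid search gets expensive quickly.
    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = train_and_evaluate(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```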
See Brown et al. (2020, p. 9): “As found in [KMH+20, MKAT18], larger models can typically use a larger batch size, but require a smaller learning rate. We measure the gradient noise scale during training and use it to guide our choice of batch size [MKAT18].”
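The [MKAT18] reference is McCandlish et al. (2018), whose “simple noise scale” can be estimated by comparing gradient norms at two batch sizes. A hedged sketch of that estimator, based on my reading of the paper’s appendix (variable names are mine):

```python
def simple_noise_scale(b_small, b_big, g_small_sq, g_big_sq):
    """g_small_sq and g_big_sq are squared gradient norms measured
    at batch sizes b_small and b_big."""
    # Unbiased estimate of the true squared gradient norm |G|^2.
    g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    # Unbiased estimate of the per-example gradient variance tr(Sigma).
    trace_sigma = (g_small_sq - g_big_sq) / (1 / b_small - 1 / b_big)
    # B_simple = tr(Sigma) / |G|^2: batch sizes near this value
    # trade off compute and serial steps efficiently.
    return trace_sigma / g_sq
```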
I started with a prior based on my intuition of 35% (90% CI: 10% to 80%). I interpreted the expert’s view as 10% (90% CI: 1% to 30%). I then weighted the expert’s view to my own at 3:1 (taking a weighted average of the central estimate and each bound individually).
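For concreteness, that arithmetic (a 3:1 expert-to-prior weighted average, applied to the central estimate and each bound) works out as follows:

```python
def blend(prior, expert, expert_weight=3):
    # Weighted average with the expert's view weighted 3:1 to my own.
    return (prior + expert_weight * expert) / (1 + expert_weight)

central = blend(0.35, 0.10)  # -> 0.1625
lower   = blend(0.10, 0.01)  # -> 0.0325
upper   = blend(0.80, 0.30)  # -> 0.425
```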
This is based on my own experience with machine learning, and reading various machine learning papers.