Staged release

Zach Stein-Perlman

"Staged release" is regularly mentioned as a good thing for frontier AI labs to do. But I've only ever seen one analysis of staged release,^[1] and the term's meaning has changed, becoming vaguer since the GPT-2 era.

This post is kinda a reference post, kinda me sharing my understanding/takes/confusions to elicit suggestions, and kinda a call for help from someone who understands what labs should do on this topic.

OpenAI released the weights of GPT-2 over the course of 9 months in 2019,^[2] and they called this process "staged release." In the context of GPT-2 and releasing model weights, "staged release" means releasing a less powerful version before releasing the full system, and using the intervening time to notice issues (to fix them or inform the full release).

But these days we talk mostly about models that are only released^[3] via API. In this context, "staged release" has no precise definition; it generally means releasing narrowly at first and using the narrow release to notice and fix issues. This could entail:

Releasing a weaker version before releasing the full model.
Releasing only to a small number of users at first.
Releasing without full access or for only some applications at first, e.g. disabling fine-tuning and plugins.

...I'm kinda skeptical. These days it's rare for a release to advance the frontier substantially, and for small jumps, there seems to be little need to make the jumps smoother. At least, this is true for now, since we have not reached warning signs for dangerous capabilities. When we enter the danger zone where smallish advances in capabilities could enable catastrophic misuse, staged release will be more important.

And if a lab is doing everything else well—risk assessment with evals or red-teaming that successfully elicit any dangerous capabilities, releasing with a safety buffer in case the risk assessment underestimated dangerous capabilities, adversarial robustness, and monitoring model inputs and fine-tuning data for jailbreaks and misuse—staged release seems superfluous. But my impression is no lab is doing everything else very well yet. If the labs are bad at pre-release risk assessment and bad at preventing misuse at inference-time, staged release is more important.

What do I want frontier labs to do on staged release (for closed models, for averting misuse)? I think it's less important than most other asks, but tentatively:

For frontier models,^[4] initially release them without access to fine-tuning or powerful scaffolding. Use narrow release to identify and fix issues. (Or initially give more access only to trusted users, or to few enough untrusted users that you can monitor them closely, and then actually monitor them closely.)

I just made this up; probably there is a better ask on staged release. Suggestions are welcome.

^[1]
Toby Shevlane's dissertation. I don't recommend reading it.
^[2]
From OpenAI's report on the GPT-2 staged release:
In February 2019, we released the 124 million parameter GPT-2 language model. In May 2019, we released the 355 million parameter model and a dataset of outputs from all four models (124 million, 355 million, 774 million, and 1.5 billion parameters) to aid in training humans and classifiers to detect synthetic text, and assessing biases encoded in GPT-2 generated outputs. In August, we released our 774 million parameter model along with the first version of this report and additional release documentation on GitHub. We are now [in November] releasing our 1.5 billion parameter version of GPT-2 with this updated report and updated documentation.
^[3]
This post mostly-arbitrarily uses "release" and not "deploy." (I believe "deployment" includes use exclusively within the lab while "release" requires external use; in this post we're basically concerned with misuse by actors outside the lab.)
^[4]
Or rather, models that are plausibly nontrivially better for some misuse-related tasks than any other released model.

Buck (3mo ago)

"without access to fine-tuning or powerful scaffolding."

Note that normally it's the end user who decides whether they're going to do scaffolding, not the lab. It's probably feasible but somewhat challenging to prevent end users from doing powerful scaffolding (and I'm not even sure how you'd define that).

Zach Stein-Perlman (3mo ago)

Yes, but possibly the lab has its own private scaffolding that works better with its model than any other existing scaffolding, perhaps because it trained the model to use that specific scaffolding, and it can initially withhold that scaffolding from users.

(Maybe it’s impossible to give API access to scaffolding and keep the scaffolding private? Idk.)

Edit: Plus what David says.

Davidmanheim (3mo ago)

I thought the point was that either managed-interface-only access, or API access with rate limits, monitoring, and appropriate terms of service, can prevent the use of some forms of scaffolding. If it's staged release, this makes sense to do, at least for a brief period while confirming that there are no security or safety issues.

Davidmanheim (3mo ago)

"These days it's rare for a release to advance the frontier substantially."

This seems to be one crux. Sure, there's no need for staged release if the model doesn't actually do much more than previous models, and doesn't have unpatched vulnerabilities of types that would be identified by somewhat broader testing.

The other crux, I think, is around public release of model weights. (Often referred to, incorrectly, as "open sourcing.") Staged release implies not releasing weights immediately, and I think this is one of the critical issues with what companies like X have done that make it important to demand staged release for any model claiming to be as powerful as, or more powerful than, current frontier models. (In addition to testing and red-teaming, which they also don't do.)