Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://aitracker.org/

TLDR: We've put together a website to track recent releases of superscale models, and comment on the immediate and near-term safety risks they may pose. The website is little more than a view of an Airtable spreadsheet at the moment, but we'd greatly appreciate any feedback you might have on the content. Check it out at aitracker.org.

Longer version:

In the past few months, several successful replications of GPT-3 have been publicly announced. We've also seen the first serious attempts at scaling significantly beyond it, along with indications that large investments are being made in commercial infrastructure that's intended to simplify training the next generation of such models.

Today's race to scale is qualitatively different from previous AI eras in a couple of major ways. First, it's driven by an unprecedentedly tight feedback loop between incremental investment in AI infrastructure and expected profitability [1]. Second, it's inflected by nationalism: there have been public statements to the effect that a given model will help the developer's home nation maintain its "AI sovereignty," a concept that would have been alien just a few short years ago.

The replication and proliferation of these models likely pose major risks. These risks are uniquely hard to forecast, not only because many capabilities of current models are novel and might be used to do damage in imaginative ways, but also because the capabilities of future models can't be reliably predicted [2].

AI Tracker

The first step to assessing and addressing these risks is to get visibility into the trends they arise from. In an effort to do that, we've created AI Tracker: a website to catalog recent releases of superscale AI models, and other models that may have implications for public safety around the world.

You can visit AI Tracker at aitracker.org.


Each model in AI Tracker is labeled with several key features: its input and output modalities; its parameter count and total compute cost (where available); its training dataset; its known current and extrapolated future capabilities; and a brief description and industry context, among others. The idea behind the tracker is to highlight the plausible public safety risks these models pose, and to situate each model as an instance of a broader scaling trend.
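As a rough illustration of the kind of record described above, here is a minimal Python sketch. It is not how the tracker is actually built (the site is currently a view of an Airtable spreadsheet); the class name, field names, and the sorting helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelEntry:
    """One row of a hypothetical tracker table (all field names are assumptions)."""
    name: str
    input_modalities: List[str]
    output_modalities: List[str]
    training_dataset: str
    description: str
    # Size and compute are only recorded "where available", so both may be None.
    parameter_count: Optional[int] = None
    training_compute_pfs_days: Optional[float] = None
    current_capabilities: List[str] = field(default_factory=list)
    extrapolated_capabilities: List[str] = field(default_factory=list)


def sort_by_parameter_count(entries: List[ModelEntry]) -> List[ModelEntry]:
    """Return entries largest-first; entries with undisclosed counts go last."""
    return sorted(
        entries,
        key=lambda e: (e.parameter_count is None, -(e.parameter_count or 0)),
    )
```

Keeping the size and compute fields optional reflects the "where available" caveat: an undisclosed value is stored as missing rather than guessed, and the sort simply pushes those entries to the end.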

(There's also a FAQ at the bottom of the page, if you'd like to know a bit more about our process or motivations.)

Note that we don't directly discuss x-risk in these entries, though we may do so in the future. Right now our focus is on 1) the immediate risks posed by applications of these models, whether from accidental or malicious use; and 2) the near-term risks that would be posed by a more capable version of the current model [3]. These are both necessarily speculative, especially 2).

Note also that we expect we'll be adding entries to AI Tracker retroactively — sometimes the significance of a model is only knowable in hindsight.

Some of the models listed in AI Tracker are smaller in scale than GPT-3, despite having been developed after it. In these cases, we've generally chosen to include the model either because of its modality (e.g., CLIP, which classifies images) or because we believe it has particular implications for capability proliferation (e.g., GPT-J, whose weights have been open-sourced).

AI Tracker is still very much in its early stages. We'll be adding new models, capabilities and trends as they surface. We also expect to improve the interface so you'll be able to view the data in different ways (plots, timelines, etc.).

Tell us how to improve!

We'd love to get your thoughts about the framework we're using for this, and we'd also greatly appreciate any feedback you might have at the object level. Which of our risk assessments look wrong? Which categories didn't we include that you'd like to see? Which significant models did we miss? Are any of our claims incorrect? Do we seem to speak too confidently about something that's actually more uncertain, or vice versa? In terms of the interface (which is very basic at the moment): What's annoying about it? What would you like to be able to do with it, that you currently can't?

For public discussion, please drop a comment below in LW or AF. I — Edouard, that is — will be monitoring the comment section periodically over the next few days and I'll answer as best I can.

If you'd like to leave feedback or request an update on an aspect of the tracker itself (e.g., submit a new model for consideration or point out an error), you can submit feedback directly on the page itself. We plan to credit folks, with their permission, for any suggestions of theirs that we implement.

Finally, if you'd like to reach out to me (Edouard) directly, you can always do so by email: [my_first_name]@mercurius.ai.


[1] This feedback loop isn't perfectly tight at the margin, since there's still a meaningful barrier to entry for training superscale models, in terms of both engineering resources and physical hardware. But even that barrier can be cleared by many organizations today, and it will likely disappear entirely once the necessary training infrastructure gets abstracted into a pay-per-use cloud offering.

[2] As far as I know, at least. If you know of anyone who's been able to correctly predict the capabilities of a 10x scale model from the capabilities of the corresponding 1x scale model, please introduce us!

[3] Of course, it's not really practical to define "more capable version of the current model" in any precise way that all observers will agree on. But you can think of this approximately as, "take the current model's architecture, scale it by 2-10x, and train it to ~completion." It probably isn't worth the effort to sharpen this definition much further, since most of the uncertainty about risk comes from our inability to predict the qualitative capabilities of models at these scales anyway.


 

13 comments

This seems like a great resource. I also like the way it’s presented. It’s very clean.

I’d appreciate more focus on the monetary return on investment large models provide their creators. I think that’s the key metric that will determine how far firms scale up these large models. Relatedly, I think it’s important to track advancements that improve model/training efficiency because they can change the expected ROI for further scaling models.

Thanks for the kind words and thoughtful comments.

You're absolutely right that expected ROI ultimately determines scale of investment. I agree on your efficiency point too: scaling and efficiency are complements, in the sense that the more you have of one, the more it's worth investing in the other.

I think we will probably include some measure of efficiency as you've suggested. But I'm not sure exactly what that will be, since efficiency measures tend to be benchmark-dependent so it's hard to get apples-to-apples here for a variety of reasons. (e.g., differences in modalities, differences in how papers record their results, but also the fact that benchmarks tend to get smashed pretty quickly these days, so newer models are being compared on a different basis from old ones.) Did you have any specific thoughts about this? To be honest, this is still an area we are figuring out.

On the ROI side: while this is definitely the most important metric, it's also the one with by far the widest error bars. The reason is that it's impossible to predict all the creative ways people will use these models for economic ends — even GPT-3 by itself might spawn entire industries that don't yet exist. So the best one could hope for here is something like a lower bound with the accuracy of a startup's TAM estimate: more art than science, and very liable to be proven massively wrong in either direction. (Disclosure: I'm a modestly prolific angel investor, and I've spoken to — though not invested in — several companies being built on GPT-3's API.)

There's another reason we're reluctant to publish ROI estimates: at the margin, these estimates themselves bolster the case for increased investment in scaling, which is concerning from a risk perspective. This probably wouldn't be a huge effect in absolute terms, since it's not really the sort of thing effective allocators weigh heavily as decision inputs, but there are scenarios where it matters and we'd rather not push our luck.

Thanks again!

Some ideas for improvements:

The ability to sort by model size etc would be nice. Currently sorting is alphabetical. 

Also the rows with long textual information should be more to the right and the more informative/tighter/numerical columns more to the left (like "deep learning" in almost all rows, not very informative). Ideally the most relevant information would be on the initial page without scrolling.

"Date published" and "date trained" can be quite different. Maybe worth including the latter?

Thanks so much for the feedback!

The ability to sort by model size etc would be nice. Currently sorting is alphabetical. 

Right now the default sort is actually chronological by publication date. I just added the ability to sort by model size and compute budget at your suggestion. You can use the "⇅ Sort" button in the Models tab to try it out; the rows should now sort correctly.

Also the rows with long textual information should be more to the right and the more informative/tighter/numerical columns more to the left (like "deep learning" in almost all rows, not very informative). Ideally the most relevant information would be on the initial page without scrolling.

You are absolutely right! I've just taken a shot at rearranging the columns to surface the most relevant parts up front and played around a bit with the sizing. Let me know what you think.

"Date published" and "date trained" can be quite different. Maybe worth including the latter?

That's true, though I've found the date at which a model was trained usually isn't disclosed as part of a publication (unlike parameter count and, to a lesser extent, compute cost). There is also generally an incentive to publish fairly soon after the model's been trained and characterized, so you can often rely on the model not being that stale, though that isn't universal.

Is there a particular reason you'd be interested in seeing training dates as opposed to (or in addition to) publication dates?

Thanks again!

Much better now!

The date published vs date trained was on my mind because of Gopher. It seemed very relevant to me that DeepMind trained a significantly larger model within basically half a year of the publication of GPT-3.

Together with Google Brain also being quite coy about their 100+B model, it made me update a lot in the direction of "the big players will replicate any new breakthrough very quickly but not necessarily talk about it."

To be clear, I also think it probably doesn't make sense to include this information in the list, because it is too rarely relevant. 

It's worth noting that aside from the ridiculous situation where Googlers aren't allowed to name LaMDA (despite at least 5 published papers so far), Google has been very coy about MUM & Pathways (to the point where I'm still not sure if 'Pathways' is an actual model that exists, or merely an aspirational goal/name of a research programme). You also have the situation where a model like LG's new 300b Exaone is described in a research paper that makes no mention of Exaone (the Korean coverage briefly mentions the L-Verse arch, but none of the English coverage does), or where we still have little idea what the various Wudao models (the MoE recordholders...?) do. And how about that Megatron-NLG-500b, eh? Is it cool, or not? A blog post, and one paper about how efficiently it can censor tweets, is not much to evaluate it on.

And forget about real evaluation! I'm sure OA's DALL-E successor, GLIDE, is capable of very cool things, which people would find if they could poke at it to establish things like CLIP's* ability to do visual analogies or "the Unreal Engine prompt"; but we'll never know because they aren't going to release it, and if they do, it'll be locked behind an API where you can't do many of the useful things like backpropping through it.

We are much more ignorant about the capabilities of the best models today than we were a year ago.

Increasingly, we're gonna need range/interval notation and survival/extremes analysis to model things, since exact dates, benchmarks, petaflop/s-days, and parameter counts will be unavailable. Better start updating your data model & graphs now.

* One gets the impression that if OA had realized just how powerful CLIP was for doing more than just zero-shot ImageNet classification or re-ranking DALL-E samples, they probably wouldn't've released the largest models of it. Another AI capabilities lesson: even the creators of something don't always know what it is capable of. "Attacks only get better."
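To make the range/interval suggestion above concrete, here is a minimal Python sketch of a value that may be exact, bounded, or known only as a lower bound (as with a "100+B" parameter count). The `Interval` class and its method names are hypothetical, chosen for illustration rather than taken from the tracker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """A closed range [low, high]; high=None means no known upper bound."""
    low: float
    high: Optional[float] = None

    def __post_init__(self):
        # Reject ranges whose upper end is below the lower end.
        if self.high is not None and self.high < self.low:
            raise ValueError("high must be >= low")

    @classmethod
    def exact(cls, value: float) -> "Interval":
        """A fully disclosed value."""
        return cls(value, value)

    @classmethod
    def at_least(cls, value: float) -> "Interval":
        """A publicly stated lower bound, e.g. a '100+B' parameter count."""
        return cls(value, None)

    def is_exact(self) -> bool:
        return self.high is not None and self.high == self.low

    def could_exceed(self, threshold: float) -> bool:
        """True if some value in the range is greater than threshold."""
        return self.high is None or self.high > threshold
```

For example, `Interval.at_least(100e9).could_exceed(1e12)` is true, because a bare lower bound rules nothing out above it; any plot or trend line built on such a field would need to show the open end rather than a single point.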

This is an excellent point and it's indeed one of the fundamental limitations of a public tracking approach. Extrapolating trends in an information environment like this can quickly degenerate into pure fantasy. All one can really be sure of is that the public numbers are merely lower bounds — and plausibly, very weak ones.

Yeah, great point about Gopher, we noticed the same thing and included a note to that effect in Gopher's entry in the tracker.

I agree there's reason to believe this sort of delay could become a bigger factor in the future, and may already be a factor now. If we see this pattern develop further (and if folks start publishing "model cards" more consistently like DM did, which gave us the date of Gopher's training) we probably will begin to include training date as separate from publication date. But for now, it's a possible trend to keep an eye on.

Thanks again!

DreamerV2 seems worthy of inclusion to me. In general, it would be great to see older models incorporated as well; I know this has been done before but having it integrated in a live tracker like yours would be super convenient as a one-stop shop of historical context. It would save people from making lots of new lists every time an important new model gets released.

Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?

it would be great to see older models incorporated as well

We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems. And prior to GPT-3, it's hard to find models whose public safety risks were meaningful on a standalone basis — while an earlier model could have been used as part of a malicious act, for example, it wouldn't be as central to such an act as a modern model would be.

Interesting; I hadn't heard of DreamerV2. From a quick look at the paper, it looks like one might describe it as a step on the way to something like EfficientZero. Does that sound roughly correct?

Yes. They don't share a common lineage, but are similar in that they're both recent advances in efficient model-based RL. Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.

We may extend this to older models in the future. But our goal right now is to focus on these models' public safety risks as standalone (or nearly standalone) systems.

I see. If you'd like to visualize trends though, you'll need more historical data points, I think.

Personally speaking, I think this is the subfield to be closely tracking progress in, because 1) it has far-reaching implications in the long term and 2) it has garnered relatively little attention compared to other subfields.

Thanks for the clarification — definitely agree with this.

If you'd like to visualize trends though, you'll need more historical data points, I think.

Yeah, you're right. Our thinking was that we'd be able to do this with future data points or by increasing the "density" of points within the post-GPT-3 era, but ultimately it will probably be necessary (and more compelling) to include somewhat older examples too.

Great idea!