Distributed public goods provision

Is there any connection to quadratic funding?

Environments as a bottleneck in AGI development

"Blessings of scale" observations aside, it seems like right now, environments are not the bottleneck to DL/DRL work. No one failed to solve Go because gosh darn it, they just lacked a good Go simulator which correctly implemented the rules of the game; the limits to solving ALE-57 (like Montezuma's Revenge) in general or as a single multi-task agent do not seem to be a lack of Atari games, as if what we really needed were ALE-526*; Procgen performance is not weak because of insufficient variation in levels; OpenAI Universe failed not for lack of tasks, to say the least; the challenge in creating or replicating GPT-3 is not in scraping the text (and GPT-3 didn't even run 1 epoch!). Datasets/environments sometimes unlock new performance, like ImageNet, but even when one saturates, there are typically more datasets which are not yet solved and cannot be solved simultaneously (JFT-300M, for example), and in the case of RL, of course, compute=data. If you went to any DRL researcher, I don't think many of them would name "we've solved all the existing environments to superhuman level and have unemployed ourselves!" as their biggest bottleneck.

Is it really the case that at some point we will be drowning in so many GPUs and petaflops that our main problem will become coming up with ever more difficult tasks to give them something useful to train on? Or is this specifically a claim about friendly AGI, where we lack any kind of environment which would seem to force alignment for maximum score?

* Apparently the existing ALE suite was chosen pretty haphazardly:

Our testing set was constructed by choosing semi-randomly from the 381 games listed on Wikipedia at the time of writing. Of these games, 123 games have their own Wikipedia page, have a single player mode, are not adult-themed or prototypes, and can be emulated in ALE. From this list, 50 games were chosen at random to form the test set.

I wonder how the history of DRL would've changed if they had happened to select from the other 73, or if Pitfall & Montezuma's Revenge had been omitted? I don't, however, think it would've been a good use of their time in 2013 to work on adding more ALE games rather than, say, debugging GPU libraries to make it easier to run NNs at all...

The Haters Gonna Hate Fallacy

Communication is hard and – importantly – contextual. Most of your readers will be reasonable people

You think this partially because you are not famous or a popular writer.

By the 1% rule of Internet participation, you hear mostly from an extremely self-selected group of critics. You don't hear from the reasonable people, you hear from the unreasonable people. The more popular you get, the more this is true. And there is a lizardman constant going on: there is a fringe of crazy, stubborn readers who will fail to read the most plain and straightforward writing, misinterpret it in the wackiest way, hate you more the better you write, and amplify the craziest things they can find. (At my level of relative obscurity, it's petty stuff: sneers, doxing, death/swatting threats, ML researchers trying to get me fired, FBI visits, that sort of thing. Scott seems to have similar issues, just more so. But by the time you reach Tim Ferriss numbers of readers, this will have escalated to 'attempted kidnappings by organized crime' levels of risk, and he notes that it escalates still further to attempted murder of popular YouTubers etc.)

Combine this with the asymmetry of loss and reward, where criticism hurts a lot more than praise helps, and the more popular you get, the worse you will feel about everything you write or do, regardless of quality.

...Unless you constantly keep in mind: "haters gonna hate". If a criticism doesn't immediately make sense to you or you feel you dealt with it adequately, and it comes from someone you don't already know or trust, then oh well - haters gonna hate. If you're genuinely unsure, run a poll or A/B test or something to hear from a less self-selected sample - but do anything other than naively listening to and believing your critics! That's a luxury permitted only to the most obscure or heavily filter-bubbled.

Where is human level on text prediction? (GPTs task)

Might as well finish out this forecasting exercise...

If we assume compute follows the current trend of peak AI project compute doubling every 3.4 months, then 2.2e6× more compute would be log2(2.2e6) ≈ 21 doublings away, or 21*(3.4/12) ≈ 6 years, i.e. around 2026-2027. (Seems a little unlikely.)
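The doubling arithmetic can be checked with a short sketch. The constant and function names here are illustrative, and the only inputs are the 2.2e6× multiplier and the 3.4-month doubling time quoted above; the unrounded result is a bit over 21 doublings and about 6 years.

```python
import math

# Inputs taken from the comment above; names are illustrative.
COMPUTE_MULTIPLIER = 2.2e6   # extra compute needed relative to GPT-3
DOUBLING_MONTHS = 3.4        # assumed doubling time of peak project compute


def years_to_scale(multiplier: float, doubling_months: float) -> tuple[float, float]:
    """Return (number of doublings, years) needed to grow compute by `multiplier`."""
    doublings = math.log2(multiplier)
    years = doublings * doubling_months / 12
    return doublings, years


doublings, years = years_to_scale(COMPUTE_MULTIPLIER, DOUBLING_MONTHS)
print(f"{doublings:.1f} doublings, {years:.1f} years")
```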

Going the other direction, Hernandez & Brown 2020's estimate is that, net of hardware & algorithmic progress, the cost of a fixed level of performance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it'll cost $2.5m around mid-2021, and so on. Similarly, a GPT-human requiring 2.2e6× more compute would presumably cost on the order of $10 trillion in 2020, but after 14 halvings (18 years) would cost $1b in 2038.
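As a rough check on the cost figures, here is a minimal sketch that assumes cost scales linearly with compute (so a 2.2e6× model costs 2.2e6× the ~$5m GPT-3 figure) and that the 16-month halving holds indefinitely. Both are simplifications, and the names are illustrative.

```python
import math

# Inputs from the comment above; linear cost scaling is an assumption.
GPT3_COST_USD = 5e6
COMPUTE_MULTIPLIER = 2.2e6
HALVING_MONTHS = 16


def cost_after_halvings(initial_cost: float, halvings: float) -> float:
    """Cost of a fixed capability after `halvings` cost halvings."""
    return initial_cost / 2 ** halvings


def halvings_to_reach(initial_cost: float, target_cost: float) -> float:
    """Number of halvings needed to bring `initial_cost` down to `target_cost`."""
    return math.log2(initial_cost / target_cost)


cost_2020 = GPT3_COST_USD * COMPUTE_MULTIPLIER            # roughly $11 trillion
cost_after_14 = cost_after_halvings(cost_2020, 14)        # cost after 14 halvings
halvings_needed = halvings_to_reach(cost_2020, 1e9)       # halvings to hit $1b
years_needed = halvings_needed * HALVING_MONTHS / 12

print(f"2020 cost: ${cost_2020:.2e}")
print(f"after 14 halvings: ${cost_after_14:.2e}")
print(f"halvings to $1b: {halvings_needed:.1f} ({years_needed:.1f} years)")
```

Under these assumptions 14 halvings lands somewhat below $1b, and the exact $1b crossing comes slightly earlier, so the 2038 figure is an order-of-magnitude estimate rather than a precise date.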

Metaculus currently seems to be roughly in between 2027 and 2038, incidentally.

Why GPT wants to mesa-optimize & how we might change this

It still is, it's just that beam search (or other search strategies) seems to be mostly useful for closed-end short text generation; translating a sentence is apparently a task with enough of a right-or-wrong-ness to it that beam search taps into no pathologies. But they get exposed for open-ended longform generation.

Where is human level on text prediction? (GPTs task)

It's probably a lower bound. These datasets tend to be fairly narrow by design. I'd guess it's more than 2x across all domains globally. And cutting the absolute loss by 50% will be quite difficult. Even increasing the compute by 1000x only gets you about half that under the best-case scenario... Let's see, to continue my WebText crossentropy example, 1000x reduces the loss by about a third, so if you want to halve it (we'll assume that's about the distance to human performance on WebText) from 1.73 to 0.86, you'd need to solve 2.57 * (3.64 * 10^3 * x)^(-0.048) = 0.86 for x, the multiple of GPT-3's compute, which gives x = 2.2e6 or 2,200,000x the compute of GPT-3. Getting 2.2 million times more compute than GPT-3 is quite an ask over the next decade or two.
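The power-law arithmetic above can be reproduced with a minimal sketch. It takes the constants 2.57, 3.64 * 10^3, and 0.048 straight from the formula in the comment; the compute units are not stated there, so the code treats the compute value as an opaque number. All names are illustrative.

```python
# Constants copied from the formula in the comment; units are not stated there.
COEFF = 2.57           # multiplicative constant of the loss power law
EXPONENT = 0.048       # power-law exponent
GPT3_COMPUTE = 3.64e3  # GPT-3's training compute in the formula's units


def loss(compute: float) -> float:
    """Predicted crossentropy loss at a given compute budget."""
    return COEFF * compute ** (-EXPONENT)


def compute_for_loss(target_loss: float) -> float:
    """Invert the power law: compute needed to reach `target_loss`."""
    return (target_loss / COEFF) ** (-1 / EXPONENT)


gpt3_loss = loss(GPT3_COMPUTE)                          # about 1.73
ratio_1000x = loss(1000 * GPT3_COMPUTE) / gpt3_loss     # loss fraction left after 1000x
multiplier = compute_for_loss(0.86) / GPT3_COMPUTE      # x in the comment's equation

print(f"GPT-3 loss: {gpt3_loss:.2f}")
print(f"loss remaining after 1000x compute: {ratio_1000x:.3f}")
print(f"compute multiplier for loss 0.86: {multiplier:.2e}")
```

Because the exponent is so small, the required multiplier is very sensitive to the target loss: small changes in the assumed human-level loss move the answer by large factors.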

Where is human level on text prediction? (GPTs task)

Looking more into reported perplexities, the only benchmark which seems to allow direct comparison of human vs GPT-2 vs GPT-3 is LAMBADA.

LAMBADA was benchmarked at a GPT-2 perplexity of 8.6, and a GPT-3 perplexity of 3.0 (zero-shot) & 1.92 (few-shot). OA claims in their GPT-2 blog post (but not the paper) that human perplexity is 1-2, but provides no sources and I couldn't find any. (The authors might be guessing based on how LAMBADA was constructed: examples were filtered by whether two independent human raters provided the same right answer.) Since LAMBADA is a fairly restricted dialogue dataset, although constructed to be difficult, I'd suggest that humans are much closer to 1 than 2 on it.

So overall, it looks like the best guess is that GPT-3 continues to have somewhere around twice the absolute error of a human.

Mati_Roy's Shortform

and error and hyperparameter tuning that would probably increase the cost several-fold.

All of which was done on much smaller models and GPT-3 just scaled up existing settings/equations - they did their homework. That was the whole point of the scaling papers, to tell you how to train the largest cost-effective model without having to brute force it! I think OA may well have done a single run and people are substantially inflating the cost because they aren't paying any attention to the background research or how the GPT-3 paper pointedly omits any discussion of hyperparameter tuning and implies only one run (eg the dataset contamination issue).
