Environments as a bottleneck in AGI development

It seems like right now, environments are not remotely the bottleneck. No one failed to solve Go because gosh darn it, they just lacked a good Go simulator which correctly implemented the rules of the game; the limits to solving ALE-57 (like Montezuma's Revenge) in general or as a single multi-task agent do not seem to be a lack of levels; Procgen performance is not weak because of insufficient variation in levels; OpenAI Universe failed not for lack of tasks, to say the least; the challenge in replicating GPT-3 is not in scraping the text (and GPT-3 didn't even run 1 epoch!). Datasets/environments sometimes unlock new performance, like ImageNet, but even when one saturates, there are typically more datasets which are not yet solved and cannot be solved simultaneously (JFT-300M, for example), and in the case of RL of course compute=data. If you went to any DRL researcher, I don't think many of them would name "we've solved all the existing environments to superhuman level and have unemployed ourselves!" as their biggest bottleneck.

Is it really the case that at some point we will be drowning in so many GPUs and petaflops that our main problem will become coming up with ever more difficult tasks to give them something useful to train on? Or is this specifically a claim about friendly AGI, where we lack any kind of environment which would seem to force alignment for maximum score?

The Haters Gonna Hate Fallacy

Communication is hard and – importantly – contextual. Most of your readers will be reasonable people.

You think this partially because you are not famous or a popular writer.

By the 1% rule of Internet participation, you hear mostly from an extremely self-selected group of critics. You don't hear from the reasonable people, you hear from the unreasonable people. The more popular you get, the more this is true. And there is a lizardman constant going on: there is a fringe of crazy, stubborn readers who will fail to read the most plain and straightforward writing, misinterpret it in the wackiest way, hate you more the better you write, and amplify the craziest things they can find. (At my level of relative obscurity, it's petty stuff: sneers, doxing, death/swatting threats, ML researchers trying to get me fired, FBI visits, that sort of thing. Scott seems to have similar issues, just more so. But by the time you reach Tim Ferriss numbers of readers, this will have escalated to 'attempted kidnappings by organized crime' levels of risk, and he notes that it escalates still further to attempted murder of popular YouTubers etc.)

Combine this with the asymmetry of loss and reward, where criticism hurts a lot more than praise helps, and the more popular you get, the worse you will feel about everything you write or do, regardless of quality.

...Unless you constantly keep in mind: "haters gonna hate". If a criticism doesn't immediately make sense to you or you felt you dealt with it adequately, and it comes from someone you don't already know or trust, then oh well - haters gonna hate. If you're genuinely unsure, run a poll or A/B test or something to hear from a less self-selected sample - but do anything other than naively listening to and believing your critics! That's a luxury permitted only to the most obscure or heavily filter-bubbled.

Where is human level on text prediction? (GPTs task)

Might as well finish out this forecasting exercise...

If we assume compute follows the current trend of peak AI project compute doubling every 3.4 months, then 2.2e6× more compute would be log2(2.2e6) ≈ 21.1, rounded up to 22 doublings away - or 22*(3.4/12) ≈ 6.2 years, or 2027. (Seems a little unlikely.)
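The doubling arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation in the paragraph above; the variable names are mine, and rounding the doubling count up to a whole number is an assumption about how the figure of 22 was reached.

```python
import math

# Inputs taken from the text above.
compute_multiplier = 2.2e6     # extra compute needed relative to GPT-3
doubling_time_months = 3.4     # doubling time of peak AI project compute

# Number of doublings needed to cover the multiplier.
doublings_exact = math.log2(compute_multiplier)
# Assumption: round up to whole doublings, as the text appears to do.
doublings = math.ceil(doublings_exact)

# Convert doublings into calendar time.
years = doublings * doubling_time_months / 12

print(f"exact doublings:   {doublings_exact:.2f}")
print(f"rounded doublings: {doublings}")
print(f"years at trend:    {years:.1f}")
```

At this trend the wait comes out at a little over 6 years, which is where the 2027 figure comes from.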

Going the other direction, Hernandez & Brown 2020's estimate is that, net of hardware & algorithmic progress, the cost of a fixed level of performance halves every 16 months; so if GPT-3 cost ~$5m in early 2020, then it'll cost $2.5m around mid-2021, and so on. Similarly, a GPT-human requiring 2.2e6× more compute would presumably cost on the order of $10 trillion in 2020, but after 14 halvings (18 years) would cost $1b in 2038.
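The cost side can be sketched the same way. The dollar figures and the 16-month halving time are the ones quoted above; the $1b target and the function name `halvings_needed` are just illustrative choices of mine.

```python
import math

# Inputs taken from the text above.
gpt3_cost_usd = 5e6            # rough GPT-3 training cost, early 2020
compute_multiplier = 2.2e6     # extra compute for a "GPT-human"
halving_time_months = 16       # Hernandez & Brown 2020 estimate

def halvings_needed(start_cost, target_cost):
    """Whole number of cost halvings to get from start_cost down to target_cost."""
    return math.ceil(math.log2(start_cost / target_cost))

# Cost of the scaled-up model at 2020 prices: about $11 trillion.
cost_2020 = gpt3_cost_usd * compute_multiplier

# Hypothetical target budget of $1 billion.
halvings = halvings_needed(cost_2020, 1e9)
years = halvings * halving_time_months / 12
cost_after = cost_2020 / 2 ** halvings

print(f"2020 cost:      ${cost_2020:.2e}")
print(f"halvings:       {halvings}")
print(f"years:          {years:.1f}")
print(f"cost afterward: ${cost_after:.2e}")
```

Fourteen halvings take a bit under 19 years and leave the cost somewhat below $1b, consistent with the rough 2038 estimate.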

Metaculus currently seems to be roughly in between 2027 and 2038, incidentally.

Why GPT wants to mesa-optimize & how we might change this

It still is; it's just that beam search (or other search strategies) seems to be mostly useful for closed-ended short text generation; translating a sentence is apparently a task with enough right-or-wrong-ness to it that beam search taps into no pathologies. But those pathologies get exposed in open-ended longform generation.

Where is human level on text prediction? (GPTs task)

It's probably a lower bound. These datasets tend to be fairly narrow by design. I'd guess it's more than 2x across all domains globally. And cutting the absolute loss by 50% will be quite difficult. Even increasing the compute by 1000x only gets you about half that under the best-case scenario... Let's see, to continue my WebText crossentropy example, 1000x reduces the loss by about a third, so if you want to halve it (we'll assume that's about the distance to human performance on WebText) from 1.73 to 0.86, you'd need (2.57 * (3.64 * (10^3 * x))^(-0.048)) = 0.86 where x = 2.2e6 or 2,200,000x the compute of GPT-3. Getting 2.2 million times more compute than GPT-3 is quite an ask over the next decade or two.
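For anyone who wants to check the 2.2e6 figure, here is a minimal sketch that inverts the power law used above, L(C) = 2.57 * C^(-0.048), with GPT-3 at C = 3.64e3. The function names are mine, and I am assuming the compute units are petaflop/s-days, which the formula itself does not state.

```python
# Power-law fit quoted in the text: loss = A * compute ** (-ALPHA)
A = 2.57
ALPHA = 0.048
GPT3_COMPUTE = 3.64e3  # assumed to be in petaflop/s-days

def loss(compute):
    """Predicted WebText cross-entropy at a given amount of compute."""
    return A * compute ** (-ALPHA)

def compute_for_loss(target_loss):
    """Invert the power law: compute needed to reach target_loss."""
    return (target_loss / A) ** (-1.0 / ALPHA)

gpt3_loss = loss(GPT3_COMPUTE)
# Fraction of the loss that remains after a 1000x compute increase.
ratio_1000x = loss(1000 * GPT3_COMPUTE) / gpt3_loss
# Compute multiplier over GPT-3 needed to reach a loss of 0.86.
multiplier = compute_for_loss(0.86) / GPT3_COMPUTE

print(f"GPT-3 predicted loss:        {gpt3_loss:.2f}")
print(f"loss remaining after 1000x:  {ratio_1000x:.2f}")
print(f"multiplier to reach 0.86:    {multiplier:.2e}")
```

With an exponent this small, 1000x the compute removes a bit under 30% of the loss, which is why halving it takes a multiplier in the millions.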

Where is human level on text prediction? (GPTs task)

Looking more into reported perplexities, the only benchmark which seems to allow direct comparison of human vs GPT-2 vs GPT-3 is LAMBADA.

LAMBADA was benchmarked at a GPT-2 perplexity of 8.6, and a GPT-3 perplexity of 3.0 (zero-shot) & 1.92 (few-shot). OA claims in their GPT-2 blog post (but not the paper) that human perplexity is 1-2, but provides no sources and I couldn't find any. (The authors might be guessing based on how LAMBADA was constructed: examples were filtered by whether two independent human raters provided the same right answer.) Since LAMBADA is a fairly restricted dialogue dataset, although constructed to be difficult, I'd suggest that humans are much closer to 1 than 2 on it.

So overall, it looks like the best guess is that GPT-3 continues to have somewhere around twice the absolute error of a human.

Mati_Roy's Shortform

and error and hyperparameter tuning that would probably increase the cost several-fold.

All of which was done on much smaller models and GPT-3 just scaled up existing settings/equations - they did their homework. That was the whole point of the scaling papers, to tell you how to train the largest cost-effective model without having to brute force it! I think OA may well have done a single run and people are substantially inflating the cost because they aren't paying any attention to the background research or how the GPT-3 paper pointedly omits any discussion of hyperparameter tuning and implies only one run (eg the dataset contamination issue).

Where is human level on text prediction? (GPTs task)

To simplify Daniel's point: the pretraining paradigm claims that language draws heavily on important domains like logic, causal reasoning, world knowledge, etc; to reach human absolute performance (as measured in prediction: perplexity/cross-entropy/bpc), a language model must learn all of those domains roughly as well as humans do; GPT-3 obviously has not learned those important domains to a human level; therefore, if GPT-3 had the same absolute performance as humans but not the same important domains, the pretraining paradigm must be false because we've created a language model which succeeds at one but not the other. There may be a way to do pretraining right, but one turns out to not necessarily follow from the other and so you can't just optimize for absolute performance and expect the rest of it to fall into place.

(It would have turned out that language models can model easier or inessential parts of human corpuses enough to make up for skipping the important domains; maybe if you memorize enough quotes or tropes or sayings, for example, you can predict really well while still failing completely at commonsense reasoning, and this would hold true no matter how much more data was added to the pile.)

As it happens, GPT-3 has not reached the same absolute performance because we're just comparing apples & oranges. I was only talking about WebText in my comment there, but Omohundro is talking about Penn Treebank & 1BW. As far as I can tell, GPT-3 is still substantially short of human performance.
