(Concrete, easy-to-answer question below, explanation first)

Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but still compared to humans it takes a lot.

Elsewhere, based on papers like this and this, various people have extrapolated the following takes:

--It seems like bigger neural nets need to see less data to reach the same level of performance.

--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they'll only need to see each data point once. (Search this for "multiple epochs")

I feel like this take is in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*

I'd be interested to see people search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantify how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at Eleuther.ai have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
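
As a rough illustration of the counting step, here is a minimal sketch that tallies how often each name appears in a collection of documents, and in how many documents. The function name `count_mentions` and the toy corpus are hypothetical; for a real run you would stream the text of each document from whatever corpus you have access to, and the case-insensitive substring match is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_mentions(documents: Iterable[str], names: List[str]) -> Dict[str, Dict[str, int]]:
    """For each name, count total occurrences and the number of documents containing it."""
    totals = Counter()
    doc_hits = Counter()
    for text in documents:
        lowered = text.lower()
        for name in names:
            n = lowered.count(name.lower())  # simple case-insensitive substring match
            if n:
                totals[name] += n
                doc_hits[name] += 1
    return {
        name: {"occurrences": totals[name], "documents": doc_hits[name]}
        for name in names
    }


# Toy stand-in for a real corpus (hypothetical data, for illustration only).
toy_corpus = [
    "Paul Christiano wrote about amplification. Paul Christiano also blogs.",
    "An unrelated document about gardening.",
    "Wei Dai and Paul Christiano discussed decision theory.",
]
counts = count_mentions(toy_corpus, ["Paul Christiano", "Wei Dai"])
print(counts)
```

Comparing these counts against whether the model answers "Who is X?" correctly would give a crude estimate of how many mentions are needed.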

Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven't heard of it before.

*Suppose it is ten thousand. Then that means one in every ten million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was 300B tokens) Add in all the other rationalists/EAs and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.
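
The arithmetic in this footnote can be checked with a short sketch. It assumes a name is roughly two tokens and treats every token position as the start of one two-word window, which is a simplification; the helper `one_in_n` is hypothetical. Under those assumptions ten thousand mentions works out to about one window in thirty million, the same order of magnitude as the figure above.

```python
def one_in_n(mentions: int, dataset_tokens: int = 300_000_000_000) -> int:
    """Return N such that roughly one in every N two-word windows is the name.

    Assumption: the number of two-word windows is about the number of tokens.
    The 300B default is the dataset size quoted in the footnote.
    """
    return dataset_tokens // mentions


for mentions in (10, 10_000):
    print(f"{mentions:,} mentions -> about one in {one_in_n(mentions):,} two-word windows")
```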

7 Answers

The most relevant paper I know of comes out of data privacy concerns. See Extracting Training Data from Large Language Models, which defines "k-eidetic memorization" as a string that can be elicited by some prompt and appears in at most k documents in the training set. They find several examples of k=1 memorization, though the strings appear repeatedly in the source documents. Unfortunately their methodology is targeted towards high-entropy strings and so is not universal.
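
To make the definition concrete, here is a minimal sketch of the document-count half of the check. The names `documents_containing` and `is_k_eidetic` are hypothetical, and whether the string can actually be elicited from the model is passed in as a flag rather than tested here, since that step depends on the model and the prompting method.

```python
from typing import Iterable


def documents_containing(s: str, documents: Iterable[str]) -> int:
    """Count documents that contain s at least once (repeats within a document count once)."""
    return sum(1 for doc in documents if s in doc)


def is_k_eidetic(s: str, documents: Iterable[str], k: int, extractable: bool) -> bool:
    """True if s is extractable and appears in at most k training documents.

    `extractable` is supplied by the caller; this sketch does not query a model.
    """
    return extractable and documents_containing(s, documents) <= k


# Toy training set (hypothetical data).
docs = [
    "key: 9f3a 9f3a 9f3a",   # one document that repeats the string several times
    "nothing of interest",
    "common phrase",
    "common phrase again",
]
print(is_k_eidetic("9f3a", docs, k=1, extractable=True))
print(is_k_eidetic("common phrase", docs, k=1, extractable=True))
```

Note that a string repeated many times inside a single document still counts as k=1 here, which matches the caveat above about repeats within the source documents.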

I have a related question I've been trying to operationalize. How well do GPT-3's memories "generalize"? In other words, given some fact in the training data, how far out of the source distribution can GPT-3 "gain information" from that fact? 

E.g. training: "Ixlthubs live in the water." Test: does this affect the predicted likelihood of "Ixlthubs live in the Pacific"? What about "Ixlthubs cannot survive on land"? I'd consider this another interesting measure of sample efficiency/generalization performance. I'm attempting to put together a proposal for the BigScience project (some set of synthetic facts to sprinkle throughout the data), but it's my first try at something like this and slow going.

This is great, thanks! Then I wonder what people mean, exactly, when they say current methods are sample-inefficient. k=1 memorization seems to be about as good as humans, and this with tiny artificial neural nets! (Even GPT-3 is a thousand times smaller than a human brain).

Your question is super interesting as well. If you make progress on answering it, I'd love to hear!

First pass at trying to answer:

I'm asking GPT-3 questions of the form "Who is X?" to see what it knows. It knows EY, Paul, Katja, Julia, Wei Dai, Kaj Sotala... It thinks Daniel Kokotajlo is a filmmaker, which is true actually (there are two of us in the world, and the more well-known one is the filmmaker). It thinks Evan Hubinger is a software engineer.

In parallel I'm googling those names in quotes to see how many hits they get. To my surprise there are about ten thousand hits for many of these names; the more popular ones get more. But GPT-3's training data didn't contain the whole internet, right? Just a fraction of it? So presumably it had only one thousand, or one hundred, instances of each name to learn from?

Slight subtlety - GPT-3 might have a bias in its training data towards things related to AI and things of interest to the internet (maybe they scraped a lot of forums as well as just google). I picked some random names from non-western countries - for example, this Estonian politician gets 33,000 hits on Google and wasn't recognised by GPT-3. It thought he was a software developer (though from Estonia). Might mean that if you're estimating sample efficiency from Google search hits on people involved with AI, you'll end up overestimating sample efficiency.

What did it say about me? :D I think I tried asking the AI Dungeon version about me at some point but apparently the adventure game finetuning had made that knowledge inaccessible. 

Daniel Kokotajlo (2y):
I don't remember what it said the first time, but I just asked it now:
Thanks! LW is sponsored by CFAR so this is kind of correct if you squint a bit
Daniel Kokotajlo (2y):
Yeah, I'm counting things as correct if it gets in the right ballpark. Like, I myself didn't know where you worked exactly, but CFAR sounded plausible, especially as a place you may have worked in the past. The fact that GPT-3 said you work at CFAR means it thinks you are part of the rationalist community, which is pretty impressive IMO.

I think this becomes a lot clearer if we distinguish between total and marginal thinking. GPT-3's total sample efficiency for predicting text is poor:

  • To learn to predict text, GPT-3 has to read >1000x as much text as a human can read in their lifetime.
  • To learn to win at Go, AlphaGo has to play >100x as many games as a human could play in their lifetime.

But on-the-margin, it's very sample efficient at learning to perform new text-related tasks:

  • GPT-3 can learn to perform a new text-related task as easily as a human can.

Essentially, what's happened is GPT-3 is a kind-of mega-analytical-engine that was really sample inefficient to train up to its current level, but that can now be trained to do additional stuff at relatively little extra cost.

Does that resolve the sense of confusion/mystery, or is there more to it that I'm missing?

That does help, thanks. However, now that I understand better what people are saying, I think it's wrong:

The comparison they are making is as follows:

GPT-3: Pre-trained on 3x10^11 tokens of text. Able to read a new fact once or twice and then learn it / remember it.
Human: Pre-trained on 3x10^8 tokens of text (Fermi estimate based on 300 WPM, so maybe 500 tokens per minute, 10 hours per week reading, 52 weeks a year, over 20 years of life). Able to read a new fact once or twice and then learn it / remember it.
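
The Fermi estimate for the human figure can be reproduced in a few lines. All inputs are the ones stated in the comparison above; the variable names are just for illustration.

```python
TOKENS_PER_MINUTE = 500   # the comparison's guess, based on about 300 words per minute
MINUTES_PER_HOUR = 60
HOURS_PER_WEEK = 10       # hours spent reading per week
WEEKS_PER_YEAR = 52
YEARS = 20

# Lifetime reading in tokens: 312,000,000, i.e. about 3x10^8.
human_tokens = (
    TOKENS_PER_MINUTE * MINUTES_PER_HOUR * HOURS_PER_WEEK * WEEKS_PER_YEAR * YEARS
)

GPT3_TOKENS = 3 * 10**11  # figure quoted in the comparison
ratio = GPT3_TOKENS / human_tokens

print(f"human lifetime reading: {human_tokens:,} tokens")
print(f"GPT-3 has seen roughly {ratio:.0f}x as many tokens")
```

So the two figures differ by about three orders of magnitude, which is where the ">1000x" claim comes from.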

However, I think this is a bad comparison, because it igno... (read more)

Your comparison does a disservice to the human's sample efficiency in two ways:  1. You're counting diverse data in the human's environment, but you're not comparing their performance on diverse tasks. Humans are obviously better than GPT-3 at interactive tasks, walking around, etc. For either kind of fair comparison (text data & task, or diverse data & task), the human has far superior sample efficiency. 2. "Fancy learning techniques" don't count as data. If the human can get mileage out of them, all the better for the human's sample efficiency. So you seem to have it backwards when you say that the comparison that everyone is making is the "bad" one.
Daniel Kokotajlo (2y):
Thanks. Hmmm. I agree with #2, and should edit to clarify. I meant "fancy learning techniques that we could also do with our AIs if we wanted," but maybe I'll just avoid that can of worms for now. For #1: We don't know how well a human-sized artificial neural net would perform if it were trained on the quantity and variety of data that humans have. We haven't done the experiment yet. However, my point is that for all we know it's entirely possible that such a neural net would perform at about human level on all the tasks humans do. The people who are saying that modern neural nets are significantly less sample-efficient than humans are committed to denying this. (Or if they aren't, then I don't know what we are arguing about anymore?) They are committed to saying that we can extrapolate from e.g. GPT-3's performance vs. training data to conclude that we'd need something trained a lot longer than a human (on similar-to-human-lifetime data) to reach human performance. One way they might run this argument is to point out that GPT-3 has already seen more text than any human ever. My reply is that if a human had seen as much text as GPT-3, and only text, nothing else, they probably would have poor performance as well, certainly on every task that wasn't a text-based task! Sorry for this oblique response to your point; if it is insufficient I can make a more direct one.
Steven Byrnes (2y):
This paper estimates that the human retina conveys visual information to the rest of the brain at 1e7 bits/second. I haven't read the paper though. It's a bit tricky to compare that to pixels anyway, because I think the retina itself does some data compression. I guess we have 6 million cones, which would be ~2M of each type, so maybe vision-at-any-given-time is ballpark comparable to the information content in a 1 megapixel color image??
Daniel Kokotajlo (2y):
OK, nice. Edited to fix.

Perhaps GPT-3 has more parameters than are needed to roughly memorize its very large training data. This would be good, since the data contains some low-quality garbage, false claims, etc. (we can think of them as 'noise'). I believe GPT-n are adding parameters faster than training data. Here's my summary of a paper that suggests this is the right move:

https://www.youtube.com/watch?v=OzGguadEHOU Microsoft guy Sebastian Bubeck talking about seemingly overparameterized neural models being necessary for learning (due to label noise?). Validation 'early stopping' of training duration or size scaling is a mistake. After you're over some initial hump that would trigger validation early stopping, overfitting is 'benign' [already known, dubbed 'double descent']. As soon as you can defeat adversarial attacks, you're probably using enough parameters. He (+intern) proves that in order to perfectly memorize the label-noised data set such that small perturbations in the noise don't change predicted output, you need a much larger parameter set than the data set (perfectly memorizing the training data set should be possible within some constant factor of its size). He predicts that ImageNet (image labeling task) could benefit from 10-100 billion parameters instead of the current sub-1-billion.

(obviously GPT-n are language models, but they can be thought of as having an output which is the masked word or the sentence-before-or-after or whatever they're using to train)

You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), it learns after seeing the data only a few times. You can see more evidence of sample efficiency going up with scale in Figure 4.1 of this paper. Sample efficiency also goes up with the amount of data already seen (pre-training).

This suggests that at some scale and some amount of pre-training, we may enter the one-shot learning regime. Then there is no need for "long-range" tricks (RNNs, CNNs, attention) anymore. Instead, one can one-shot learn by backprop while doing the predictions within a relatively short time window.

I have not finetuned GPT-3, but I have done a lot of finetuning with GPT-J 6.1B, which is similar in scale and performance to GPT-3 "Curie."

In my experience, doing more than a single epoch is always harmful when finetuning GPT-J. 

I initially thought it was beneficial on one specific dataset, but that turned out to be the exception that proves the rule.  I inspected per-token validation loss on that dataset over the course of training, and discovered that the train/val split was imperfect.  Training beyond the first epoch only helped on text ... (read more)

I cannot access your wandb, btw. It seems to be private.
Whoops, fixed.
If 4 is not simply a bad default, maybe they considered more data with a high inferential distance (foreign, non-natural/formal languages), which may require more epochs?

In general a language model will 'know' the sentence related to the single occurrence of a rare name. I don't think you learn much here if there are enough parameters available to support this memory.

The sample efficiency claim is not a formal one. For example, RL algorithms are claimed to be sample-inefficient because it only takes a human about 10 games of Pacman to get good at it, but we can't isolate this knowledge in the human brain. By the time a human learns to play Pacman they have already learned many things, like GPT-3, and we don't know which of those things contribute to playing Pacman: is it motor skills? Spatial skills? Identifying all the skills that enable a human to play Pacman in only ten games, passing them to the RL algorithm as pre-training, and then training it to play Pacman would be a fair comparison of how sample-efficient it is. The same applies to the names example: could we really measure how many times a human has heard a name, or maybe a similar name?

10 comments

My guess is that the issue of sample efficiency results from equivocation between datasets used for training a model and datasets provided externally. What is the sample efficiency of AlphaZero? It's as bad as anything else if we divide by the datasets generated by amplification, but it's infinitely large if we divide by externally provided datasets, as there are none. The sample efficiency relevant for the cost of training includes the datasets generated by amplification, but in informal comparison with human performance the estimate is about how much the humans observed externally before attaining some level of performance, hence the equivocation.

Similarly if someone figures out amplification for language models (something like debate, but actually works), it can then train on the vastly larger (and better) datasets generated by the model itself, that's only bootstrapped from the external dataset, and so its sample efficiency with respect to the external dataset is going to skyrocket (one issue is that the external dataset is already large, so it's more about quality than quantity, but alternatively this form of training might be able to bootstrap from a much smaller external dataset). So the usual measure of sample efficiency doesn't seem very informative about what's possible with exactly the same learning algorithm after the amplification loop is closed.

For context, I'm interested in questions like "If we had a big transformer that was being fine-tuned as a chatbot with millions of daily conversations, would it be up-to-date on the latest news of the day? What about local news? What about e.g. subculture drama? How often would people have to talk about something for it to be impressed in long-term memory?"

This sounds like something an appropriate amplification may well be able to help the model memorize, even for things mentioned only once, without changing the learning algorithm, at the cost of more training on the auxiliary data generated by the amplification (in this case probably with prompts from the external datasets that need to be combed for rare details).

(I do understand that the question you are asking is about what happens without auxiliary data. I'm commenting on a way accounting for prompt engineering breaks estimates of potential performance of the same learning algorithm. It then becomes an issue of cost, not limitations of the algorithm, in a way that's different from scaling laws.)

Right on. That's a good point. So really I guess the conclusion is: Compute is the bottleneck; an AI chatbot or whatever could totally learn random facts the very first time it encounters them, if you had things set up to amplify that data into some auxiliary dataset and then train on it. Costs a few orders of magnitude more compute perhaps, but gets the job done. Right? (And this could be automated & "smart" in the sense that the AI could decide what stuff to memorize/internalize, what stuff to forget, and what stuff to add to some software database.)

Right. Of course if the sample efficiency of learning improves, the cost goes down, but that's not really crucial for anything. The learning part of AGI is already essentially solved, it just needs to be put into a place where it's getting fed the right data.

"For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community" - Can you say more about this?

Oliver Habryka once told me that he uses GPT-3 to help him create invite lists for events. E.g. "The LessWrong community organized a celebration of Petrov Day. They invited all the prominent Rationalist/EA-adjacent people in the Bay area. Here is the list of people they invited: [Insert list of people he's thought of so far] [GPT-3 continues the list]" Habryka said it's helped him avoid accidentally forgetting people.

There's a typo in the title:

Is GPT-3 is...

If anyone wants to try this with the Pile, you can download a copy of the Pile here and try GPT-J (6B, which is a lot less than GPT-3's 175B) here (hosted) or through HF transformers (locally). If you run into any problems you can DM me or ask on the EleutherAI discord.