The most relevant paper I know of comes out of data-privacy concerns: see Extracting Training Data from Large Language Models, which defines a "k-eidetic memorized" string as one that can be elicited by some prompt and appears in at most k documents in the training set. They find several examples of k=1 memorization, though the strings appear repeatedly within those source documents. Unfortunately, their methodology targets high-entropy strings, so it is not universal.
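To make the counting half of that definition concrete, here's a minimal sketch (the toy corpus and string are my own illustration, not from the paper):

```python
def eidetic_k(s, documents):
    """Count the training documents containing string s.

    Under the paper's definition, s is k-eidetic memorized for this k
    if, additionally, some prompt makes the model emit s verbatim.
    """
    return sum(s in doc for doc in documents)

# Toy "training set" (illustrative only)
docs = ["the ixlthubs live in the water", "fish live in the pacific"]
print(eidetic_k("ixlthubs", docs))  # 1 -> a candidate for k=1 memorization
```

The model-side half (eliciting the string by prompting) is the hard, empirical part; the k is just document counting like this.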
I have a related question I've been trying to operationalize. How well do GPT-3's memories "generalize"? In other words, given some fact in the training data, how far out of the source distribution can GPT-3 "gain information" from that fact?
E.g. training: "Ixlthubs live in the water." Test: does this affect the predicted likelihood of "Ixlthubs live in the Pacific"? What about "Ixlthubs cannot survive on land"? I'd consider this another interesting measure of sample efficiency/generalization performance. I'm attempting to put together a proposal for the BigScience project (some set of synthetic facts to sprinkle throughout the data), but it's my first try at something like this and slow going.
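Here's a toy version of the measurement I have in mind, with a pure-Python bigram model standing in for GPT-3 (the corpus, smoothing constant, and vocabulary size are all made up for illustration): train once with and once without the synthetic fact, then compare the log-likelihood of a related test statement.

```python
import math
from collections import defaultdict

def train_bigram(corpus):
    """Count bigrams over a list of sentences (a stand-in for training)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        toks = ["<s>"] + sent.lower().split()
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    return counts

def logprob(counts, sentence, vocab_size=1000, alpha=0.1):
    """Add-alpha-smoothed log-likelihood of a sentence under the counts."""
    toks = ["<s>"] + sentence.lower().split()
    lp = 0.0
    for a, b in zip(toks, toks[1:]):
        total = sum(counts[a].values())
        lp += math.log((counts[a][b] + alpha) / (total + alpha * vocab_size))
    return lp

base = ["fish live in the water", "whales live in the pacific"]
with_fact = base + ["ixlthubs live in the water"]  # the synthetic fact

test = "ixlthubs live in the pacific"  # never appears in either corpus
delta = logprob(train_bigram(with_fact), test) - logprob(train_bigram(base), test)
print(delta > 0)  # the fact raised the likelihood of the related claim
```

For GPT-3 you'd replace the bigram model with the real model's per-token log-probabilities, but the experimental contrast (likelihood of held-out related statements, with vs. without the planted fact) is the same.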
First pass at trying to answer:
I'm asking GPT-3 questions of the form "Who is X?" to see what it knows. It knows EY, Paul, Katja, Julia, Wei Dai, Kaj Sotala... It thinks Daniel Kokotajlo is a filmmaker, which is true actually (there are two of us in the world, and the more well-known one is the filmmaker). It thinks Evan Hubinger is a software engineer.
In parallel, I'm googling those names in quotes to see how many hits they get. To my surprise, many of these names get about ten thousand hits, and the more popular ones get more. But GPT-3's training data didn't contain the whole internet, right? Just a fraction of it? So presumably it had only a thousand, or a hundred, instances of each name to learn from?
I think this becomes a lot clearer if we distinguish between total and marginal thinking. GPT-3's total sample efficiency for predicting text is poor:
But on-the-margin, it's very sample efficient at learning to perform new text-related tasks:
Essentially, what's happened is that GPT-3 is a kind of mega-analytical-engine that was very sample-inefficient to train up to its current level, but that can now be trained to do additional things at relatively little extra cost.
Does that resolve the sense of confusion/mystery, or is there more to it that I'm missing?
Perhaps GPT-3 has more parameters than are needed to roughly memorize its very large training data. This would be good, since the data contains some low-quality garbage, false claims, etc. (you can think of these as 'noise'). I believe the GPT-n series is adding parameters faster than training data. Here's my summary of a talk that suggests this is the right move:
https://www.youtube.com/watch?v=OzGguadEHOU : Microsoft's Sebastian Bubeck on why seemingly overparameterized neural models are necessary for learning (due to label noise). Validation-based 'early stopping', whether of training duration or of size scaling, is a mistake: once you're past the initial hump that would trigger it, overfitting is 'benign' (already known, dubbed 'double descent'). Once you can defeat adversarial attacks, you're probably using enough parameters. He (plus an intern) proves that in order to perfectly memorize the label-noised data set such that small perturbations of the noise don't change the predicted output, you need a parameter set much larger than the data set (merely memorizing the training data perfectly should be possible within some constant factor of its size). He predicts that ImageNet (an image-labeling task) could benefit from 10-100 billion parameters instead of the current sub-1-billion.
(Obviously the GPT-n models are language models, but they can be thought of as having an output, namely the masked word or the sentence-before-or-after or whatever they're using to train, which plays the role of a label.)
You can get an idea of a pre-trained GPT-3's sample efficiency from the GPT-3 fine-tuning API docs. The epoch parameter defaults to 4, and further up in the documentation they recommend fine-tuning with at least 500 examples for 1-2 epochs in the conditional setting (e.g. chatbots). Although training data is often repetitive (implying maybe 2-10x as many effective epochs?), it learns from seeing the data only a few times. You can see more evidence of sample efficiency going up with scale in Figure 4.1 of this paper. Sample efficiency also goes up with the amount of data already seen (pre-training).
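As a back-of-the-envelope check on that "effective epochs" guess (all numbers here are the assumed ones from the text, not measurements):

```python
# If similar passages repeat ~dup times in the fine-tuning data, then
# 2 nominal epochs act like 2*dup effective exposures per pattern.
epochs = 2                       # recommended fine-tuning epochs
for dup in (1, 2, 10):           # assumed duplication factors
    print(f"duplication x{dup}: ~{epochs * dup} effective exposures")
```

Even at the pessimistic end, ~20 exposures is tiny compared with classical from-scratch training.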
This suggests that at some scale and some amount of pre-training, we may enter the one-shot learning regime. There would then be no need for "long-range" tricks (RNNs, CNNs, attention) anymore: instead, one could one-shot learn by backprop while doing the predictions within a relatively short time window.
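A sketch of what "learning by backprop within the prediction window" could look like, using a tiny pure-Python bigram softmax model (the vocabulary, learning rate, and token stream are all invented for illustration):

```python
import math

# Online-learning sketch: instead of relying on attention to carry context,
# update the weights by SGD *while* predicting, so a fact seen once earlier
# in the stream is available later through the weights themselves.
vocab = ["ixlthubs", "live", "in", "the", "water", "fish", "pacific"]
V = len(vocab)
idx = {w: i for i, w in enumerate(vocab)}
W = [[0.0] * V for _ in range(V)]   # bigram logits: W[prev][next]

def step(prev, nxt, lr=1.0):
    """One SGD step on softmax cross-entropy for predicting nxt after prev."""
    logits = W[idx[prev]]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    probs = [e / Z for e in exps]
    loss = -math.log(probs[idx[nxt]])
    for j in range(V):              # exact gradient of the cross-entropy
        W[idx[prev]][j] -= lr * (probs[j] - (1.0 if j == idx[nxt] else 0.0))
    return loss

stream = ["ixlthubs", "live", "in", "the", "water"]
first = [step(a, b) for a, b in zip(stream, stream[1:])]   # first exposure
second = [step(a, b) for a, b in zip(stream, stream[1:])]  # one pass later
print(sum(second) < sum(first))  # a single pass already lowered the loss
```

The point is only the control flow: prediction and weight update happen in the same loop over the stream, so "memory" lives in the weights rather than in a long context window.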
In general, a language model will 'know' the sentence surrounding even a single occurrence of a rare name. I don't think you learn much here, as long as enough parameters are available to support this memory.
The sample-efficiency claim is not a formal one. For example, RL algorithms are said to be sample-inefficient because it takes a human only about 10 games of Pac-Man to get good at it, but we can't isolate that knowledge in the human brain. By the time a human learns to play Pac-Man, they have already learned many things (like GPT-3 has), and we don't know which of those contribute to playing Pac-Man. Is it motor skills? Spatial skills? A fair comparison of sample efficiency would be to identify all the skills that enable a human to learn Pac-Man in only ten games, give the RL algorithm those as pre-training, and then train it to play Pac-Man. The same applies to the names example: could we really measure how many times a human has heard a name, or a similar name?