(Concrete, easy-to-answer question below, explanation first)
Common adage: Modern deep learning techniques are sample-inefficient; it takes loads of data for them to learn things. If you pre-train them, it takes less additional data for them to learn something new, but still compared to humans it takes a lot.
--It seems like bigger neural nets need to see less data to reach the same level of performance.
--It seems like bigger neural nets need fewer epochs to reach convergence. Soon they'll only need to see each data point once. (Search this for "multiple epochs")
I feel like this take is in tension with the common adage. I wonder: If there is a fact mentioned in GPT-3's training data, how many times does it need to be mentioned before GPT-3 comes to know that fact? For example, I'm told that GPT-3 knows the names of most prominent members of the rationalist community. How many times has it seen each name? Are we talking ten times, or ten thousand?*
I'd be interested to hear people do a bit of a search for the "most sample-efficient/obscure fact" in GPT-3's repertoire. In this manner we could quantity how many times GPT-3 needs to see something before it learns it. (Maybe we don't have access to the dataset used to train GPT-3. But people at Eleuther.ai have The Pile, right? And they've trained big transformers on it? We could answer the question easily and precisely there, no?)
Or am I thinking about this all wrong somehow? This seems like an obvious idea, I wonder why I haven't heard of it before.
*Suppose it is ten thousand. Then that means one in every ten million two-word strings on the internet is "Paul Christiano." (The dataset for GPT-3 was 300B tokens) Add in all the other rationalists/EAs and probably it means one in every hundred thousand words is the name of some prominent rationalist/EA. Surely this is too much, no? It seems way too much according to Google Ngram Viewer.