Epistemic status: speculative.
First, a few paper titles:
- Pretrained Transformers as Universal Computation Engines
- Can Wikipedia Help Offline Reinforcement Learning? (Answer: yes.)
- Pretrained Transformers Improve Out-of-Distribution Robustness
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
The gist of the first three studies is that transformers (specifically) trained on natural language (specifically) generalize better than expected, with little or no fine-tuning, not only to unseen tasks but even to unseen and apparently unrelated modalities like offline reinforcement learning. The last study takes this a step further: it doesn't pretrain on language at all, but instead tries to mimic, via various sampling procedures over an image-classification dataset, the specific statistical properties of natural language that lead to this behavior.
The difference between these results and the plethora of text-to-text transformer multitask/transfer-learning results that have come out since GPT-1 is that transfer to new modalities requires learning priors general enough to apply to both text and the other modality. This implies, first of all, that such priors exist, which has updated me in the following directions:
- Well-trained transformers, regardless of task, occupy the same relatively small subspace of parameter space
- Most of the gradient-descent steps of a training run from scratch are spent just getting to this subspace; relatively few are spent learning the specific task
Taken together, these hypotheses seem to imply that within today's gigantic and notoriously data-hungry language models is a sparser, far more efficient architecture trying to get out. I don't have any idea what this architecture looks like. If I did, I wouldn't post about it here. I am quite confident that it exists, because human children manage to acquire language without ingesting the equivalent of terabytes of text. I'm even reasonably confident that it's simple, because the human genome doesn't have enough space to code for complex mental priors (also, the evidence seems to point to the neocortex being fairly uniform), and because whatever “universal grammar” pretrained transformers are learning, it has to be fundamental enough to apply to domains as unlike language as offline reinforcement learning.
Only the last of the four papers I linked above, from DeepMind, attempts to elucidate what's so special about language, and it focuses on just a few obvious statistical features of language token distributions. While several of the features they tested did improve in-context (i.e. few-shot) learning when present, the paper leaves understanding the mechanism behind this improvement for future research.
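To make "language-like statistical features" concrete: one property the DeepMind paper manipulates is the skewed, Zipfian (power-law) class distribution typical of word frequencies, as opposed to the roughly uniform class distributions of standard image-classification benchmarks. A minimal sketch of the difference, using stdlib sampling (all numbers and names here are illustrative, not taken from the paper):

```python
import random
from collections import Counter

def zipf_weights(n_classes, alpha=1.0):
    # Zipf's law: the weight of the rank-k class is proportional to 1 / k^alpha.
    return [1.0 / (k ** alpha) for k in range(1, n_classes + 1)]

random.seed(0)
n_classes = 1000
samples = random.choices(
    range(n_classes), weights=zipf_weights(n_classes), k=100_000
)
counts = Counter(samples)

# Under a Zipfian distribution, a small "head" of classes dominates while
# a long tail of classes appears only rarely -- very unlike the uniform
# distribution, where the top 10 of 1000 classes would get ~1% of samples.
top_10_share = sum(c for _, c in counts.most_common(10)) / len(samples)
print(f"share of samples in the 10 most common classes: {top_10_share:.2f}")
```

Under Zipf with alpha = 1 over 1000 classes, the top 10 classes capture roughly 40% of all samples; the paper's question is why training on this kind of skewed, long-tailed distribution promotes in-context learning where uniform distributions do not.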
The most obvious connection that I see here, among the relatively few papers I've read, is with Anthropic's work on In-context Learning and Induction Heads; it seems quite possible that induction heads are this missing mechanism linking the unique properties of language distributions with in-context learning. A direction for further research, for anyone interested, might be to try to find a theoretical link between language-like (Zipfian, non-uniform) training data distributions and the formation of induction heads.
I'll end this here, as my writing has caught up with my thinking; I'll probably write a follow-up if the discussion on this post inspires further ideas.
See also https://evjang.com/2021/10/23/generalization.html
Thanks for pointing this out!
A few corollaries and alternative conclusions to the same premises:
I'm seeking some clarification. My reading of your post is that you see the following two concepts as intertwined:
1. Parameter efficiency: storing learned information in fewer free parameters.
2. Data efficiency: requiring less training data to reach that representation.
As you point out (and I agree), transformer parameters live in a small subspace of parameter space, and the realities of human biology seem to imply that we can do #1 better, that is, use a "lighter" algorithm with fewer free parameters to store our learned information.
If I understand you correctly, you believe that this "far more efficient architecture trying to get out" would also be better at #2 (require less data to reach this efficient representation). While I agree that an algorithm to do this better must exist, it is not obvious to me that a better compressed/sparse storage format for language models would necessarily require less data to train.
So, my questions: Did I misunderstand you, and if so, where? Are there additional reasons you believe the two concepts to be correlated?
There are two ways a large language model transformer learns: type 1, the gradient-descent process, which certainly does not learn information efficiently, taking billions of examples; and type 2, the mysterious in-context learning process, where a transformer learns a 'new' task from ~5 examples in an engineered prompt. I think the fundamental question is whether type 2 only works when the task to be learned is represented in the original dataset, or whether it generalizes out of distribution. If it truly generalizes, then the obvious next step is to somehow skip straight to type 2 learning.
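The type-2 distinction above can be made concrete with a sketch. The prompt and word-reversal task below are made up for illustration; the point is that the task specification lives entirely in the prompt, and no weights are updated anywhere:

```python
# Hypothetical illustration of "type 2" (in-context) learning: the task
# is defined only by a handful of examples inside the prompt itself.
few_shot_prompt = (
    "Reverse each word.\n"
    "Input: apple -> Output: elppa\n"
    "Input: stone -> Output: enots\n"
    "Input: cloud -> Output: duolc\n"
    "Input: river -> Output:"
)

# The ground-truth completion a model would have to infer from the three
# in-context examples; gradient descent plays no role at this stage.
expected_completion = "river"[::-1]
print(expected_completion)
```

Whether a model can complete prompts like this for tasks genuinely absent from its training distribution is exactly the open question the comment raises.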
I've been thinking along similar lines. And I think that along these lines lies one possible answer to the apparent need for more data to get more effective cross-domain general models.
The Lottery Ticket Hypothesis: https://arxiv.org/abs/1803.03635