Epistemic status: speculative.

First, a few paper titles:

The gist of the first three studies is that transformers (specifically) trained on natural language (specifically) generalize better than expected, with little or no fine-tuning, not only to unseen tasks but even to unseen and apparently unrelated modalities like offline reinforcement learning. The last study takes this a step further—it doesn't actually pretrain on language at all, but instead tries to mimic the specific statistical properties of natural language that lead to this behavior with various sampling procedures from an image classification dataset.

These results differ from the plethora of text-to-text transformer multitask/transfer learning results that have come out since GPT-1. Transfer learning to a new modality requires learning priors general enough to apply to both text and the other modality, which implies, first of all, that such priors exist. That has updated me in the following directions:

  • Well-trained transformers, regardless of task, occupy the same relatively small subspace of parameter space
  • Most of the gradient-descent steps of a training run from scratch are spent just getting to this subspace; relatively few are spent learning the specific task

Taken together, these hypotheses seem to imply that within today's gigantic and notoriously data-hungry language models is a sparser, far more efficient architecture trying to get out. I don't have any idea what this architecture looks like. If I did, I wouldn't post about it here. I am quite confident that it exists, because human children manage to acquire language without ingesting the equivalent of terabytes of text. I'm even reasonably confident that it's simple, because the human genome doesn't have enough space to code for complex mental priors (also, the evidence seems to point to the neocortex being fairly uniform), and because whatever “universal grammar” pretrained transformers are learning, it has to be fundamental enough to apply to domains as unlike language as offline reinforcement learning.

Only the last paper of the four I linked above, from DeepMind, attempts to elucidate what's so special about language, and they focus only on a few obvious statistical features of language token distributions—while several features they tested did improve in-context (i.e. few-shot) learning when present, the paper leaves understanding the mechanism behind this improvement for further research.

The most obvious connection that I see here, among the relatively few papers I've read, is with Anthropic's work on In-context Learning and Induction Heads; it seems quite possible that induction heads are this missing mechanism linking the unique properties of language distributions with in-context learning. A direction for further research, for anyone interested, might be to try to find a theoretical link between language-like (Zipfian, non-uniform) training data distributions and the formation of induction heads.
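As a starting point for that direction, here is a minimal sketch of one thing such a study might measure: how often a token in a sequence has already appeared earlier in the same context, under Zipfian versus uniform sampling. The function names, the vocabulary size, the sequence length, and the use of "repeat fraction" as a proxy are all my assumptions for illustration; none of it is taken from the DeepMind or Anthropic papers.

```python
import random


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipfian weights: the token of rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def repeat_fraction(weights, seq_len, num_seqs, seed=0):
    """Fraction of positions whose token already appeared earlier in the same sequence.

    An induction-style lookup ("find the earlier copy of this token") can only
    apply at such positions, so this is a crude proxy (an assumption of this
    sketch) for how often the data gives that mechanism something to do.
    """
    rng = random.Random(seed)
    vocab = range(len(weights))
    repeats = 0
    for _ in range(num_seqs):
        seen = set()
        for token in rng.choices(vocab, weights=weights, k=seq_len):
            if token in seen:
                repeats += 1
            seen.add(token)
    return repeats / (seq_len * num_seqs)


# Arbitrary illustrative sizes, not taken from any paper.
VOCAB_SIZE, SEQ_LEN, NUM_SEQS = 1000, 100, 200

zipfian = repeat_fraction(zipf_weights(VOCAB_SIZE), SEQ_LEN, NUM_SEQS)
uniform = repeat_fraction([1.0] * VOCAB_SIZE, SEQ_LEN, NUM_SEQS)
print(f"repeat fraction, Zipfian: {zipfian:.3f}")
print(f"repeat fraction, uniform: {uniform:.3f}")
```

With these settings the Zipfian sequences should show far more in-context repetition than the uniform ones. Whether that repetition is what actually drives induction-head formation is exactly the open question.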

I'll end this here, as my writing has caught up with my thinking; I'll probably write a follow-up if the discussion on this post inspires further ideas.



Thanks for pointing this out!

A few corollaries and alternative conclusions to the same premises:

  1. There are two distinct interesting things here: a magic cross-domain property that can be learned, and an inner architecture that can learn it.
  2. There may be several small efficient architectures. The ones in human brains may not be like the ones in language models. We have plausibly found one efficient architecture; this is not much evidence about unrelated implementations.
  3. Since the learning is transferable to other domains, it's not language specific. Large language models are just where we happened to first build good enough models. You quote discussion of the special properties of natural language statistics but, by assumption, there are similar statistical properties in other domains. The more a property is specific to language, or necessary because of the special properties of language, the less it's likely to be a universal property that transfers to other domains.

I'm seeking some clarification. My reading of your post is that you see the following concepts as intertwined:

  1. Efficient representation of learned information
  2. Efficient learning of information

As you point out (and I agree), transformer parameters live in a small space, and the realities of human biology seem to imply that we can do #1 better: that is, use a "lighter" algorithm with fewer free parameters to store our learned information.

If I understand you correctly, you believe that this "far more efficient architecture trying to get out" would also be better at #2 (require less data to reach this efficient representation). While I agree that an algorithm to do this better must exist, it is not obvious to me that a better compressed/sparse storage format for language models would necessarily require less data to train. 

So, questions: Did I misunderstand you, and if so, where? Are there additional reasons you believe the two concepts to be correlated?

There are two ways a large language model transformer learns. Type 1 is the gradient descent process, which certainly does not learn information efficiently, taking billions of examples. Type 2 is the mysterious in-episode learning process, where a transformer learns to do a 'new' task from ~5 examples in an engineered prompt. I think the fundamental question is whether type 2 only works if the task to be learned is represented in the original dataset, or if it generalizes out of distribution. If it truly generalizes, then the obvious next step is to somehow skip straight to type 2 learning.
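To make "type 2" concrete, here is a minimal sketch of what an engineered few-shot prompt might look like. The helper name, the "Input:"/"Output:" labels, and the word-reversal task are all made up for illustration; no particular model or API is assumed, and the sketch only builds the text that would be fed to one.

```python
def build_few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Format (input, output) demonstration pairs followed by an unanswered query."""
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in examples]
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


# Five demonstrations of a hypothetical "reverse the word" task.
examples = [
    ("cat", "tac"),
    ("tree", "eert"),
    ("lamp", "pmal"),
    ("river", "revir"),
    ("cloud", "duolc"),
]
prompt = build_few_shot_prompt(examples, "stone")
print(prompt)
```

No weights change here: everything the model "learns" about the task has to come from those five pairs within a single forward pass over the prompt.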

I've been thinking along similar lines. And I think that along these lines lies one possible answer to the apparent need for more data to get more effectively cross-domain general models.

The Lottery Ticket Hypothesis: https://arxiv.org/abs/1803.03635
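For readers who haven't seen it, the paper's claim is that a dense, randomly initialized network contains a sparse subnetwork (a "winning ticket") that can be trained in isolation to comparable accuracy. Below is a rough sketch of one round of the prune-and-reset step used to find such subnetworks: drop the smallest-magnitude trained weights, then reset the survivors to their original initial values. The function names and the weight values are made up for illustration, and a real experiment would do this per layer with actual training between rounds.

```python
def magnitude_mask(weights, prune_fraction):
    """Return a 0/1 mask that drops the smallest-magnitude fraction of weights."""
    num_pruned = int(len(weights) * prune_fraction)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def reset_to_init(initial_weights, mask):
    """Keep surviving weights at their original initial values; zero out the rest."""
    return [w0 * m for w0, m in zip(initial_weights, mask)]


initial = [0.5, -0.1, 0.3, 0.05, -0.8, 0.2]   # made-up initialization
trained = [0.9, -0.02, 0.4, 0.01, -1.2, 0.1]  # made-up values after training

mask = magnitude_mask(trained, prune_fraction=0.5)
ticket = reset_to_init(initial, mask)
print(mask)
print(ticket)
```

The mask is chosen from the trained weights, but the candidate subnetwork restarts from the initial ones; that reset is what distinguishes this from ordinary post-training pruning.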
