Epistemic status: speculative.
First, a few paper titles:
- Pretrained Transformers as Universal Computation Engines
- Can Wikipedia Help Offline Reinforcement Learning? (Answer: yes.)
- Pretrained Transformers Improve Out-of-Distribution Robustness
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
The gist of the first three studies is that transformers (specifically) trained on natural language (specifically) generalize better than expected, with little or no fine-tuning, not only to unseen tasks but even to unseen and apparently unrelated modalities like offline reinforcement learning. The last study takes this a step further: it doesn't pretrain on language at all, but instead tries to mimic, via various sampling procedures over an image-classification dataset, the specific statistical properties of natural language that lead to this behavior.
The difference between these results and the plethora of text-to-text transformer multitask/transfer-learning results that have come out since GPT-1 is that transfer to new modalities requires learning priors general enough to apply to both text and the other modality. This implies, first of all, that such priors exist, which has updated me in the following directions:
- Well-trained transformers, regardless of task, occupy the same relatively small subspace of parameter space
- Most of the gradient-descent steps of a training run from scratch are spent just getting to this subspace; relatively few are spent learning the specific task
Taken together, these hypotheses seem to imply that within today's gigantic and notoriously data-hungry language models is a sparser, far more efficient architecture trying to get out. I don't have any idea what this architecture looks like. If I did, I wouldn't post about it here. I am quite confident that it exists, because human children manage to acquire language without ingesting the equivalent of terabytes of text. I'm even reasonably confident that it's simple, because the human genome doesn't have enough space to code for complex mental priors (also, the evidence seems to point to the neocortex being fairly uniform), and because whatever “universal grammar” pretrained transformers are learning, it has to be fundamental enough to apply to domains as unlike language as offline reinforcement learning.
Only the last of the four papers I linked above, from DeepMind, attempts to elucidate what's so special about language, and it focuses on just a few obvious statistical features of language token distributions. While several of the features they tested did improve in-context (i.e. few-shot) learning when present, the paper leaves understanding the mechanism behind this improvement for future research.
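To make "language-like statistical features" concrete: one property the DeepMind paper manipulates is the skewed, Zipfian (power-law) class distribution typical of word frequencies, as opposed to the roughly uniform class distributions of standard image-classification benchmarks. A minimal sketch of the difference, using stdlib sampling (all numbers and names here are illustrative, not taken from the paper):

```python
import random
from collections import Counter

def zipf_weights(n_classes, alpha=1.0):
    # Zipf's law: the weight of the rank-k class is proportional to 1 / k^alpha.
    return [1.0 / (k ** alpha) for k in range(1, n_classes + 1)]

random.seed(0)
n_classes = 1000
samples = random.choices(
    range(n_classes), weights=zipf_weights(n_classes), k=100_000
)
counts = Counter(samples)

# Under a Zipfian distribution, a small "head" of classes dominates while
# a long tail of classes appears only rarely -- very unlike the uniform
# distribution, where the top 10 of 1000 classes would get ~1% of samples.
top_10_share = sum(c for _, c in counts.most_common(10)) / len(samples)
print(f"share of samples in the 10 most common classes: {top_10_share:.2f}")
```

Under Zipf with alpha = 1 over 1000 classes, the top 10 classes capture roughly 40% of all samples; the paper's question is why training on this kind of skewed, long-tailed distribution promotes in-context learning where uniform distributions do not.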
The most obvious connection that I see here, among the relatively few papers I've read, is with Anthropic's work on In-context Learning and Induction Heads; it seems quite possible that induction heads are this missing mechanism linking the unique properties of language distributions with in-context learning. A direction for further research, for anyone interested, might be to try to find a theoretical link between language-like (Zipfian, non-uniform) training data distributions and the formation of induction heads.
I'll end this here, as my writing has caught up with my thinking; I'll probably write a follow-up if the discussion on this post inspires further ideas.
See also https://evjang.com/2021/10/23/generalization.html
Thanks for pointing this out!
A few corollaries and alternative conclusions to the same premises:
I'm seeking some clarification. My reading of your post is that you see the following two concepts as intertwined:
1. Parameter efficiency: storing learned information in fewer free parameters.
2. Data efficiency: requiring less training data to reach that representation.
As you point out (and I agree), transformer parameters live in a small subspace of parameter space, and the realities of human biology seem to imply that we can do #1 better, that is, use a "lighter" algorithm with fewer free parameters to store our learned information.
If I understand you correctly, you believe that this "far more efficient architecture trying to get out" would also be better at #2 (require less data to reach this efficient representation). While I agree that an algorithm to do this better must exist, it is not obvious to me that a better compressed/sparse storage format for language models would necessarily require less data to train.
So, my questions: Did I misunderstand you, and if so, where? Are there additional reasons you believe the two concepts to be correlated?
There are two ways a large language model transformer learns: type 1, the gradient-descent process, which certainly does not learn information efficiently, taking billions of examples; and type 2, the mysterious in-context learning process, where a transformer learns a 'new' task from ~5 examples in an engineered prompt. I think the fundamental question is whether type 2 only works when the task to be learned is represented in the original dataset, or whether it generalizes out of distribution. If it truly generalizes, then the obvious next step is to somehow skip straight to type 2 learning.
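The type-2 distinction above can be made concrete with a sketch. The prompt and word-reversal task below are made up for illustration; the point is that the task specification lives entirely in the prompt, and no weights are updated anywhere:

```python
# Hypothetical illustration of "type 2" (in-context) learning: the task
# is defined only by a handful of examples inside the prompt itself.
few_shot_prompt = (
    "Reverse each word.\n"
    "Input: apple -> Output: elppa\n"
    "Input: stone -> Output: enots\n"
    "Input: cloud -> Output: duolc\n"
    "Input: river -> Output:"
)

# The ground-truth completion a model would have to infer from the three
# in-context examples; gradient descent plays no role at this stage.
expected_completion = "river"[::-1]
print(expected_completion)
```

Whether a model can complete prompts like this for tasks genuinely absent from its training distribution is exactly the open question the comment raises.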
I've been thinking along similar lines. And I think that along these lines lies one possible answer to the apparent need for more data to get more effective cross-domain general models.
The Lottery Ticket Hypothesis: https://arxiv.org/abs/1803.03635