(The title of this post is a joking homage to one of Gary Marcus’ papers.)
I’ve discussed GPT-2 and BERT and other instances of the Transformer architecture a lot on this blog. As you can probably tell, I find them very interesting and exciting. But not everyone has the reaction I do, including some people who I think ought to have that reaction.
Whatever else GPT-2 and friends may or may not be, I think they are clearly a source of fascinating and novel scientific evidence about language and the mind. That much, I think, should be uncontroversial. But it isn’t.
When I was a teenager, I went through a period where I was very interested in cognitive psychology and psycholinguistics. I first got interested via Steven Pinker’s popular books – this was back when Pinker was mostly famous for writing about psychology rather than history and culture – and proceeded to read other, more academic books by authors like Gary Marcus, Jerry Fodor, and John Anderson.
At this time (roughly 2002-6), there was nothing out there that remotely resembled GPT-2. Although there were apparently quite mature and complete formal theories of morphology and syntax, which could accurately answer questions like “is this a well-formed English sentence?”, no one really knew how these could or should be implemented in a physical system meant to understand or produce language.
This was true in two ways. For one thing, no one knew how the human brain implemented this stuff, although apparently it did. But the difficulty was more severe than that: even if you forgot about the brain, and just tried to write a computer program (any computer program) that understood or produced language, the results would be dismal.
At the time, such programs were either specialized academic models of one specific phenomenon – for example, a program that could form the past tense of a verb, but couldn’t do anything else – or they were ostensibly general-purpose but incredibly brittle and error-prone, little more than amusing toys. The latter category included some programs intended as mere amusements or provocations, like the various chatterbots (still about as good/bad as ELIZA after four decades), but also more serious efforts whose reach exceeded their grasp. SYSTRAN spent decades manually curating millions of morphosyntactic and semantic facts for enterprise-grade machine translation; you may remember the results in the form of the good old Babel Fish website, infamous for its hilariously inept translations.
This was all kind of surprising, given that the mature formal theories were right there, ready to be programmed into rule-following machines. What was going on?
The impression I came away with, reading about this stuff as a teenager, was of language as a fascinating and daunting enigma, simultaneously rule-based and rife with endless special cases that stacked upon one another. It was formalism, Jim, but not as we knew it; it was a magic interleaving of regular and irregular phenomena, arising out of the distinctive computational properties of some not-yet-understood subset of brain architecture, which the models of academics and hackers could crudely imitate but not really grok. We did not have the right “language” to talk about language the way our own brains did, internally.
The books I read, back then, talked a lot about this thing called “connectionism.”
This used to be a big academic debate, with people arguing for and against “connectionism.” You don’t hear that term much these days, because the debate has been replaced by a superficially similar but actually very different debate over “deep learning,” in which what used to be good arguments about “connectionism” are repeated in cruder form as bad arguments about “deep learning.”
But I’m getting ahead of myself. What was the old debate about?
As you may know, the pioneers of deep learning had been pioneering it for many years before it went mainstream. What we now call “neural nets” were invented step by step a very long time ago, and very early and primitive neural nets were promoted with far too much zeal as long ago as the 60s.
First there was the “Perceptron,” a single-layer fully-connected network with an update rule that didn’t scale to more layers. It generated a lot of unjustified hype, and was then “refuted” in inimitable petty-academic fashion by Minsky and Papert’s book Perceptrons, a mathematically over-elaborate expression of the simple and obvious fact that no single-layer net can express XOR. (Because no linear classifier can! Duh!)
Then the neural net people came back, armed with “hidden layers” (read: “more than one layer”) trained by “backpropagation” (read: “efficient gradient descent”). These had much greater expressive power, and amounted to a form of nonlinear regression which could learn fairly arbitrary function classes from data.
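(For anyone who hasn’t seen the XOR point worked out, it fits in a few lines of Python. The sketch below is mine – the weights and the training loop are illustrative, not anything from Perceptrons: the single-layer update rule churns forever without ever getting all four XOR cases right, while one hidden layer with hand-picked weights does it trivially.)

```python
def step(x):
    return 1 if x > 0 else 0

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Single-layer perceptron with the classic update rule. XOR is not
# linearly separable, so there is no epoch in which all four points
# come out right -- if there were, the (then-unchanging) weights would
# be a linear separator, which cannot exist. The weights just oscillate.
w1 = w2 = b = 0.0
ever_perfect = False
for epoch in range(1000):
    correct = 0
    for (x1, x2), y in XOR.items():
        pred = step(w1 * x1 + w2 * x2 + b)
        err = y - pred
        w1 += err * x1
        w2 += err * x2
        b += err
        correct += (pred == y)
    ever_perfect = ever_perfect or (correct == 4)
print(ever_perfect)  # False

# One hidden layer, hand-wired: h1 computes OR, h2 computes AND, and
# the output unit computes "OR and not AND" -- i.e., XOR.
def two_layer_xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)   # OR
    h2 = step(x1 + x2 - 1.5)   # AND
    return step(h1 - h2 - 0.5)

print(all(two_layer_xor(*k) == v for k, v in XOR.items()))  # True
```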
Some people in psychology became interested in using them as a model for human learning. AFAIK this was simply because nonlinear regression kind of looks like learning (it is now called “machine learning”), and because of the very loose but much-discussed resemblance between these models and the layered architecture of real cortical neurons. The use of neural nets as modeling tools in psychology became known as “connectionism.”
Why was there a debate over connectionism? To opine: because the neural nets of the time (80s to early 90s) really sucked. Weight-sharing architectures like CNNs and LSTMs hadn’t been invented yet; everything was either a fully-connected net or a custom architecture suspiciously jerry-rigged to make the right choices on some specialized task. And these things were being used to model highly regular, rule-governed phenomena, like verb inflection – cases where, even when human children make some initial mistakes, those mistakes themselves have a regular structure.
The connectionist models typically failed to reproduce this structure; where human kids typically err by applying a generic rule to an exceptional case (“I made you a cookie, but I eated it” – a cute meme because an authentically childlike one), the models would err by producing inhuman “blends,” recognizing the exception yet applying the rule anyway (“I ated it”).
There were already good models of correct verb inflection, and generally of correct versions of all these behaviors. Namely, the formal rule systems I referred to earlier. What these systems lacked (by themselves) was a model of learning, of rule-system acquisition. The connectionist models purported to provide this – but they didn’t work.
In 2001, a former grad student of Pinker’s named Gary Marcus wrote an interesting book called The Algebraic Mind: Integrating Connectionism and Cognitive Science. As a teenager, I read it with enthusiasm.
Here is a gloss of Marcus’ position as of this book. Quote-formatted to separate it from the main text, but it’s my writing, not a quote:
The best existing models of many psychological phenomena are formal symbolic ones. They look like math or like computer programs. For instance, they involve general rules containing variables, little “X”s that stand in identically for every single member of some broad domain. (Regular verb inflection takes any X and tacks “-ed” on the end. As Marcus observes, we can do this on the fly with novel words, as when someone talks of a politician who has “out-Gorbacheved Gorbachev.”)
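(To make the “rule with a variable” idea concrete: here is a toy sketch of my own – not Marcus’ actual formalism – of the symbolic “rule-and-memory” picture, with stored irregular forms plus one algebraic rule that suffixes “-ed” onto any stem X, including stems it has never seen.)

```python
# A toy "rule-and-memory" past-tense model (my own illustrative sketch).
# Irregular pasts are stored memories; everything else goes through one
# algebraic rule that applies to any stem X whatsoever.

IRREGULARS = {"eat": "ate", "go": "went", "break": "broke", "sing": "sang"}

def past_tense(verb, memory_ok=True):
    if memory_ok and verb in IRREGULARS:
        return IRREGULARS[verb]      # memory lookup wins when it succeeds
    return verb + "ed"               # the rule: copy X, suffix "-ed"

print(past_tense("walk"))            # walked
print(past_tense("eat"))             # ate
print(past_tense("blick"))           # blicked -- a novel stem is no problem
print(past_tense("out-Gorbachev"))   # out-Gorbacheved

# The characteristic child error falls out naturally: when memory
# retrieval fails, the rule applies to the bare stem, giving "eated" --
# never the inhuman blend "ated" (the rule applied to the retrieved
# irregular form), which is what the early connectionist nets produced.
print(past_tense("eat", memory_ok=False))   # eated
```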
The connectionism debate has conflated at least two questions: “does the brain implement formal symbol-manipulation?” and “does the brain work something like a ‘neural net’ model?” The assumption has been that neural nets don’t manipulate symbols, so if one answer is “yes” the other must be “no.” But the assumption is false: some neural nets really do implement (approximate) symbol manipulation.
This includes some, but not all, of the popular “connectionist” models, despite the fact that any “connectionist” success tends to be viewed as a strike against symbol manipulation. Moreover (Marcus argues), the connectionist nets that succeed as psychological models are the ones that implement symbol manipulation. So the evidence is actually convergent: the best models manipulate symbols, including the best neural net models.
Assuming the brain does do symbol manipulation, as the evidence suggests, what remains to be answered is how it does it. Formal rules are natural to represent in a centralized architecture like a Turing machine; how might they be encoded in a distributed architecture like a brain? And how might these complex mechanisms be reliably built, given only the limited information content of the genome?
To answer these questions, we’ll need models that look sort of like neural nets, in that they use massively parallel arrays of small units with limited central control, and build themselves to do computations no one has explicitly “written out.”
But, to do the job, these models can’t be the dumb generic putty of a fully-connected neural net trained with gradient descent. (Marcus correctly observes that those models can’t generalize across unseen input and output nodes, and thus require innate knowledge to be sneakily baked into the input/output representations.) They need special pre-built wiring of some sort, and the proper task of neural net models in psychology is to say what this wiring might look like. (Marcus proposes, e.g., an architecture called “treelets” for recursive representations. Remember this was before the popular adoption of CNNs, LSTMs, etc., so this was as much a point presaging modern deep learning as a point against modern deep learning; indeed I can find no sensible way to read it as the latter at all.)
Now, this was all very sensible and interesting, back in the early 2000s. It still is. I agree with it.
What has happened since the early 2000s? Among other things: an explosion of new neural net architectures with more innate structure than the old “connectionist” models. CNNs, LSTMs, recursive networks, memory networks, pointer networks, attention, transformers. Basically all of these advances were made to solve the sorts of problems Marcus was interested in, back in 2001 – to wire up networks so they could natively encode the right kinds of abstractions for human-like generalization, before they saw any data at all. And they’ve been immensely successful!
What’s more, the successes have patterns. The success of GPT-2 and BERT was not a matter of plugging more and more data into fundamentally dumb putty. (I mean, it involved huge amounts of data, but so does human childhood.) The transformer architecture was a real representational advance: suddenly, by switching from one sort of wiring to another sort of wiring, the wired-up machines did way better at language.
Perhaps – as the Gary Marcus of 2001 said – when we look at which neural net architectures succeed in imitating human behavior, we can learn something about how the human brain actually works.
Back in 2001, when neural nets struggled to model even simple linguistic phenomena in isolation, Marcus surveyed 21 (!) such networks intended as models of the English past tense. Here is part of his concluding discussion:
The past tense question originally became popular in 1986 when Rumelhart and McClelland (1986a) asked whether we really have mental rules. Unfortunately, as the proper account of the past tense has become increasingly discussed, Rumelhart and McClelland’s straightforward question has become twice corrupted. Their original question was “Does the mind have rules in anything more than a descriptive sense?” From there, the question shifted to the less insightful “Are there two processes or one?” and finally to the very uninformative “Can we build a connectionist model of the past tense?” The “two processes or one?” question is less insightful because the nature of processes—not the sheer number of processes—is important. […] The sheer number tells us little, and it distracts attention from Rumelhart and McClelland’s original question of whether (algebraic) rules are implicated in cognition.
The “Can we build a connectionist model of the past tense?” question is even worse, for it entirely ignores the underlying question about the status of mental rules. The implicit premise is something like “If we can build an empirically adequate connectionist model of the past tense, we won’t need rules.” But as we have seen, this premise is false: many connectionist models implement rules, sometimes inadvertently. […]
The right question is not “Can any connectionist model capture the facts of inflection?” but rather “What design features must a connectionist model that captures the facts of inflection incorporate?” If we take what the models are telling us seriously, what we see is that those connectionist models that come close to implementing the rule-and-memory model far outperform their more radical cousins. For now, as summarized in table 3.4, it appears that the closer the past tense models come to recapitulating the architecture of the symbolic models – by incorporating the capacity to instantiate variables with instances and to manipulate (here, “copy” and “suffix”) the instances of those variables – the better they perform.
Connectionist models can tell us a great deal about cognitive architecture but only if we carefully examine the differences between models. It is not enough to say that some connectionist model will be able to handle the task. Instead, we must ask what architectural properties are required. What we have seen is that models that include machinery for operations over variables succeed and that models that attempt to make do without such machinery do not.
Now, okay, there is no direct comparison between these models and GPT-2 / BERT. For these models were meant as fine-grained accounts of one specific phenomenon, and what mattered most was how they handled edge cases, even which errors they made when they did err.
By contrast, the popular transformer models are primarily impressive as models of typical-case competence: they sure look like they are following the rules in many realistic cases, but it is less clear whether their edge behavior and their generalizations to very uncommon situations extend the rules in the characteristic ways we do.
And yet. And yet …
In 2001, in the era of my teenage psycho-cognitive-linguistics phase, computers couldn’t do syntax, much less semantics, much less style, tone, social nuance, dialect. Immense effort was poured into simulating comparatively trivial cases like the English past tense in isolation, or making massive brittle systems like Babel Fish, thousands of hours of expert curation leading up to gibberish that gave me a good laugh in 5th grade.
GPT-2 does syntax. I mean, it really does it. It is competent.
A conventionally trained psycholinguist might quibble, asking things like “does it pass the wug test?” I’ve tried it, and the results are … kind of equivocal. So maybe GPT-2 doesn’t respond to probes of edge case behavior the way human children do.
But if so, then so much the worse for the wug test. Or rather: if so, we have learned something about which kinds of linguistic competence are possible in isolation, without some others.
What does GPT-2 do? It fucking writes. Short pithy sentences, long flowing beautiful sentences, everything in between – and almost always well-formed, nouns and verbs agreeing, irregulars correctly inflected, big compositional stacks of clauses lining up just the way they’re supposed to. Gary Marcus was right: you can’t do this with a vanilla fully-connected net, or even with one of many more sophisticated architectures. You need the right architecture. You need, maybe, just maybe, an architecture that can tell us a thing or two about the human brain.
GPT-2 fucking writes. Syntax, yes, and style: it knows the way sentences bob and weave, the special rhythms of many kinds of good prose and of many kinds of distinctively bad prose. Idioms, colloquialisms, self-consistent little worlds of language.
I think maybe the full effect is muted by those services people use that just let you type a prompt and get a continuation back from the base GPT-2 model; with those you’re asking a question that is fundamentally ill-posed (“what is the correct way to finish this paragraph?” – there isn’t one, of course). What’s more impressive to me is fine-tuning on specific texts in conjunction with unconditional generation, pushing the model in the direction of a specific kind of writing and then letting the model work freestyle.
One day I fed in some Vladimir Nabokov ebooks on a whim, and when I came back from work the damn thing was writing stuff that would be good coming from the real Nabokov. In another project, I elicited spookily good, often hilarious and/or beautiful imitations of a certain notorious blogger (curated selections here). More recently I’ve gotten more ambitious, and have used some encoding tricks together with fine-tuning to interactively simulate myself. Speaking as, well, a sort of expert on what I sound like, I can tell you that – in scientific parlance – the results have been trippy as hell.
Look, I know I’m deviating from structured academic point-making into fuzzy emotive goopiness, but … I like words, I like reading and writing, and when I look at this thing, I recognize something.
These machines can do scores of different things that, individually, looked like fundamental challenges in 2001. They don’t always do them “the right way,” by the canons of psycholinguistics; in edge cases they might zig where a human child would zag. But they do things the right way by the canon of me, according to the linguistic competence of a human adult with properly functioning language circuits in his cortex.
What does it mean for psycholinguistics, that a machine exists which can write but not wug, which can run but not walk? It means a whole lot. It means it is possible to run without being able to walk. If the canons of psycholinguistics say this is impossible, so much the worse for them, and so much the better for our understanding of the human brain.
Does the distinctive, oddly simple structure of the transformer bear some functional similarity to the circuit design of, I don’t know, Broca’s area? I have tried, with my great ignorance of actual neurobiology, to look into this question, and I have not had much success.
But if there’s anyone out there less ignorant than me who agrees with the Gary Marcus of 2001, this question should be burning in their mind. PhDs should be done on this. Careers should be made from the question: what do the latest neural nets teach us, not about “AI,” but about the human brain? We are sitting on a trove of psycholinguistic evidence so wonderful and distinctive, we didn’t even imagine it as a possibility, back in the early 2000s.
This is wonderful! This is the food that will feed revolutions in your field! What are you doing with it?
The answer to that question is the real reason this essay exists, and the reason it takes such an oddly irritable tone.
Here is Gary Marcus in 2001:
When I was searching for graduate programs, I attended a brilliant lecture by Steven Pinker in which he compared PDP [i.e. connectionist -nostalgebraist] and symbol-manipulation accounts of the inflection of the English past tense. The lecture convinced me that I needed to work with Pinker at MIT. Soon after I arrived, Pinker and I began collaborating on a study of children’s over-regularization errors (breaked, eated, and the like). Infected by Pinker’s enthusiasm, the minutiae of English irregular verbs came to pervade my every thought.
Among other things, the results we found argued against a particular kind of neural network model. As I began giving lectures on our results, I discovered a communication problem. No matter what I said, people would take me as arguing against all forms of connectionism. No matter how much I stressed the fact that other, more sophisticated kinds of network models [! -nostalgebraist] were left untouched by our research, people always seem to come away thinking, “Marcus is an anti-connectionist.”
But I am not an anti-connectionist; I am opposed only to a particular subset of the possible connectionist models. The problem is that the term connectionism has become synonymous with a single kind of network model, a kind of empiricist model with very little innate structure, a type of model that uses a learning algorithm known as back-propagation. These are not the only kinds of connectionist models that could be built; indeed, they are not even the only kinds of connectionist models that are being built, but because they are so radical, they continue to attract most of the attention.
A major goal of this book is to convince you, the reader, that the type of network that gets so much attention occupies just a small corner in a vast space of possible network models. I suggest that adequate models of cognition most likely lie in a different, less explored part of the space of possible models. Whether or not you agree with my specific proposals, I hope that you will at least see the value of exploring a broader range of possible models. Connectionism need not just be about backpropagation and empiricism. Taken more broadly, it could well help us answer the twin questions of what the mind’s basic building blocks are and how those building blocks can be implemented in the brain.
What is Gary Marcus doing in 2019? He has become a polemicist against “deep learning.” He has engaged in long-running wars of words, on Facebook and Twitter and the debate circuit, with a number of “deep learning” pioneers, most notably Yann LeCun – the inventor of the CNN, one of the first big breakthroughs in adding innate structure to move beyond the generalization limits of the bad “connectionist”-style models.
Here is Gary Marcus in September 2019, taking aim at GPT-2 specifically, after citing a specific continuation-from-prompt that flouted common sense:
Current AI systems are largely powered by a statistical technique called deep learning, and deep learning is very effective at learning correlations, such as correlations between images or sounds and labels. But deep learning struggles when it comes to understanding how objects like sentences relate to their parts (like words and phrases).
Why? It’s missing what linguists call compositionality: a way of constructing the meaning of a complex sentence from the meaning of its parts. For example, in the sentence “The moon is 240,000 miles from the Earth,” the word moon means one specific astronomical object, Earth means another, mile means a unit of distance, 240,000 means a number, and then, by virtue of the way that phrases and sentences work compositionally in English, 240,000 miles means a particular length, and the sentence “The moon is 240,000 miles from the Earth” asserts that the distance between the two heavenly bodies is that particular length.
Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure. It can learn that dogs have tails and legs, but it doesn’t know how they relate to the life cycle of a dog. Deep learning doesn’t recognize a dog as an animal composed of parts like a head, a tail, and four legs, or even what an animal is, let alone what a head is, and how the concept of head varies across frogs, dogs, and people, different in details yet bearing a common relation to bodies. Nor does deep learning recognize that a sentence like “The moon is 240,000 miles from the Earth” contains phrases that refer to two heavenly bodies and a length.
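(For readers outside linguistics: compositionality in the textbook sense is easy to state as a program. The toy below is my own illustration, not anything from Marcus’ piece – a hand-built fragment in which the denotation of each phrase is computed from the denotations of its parts, using the quote’s own example sentence.)

```python
# Toy compositional semantics for "The moon is 240,000 miles from the
# Earth" (my illustration; the representations are made up).

# Meanings of the atomic parts.
LEXICON = {
    "moon": "MOON",        # one specific astronomical object
    "Earth": "EARTH",      # another one
    "240,000": 240_000,    # a number
    "miles": "MILE",       # a unit of distance
}

def measure_phrase(num_word, unit_word):
    # [NUM UNIT] denotes a particular length, built from number + unit.
    return ("LENGTH", LEXICON[num_word], LEXICON[unit_word])

def distance_sentence(subj_word, num_word, unit_word, obj_word):
    # ["X is NUM UNIT from Y"] asserts that the distance between the
    # denotations of X and Y is the denotation of [NUM UNIT].
    return ("DISTANCE-IS", LEXICON[subj_word], LEXICON[obj_word],
            measure_phrase(num_word, unit_word))

meaning = distance_sentence("moon", "240,000", "miles", "Earth")
print(meaning)
# ('DISTANCE-IS', 'MOON', 'EARTH', ('LENGTH', 240000, 'MILE'))
```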
“Surprisingly, deep learning doesn’t really have any direct way of handling compositionality.” But the whole point of The Algebraic Mind was that it doesn’t matter whether something implements a symbol-manipulating process transparently or opaquely, directly or indirectly – it just matters whether or not it implements it, full stop.
GPT-2 can fucking write. (BTW, since we’ve touched on the topic of linguistic nuance, I claim the expletive is crucial to my meaning: it’s one thing to merely put some rule-compliant words down on a page and another to fucking write, if you get my drift, and GPT-2 does both.)
This should count as a large quantity of evidence in favor of the claim that, whatever necessary conditions there are for the ability to fucking write, they are in fact satisfied by GPT-2’s architecture. If compositionality is necessary, then this sort of “deep learning” implements compositionality, even if this fact is not superficially obvious from its structure. (The last clause should go without saying to a reader of The Algebraic Mind, but apparently needs explicit spelling out in 2019.)
On the other hand, if “deep learning” cannot do compositionality, then compositionality is not necessary to fucking write. Now, perhaps that just means you can run without walking. Perhaps GPT-2 is a bizarre blind alley passing through an extremely virtuosic kind of simulated competence that will, despite appearances, never quite lead into real competence.
But even this would be an important discovery – the discovery that huge swaths of what we consider most essential about language can be done “non-linguistically.” For every easy test that children pass and GPT-2 fails, there are hard tests GPT-2 passes which the scholars of 2001 would have thought far beyond the reach of any near-future machine. If this is the conclusion we’re drawing, it would imply a kind of paranoia about true linguistic ability, an insistence that one can do so much of it so well, can learn to write spookily like Nabokov (or like me) given 12 books and 6 hours to chew on them … and yet still not be “the real thing,” not even a little bit. It would imply that there are language-like behaviors out there in logical space which aren’t language and which are nonetheless so much like it, non-trivially, beautifully, spine-chillingly like it.
There is no reading of the situation I can contrive in which we do not learn at least one very important thing about language and the mind.
Who cares about “language and the mind” anymore, in 2019?
I did, as a teenager in the 2000s. Gary Marcus and Steven Pinker did, back then. And I still do, even though – in a characteristically 2019 turn-of-the-tables – I am supposed to be something like an “AI researcher,” and not a psychologist or linguist.
What are the scholars of language and the mind talking about these days? They are talking about AI. They are saying GPT-2 isn’t the “right path” to AI, because it has so many gaps, because it doesn’t look like what they imagined the nice, step-by-step, symbol-manipulating, human-childhood-imitating path to AI would look like.
GPT-2 doesn’t know anything. It doesn’t know that words have referents. It has no common sense, no intuitive physics or psychology or causal modeling, apart from the simulations of these things cheap enough to build inside of a word-prediction engine that has never seen or heard a dog, only the letters d-o-g (and c-a-n-i-n-e, and R-o-t-t-w-e-i-l-e-r, and so forth).
And yet it can fucking write.
The scholars of language and the mind say: “this isn’t ‘the path to AI’. Why, it doesn’t know anything! It runs before it can walk. It reads without talking, speaks without hearing, opines about Obama without ever having gurgled at the mobile posed over its crib. Don’t trust the hype machine. This isn’t ‘intelligence.’”
And I, an “AI researcher,” say: “look, I don’t care about AI. The thing can fucking write and yet it doesn’t know anything! We have a model for like 100 different complex linguistic behaviors, at once, integrated correctly and with gusto, and apparently you can do all that without actually knowing anything or having a world-model, as long as you have this one special kind of computational architecture. Like, holy shit! Stop the presses at MIT Press! We have just learned something incredibly cool about language and the mind, and someone should study it!”
And the scholars of language and the mind go off and debate Yann LeCun and Yoshua Bengio on the topic of whether “deep learning” is enough without incorporating components that look explicitly “symbolic.” Back in 2001, Marcus (correctly) argued that the bad, primitive connectionist architectures of the time often did manipulate symbols, sometimes without their creators realizing it. Now the successors of the “connectionist” models, having experimented with innate structure just like Marcus said they should, can do things no one in 2001 even dreamed of … and somehow, absurdly, we’ve forgotten the insight that a model can be symbolic without looking symbolic. We’ve gone from attributing symbol-manipulation powers to vanilla empiricist models that sucked, to denying those powers to much more nativist models that can fucking write.
What happened? Where did the psycholinguists go, and how can I get them back?
Here is Steven Pinker in 2019, explaining why he is unimpressed with GPT-2’s “superficially plausible gobbledygook”:
Being amnesic for how it began a phrase or sentence, it won’t consistently complete it with the necessary agreement and concord – to say nothing of semantic coherence. And this reveals the second problem: real language does not consist of a running monologue that sounds sort of like English. It’s a way of expressing ideas, a mapping from meaning to sound or text. To put it crudely, speaking or writing is a box whose input is a meaning plus a communicative intent, and whose output is a string of words; comprehension is a box with the opposite information flow.
“Real language does not consist of a running monologue that sounds sort of like English.” Excuse me? Does the English past tense not matter anymore? Is morphosyntax nothing? Style, tone, nuances of diction, tics of punctuation? Have you just given up on studying language qua language the way Chomsky did, just conceded that whole thing to the evil “deep learning” people without saying so?
Aren’t you a scientist? Aren’t you curious? Isn’t this fascinating?
Hello? Hello? Is there anyone in here who can produce novel thoughts and not just garbled regurgitations of outdated academic discourse? Or should I just go back to talking to GPT-2?