human psycholinguists: a critical appraisal

(The title of this post is a joking homage to one of Gary Marcus’ papers.)

I’ve discussed GPT-2 and BERT and other instances of the Transformer architecture a lot on this blog.  As you can probably tell, I find them very interesting and exciting.  But not everyone has the reaction I do, including some people who I think ought to have that reaction.

Whatever else GPT-2 and friends may or may not be, I think they are clearly a source of fascinating and novel scientific evidence about language and the mind.  That much, I think, should be uncontroversial.  But it isn’t.


When I was a teenager, I went through a period where I was very interested in cognitive psychology and psycholinguistics.  I first got interested via Steven Pinker’s popular books – this was back when Pinker was mostly famous for writing about psychology rather than history and culture – and proceeded to read other, more academic books by authors like Gary Marcus, Jerry Fodor, and John Anderson.

At this time (roughly 2002-6), there was nothing out there that remotely resembled GPT-2.  Although there were apparently quite mature and complete formal theories of morphology and syntax, which could accurately answer questions like “is this a well-formed English sentence?”, no one really knew how these could or should be implemented in a physical system meant to understand or produce language.

This was true in two ways.  For one thing, no one knew how the human brain implemented this stuff, although apparently it did.  But the difficulty was more severe than that: even if you forgot about the brain, and just tried to write a computer program (any computer program) that understood or produced language, the results would be dismal.

At the time, such programs were either specialized academic models of one specific phenomenon – for example, a program that could form the past tense of a verb, but couldn’t do anything else – or they were ostensibly general-purpose but incredibly brittle and error-prone, little more than amusing toys.  The latter category included some programs intended as mere amusements or provocations, like the various chatterbots (still about as good/bad as ELIZA after four decades), but also more serious efforts whose reach exceeded their grasp.  SYSTRAN spent decades manually curating millions of morphosyntactic and semantic facts for enterprise-grade machine translation; you may remember the results in the form of the good old Babel Fish website, infamous for its hilariously inept translations.

This was all kind of surprising, given that the mature formal theories were right there, ready to be programmed into rule-following machines.  What was going on?

The impression I came away with, reading about this stuff as a teenager, was of language as a fascinating and daunting enigma, simultaneously rule-based and rife with endless special cases that stacked upon one another.  It was formalism, Jim, but not as we knew it; it was a magic interleaving of regular and irregular phenomena, arising out of the distinctive computational properties of some not-yet-understood subset of brain architecture, which the models of academics and hackers could crudely imitate but not really grok.  We did not have the right “language” to talk about language the way our own brains did, internally.


The books I read, back then, talked a lot about this thing called “connectionism.”

This used to be a big academic debate, with people arguing for and against “connectionism.”  You don’t hear that term much these days, because the debate has been replaced by a superficially similar but actually very different debate over “deep learning,” in which what used to be good arguments about “connectionism” are repeated in cruder form as bad arguments about “deep learning.”

But I’m getting ahead of myself.  What was the old debate about?

As you may know, the pioneers of deep learning had been pioneering it for many years before it went mainstream.  What we now call “neural nets” were invented step by step a very long time ago, and very early and primitive neural nets were promoted with far too much zeal as long ago as the 60s.

First there was the “Perceptron,” a single-layer fully-connected network with an update rule that didn’t scale to more layers.  It generated a lot of unjustified hype, and was then “refuted” in inimitable petty-academic fashion by Minsky and Papert’s book Perceptrons, a mathematically over-elaborate expression of the simple and obvious fact that no single-layer net can express XOR.  (Because no linear classifier can!  Duh!)

Then the neural net people came back, armed with “hidden layers” (read: “more than one layer”) trained by “backpropagation” (read: “efficient gradient descent”).  These had much greater expressive power, and amounted to a form of nonlinear regression which could learn fairly arbitrary function classes from data.
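To make the XOR point concrete, here is a minimal Python sketch.  Everything in it is illustrative rather than historical: the function names, the weight grid, and the hand-picked weights are assumptions of mine, and the hidden-layer net is wired by hand rather than trained by backpropagation.  The brute-force search is only a sanity check over a finite grid; the real reason it fails is the linear-classifier argument above.

```python
from itertools import product

# Truth table for XOR on two binary inputs.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Threshold activation: fire iff the weighted sum is positive."""
    return 1 if z > 0 else 0

def single_unit(w1, w2, b, x):
    """One linear threshold unit, i.e. a single-layer net with one output."""
    return step(w1 * x[0] + w2 * x[1] + b)

def some_single_unit_fits_xor(grid):
    """Search a finite grid of (w1, w2, b) for a unit that computes XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(single_unit(w1, w2, b, x) == y for x, y in XOR.items()):
            return True
    return False

def two_layer_xor(x):
    """Hand-wired net with one hidden layer: XOR = OR and not AND."""
    h_or = step(x[0] + x[1] - 0.5)    # fires if at least one input is on
    h_and = step(x[0] + x[1] - 1.5)   # fires only if both inputs are on
    return step(h_or - h_and - 0.5)   # fires if OR is on and AND is off

grid = [i / 2 for i in range(-8, 9)]  # -4.0 to 4.0 in steps of 0.5
print(some_single_unit_fits_xor(grid))                        # False
print(all(two_layer_xor(x) == y for x, y in XOR.items()))     # True
```

Two hidden units are enough here, which is the sense in which “more than one layer” buys expressive power that no amount of weight-tuning in a single layer can.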

Some people in psychology became interested in using them as a model for human learning.  AFAIK this was simply because nonlinear regression kind of looks like learning (it is now called “machine learning”), and because of the very loose but much-discussed resemblance between these models and the layered architecture of real cortical neurons.  The use of neural nets as modeling tools in psychology became known as “connectionism.”

Why was there a debate over connectionism?  To opine: because the neural nets of the time (80s to early 90s) really sucked.  Weight sharing architectures like CNN and LSTM hadn’t been invented yet; everything was either a fully-connected net or a custom architecture suspiciously jerry-rigged to make the right choices on some specialized task.  And these things were being used to model highly regular, rule-governed phenomena, like verb inflection – cases where, even when human children make some initial mistakes, those mistakes themselves have a regular structure.

The connectionist models typically failed to reproduce this structure; where human kids typically err by applying a generic rule to an exceptional case (“I made you a cookie, but I eated it” – a cute meme because an authentically childlike one), the models would err by producing inhuman “blends,” recognizing the exception yet applying the rule anyway (“I ated it”).

There were already good models of correct verb inflection, and generally of correct versions of all these behaviors.  Namely, the formal rule systems I referred to earlier.  What these systems lacked (by themselves) was a model of learning, of rule-system acquisition.  The connectionist models purported to provide this – but they didn’t work.
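As a toy illustration of what such a formal rule system looks like for the past tense, here is a minimal Python sketch of a “memory plus default rule” scheme.  It is not any published model: the word lists, the names `IRREGULAR_PAST`, `CHILD_MEMORY` and `past_tense`, and the crude spelling handling are all assumptions made for the example.

```python
# Adult "memory": stored irregular forms (a tiny, hypothetical sample).
IRREGULAR_PAST = {"eat": "ate", "make": "made", "go": "went", "break": "broke"}

# A child who has memorized "made" but not yet "ate" or "broke".
CHILD_MEMORY = {"make": "made"}

def past_tense(stem, memory=IRREGULAR_PAST):
    """Use a memorized irregular if there is one; otherwise apply X -> X + "ed".

    Spelling adjustments such as consonant doubling are ignored here.
    """
    if stem in memory:
        return memory[stem]
    if stem.endswith("e"):
        return stem + "d"
    return stem + "ed"

print(past_tense("eat"))                   # ate
print(past_tense("out-Gorbachev"))         # out-Gorbacheved
print(past_tense("eat", CHILD_MEMORY))     # eated
print(past_tense("make", CHILD_MEMORY))    # made
```

In this sketch the only way to err is the childlike way: when memory has a gap, the default rule fires on the bare stem and you get “eated.”  A blend like “ated” cannot come out, because the rule is never applied to a retrieved irregular form.  What the sketch has nothing to say about is how the dictionary and the rule get learned in the first place.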


In 2001, a former grad student of Pinker’s named Gary Marcus wrote an interesting book called The Algebraic Mind: Integrating Connectionism and Cognitive Science.  As a teenager, I read it with enthusiasm.

Here is a gloss of Marcus’ position as of this book.  Quote-formatted to separate it from the main text, but it’s my writing, not a quote:

The best existing models of many psychological phenomena are formal symbolic ones.  They look like math or like computer programs.  For instance, they involve general rules containing variables, little “X”s that stand in identically for every single member of some broad domain.  (Regular verb inflection takes any X and tacks “-ed” on the end.  As Marcus observes, we can do this on the fly with novel words, as when someone talks of a politician who has “out-Gorbacheved Gorbachev.”)

The connectionism debate has conflated at least two questions: “does the brain implement formal symbol-manipulation?” and “does the brain work something like a ‘neural net’ model?”  The assumption has been that neural nets don’t manipulate symbols, so if one answer is “yes” the other must be “no.”  But the assumption is false: some neural nets really do implement (approximate) symbol manipulation.

This includes some, but not all, of the popular “connectionist” models, despite the fact that any “connectionist” success tends to be viewed as a strike against symbol manipulation.  Moreover (Marcus argues), the connectionist nets that succeed as psychological models are the ones that implement symbol manipulation.  So the evidence is actually convergent: the best models manipulate symbols, including the best neural net models.

Assuming the brain does do symbol manipulation, as the evidence suggests, what remains to be answered is how it does it.  Formal rules are natural to represent in a centralized architecture like a Turing machine; how might they be encoded in a distributed architecture like a brain?  And how might these complex mechanisms be reliably built, given only the limited information content of the genome?

To answer these questions, we’ll need models that look sort of like neural nets, in that they use massively parallel arrays of small units with limited central control, and build themselves to do computations no one has explicitly “written out.”

But, to do the job, these models can’t be the dumb generic putty of a fully-connected neural net trained with gradient descent.  (Marcus correctly observes that those models can’t generalize across unseen input and output nodes, and thus require innate knowledge to be sneakily baked in to the input/output representations.)  They need special pre-built wiring of some sort, and the proper task of neural net models in psychology is to say what this wiring might look like.  (Marcus proposes, e.g., an architecture called “treelets” for recursive representations.  Remember this was before the popular adoption of CNNs, LSTMs, etc., so this was as much a point presaging modern deep learning as a point against modern deep learning; indeed I can find no sensible way to read it as the latter at all.)
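The point about unseen input and output nodes can be seen in miniature with a few lines of Python.  This is a deliberately tiny sketch under loud assumptions: a single linear layer with zero-initialized weights, trained by plain per-example gradient descent to copy its input, on data where the last node is never active.  The function names and hyperparameters are mine, and the setup is an illustration of the general idea rather than a reproduction of any experiment in the book.

```python
def train_identity(examples, n, epochs=200, lr=0.1):
    """Train a linear layer y = W x to reproduce its input (squared error)."""
    # W[j][i] is the weight from input node i to output node j.
    W = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        for x in examples:
            y = [sum(W[j][i] * x[i] for i in range(n)) for j in range(n)]
            for j in range(n):
                err = y[j] - x[j]  # the target is the input itself
                for i in range(n):
                    # The gradient is zero whenever input node i is off.
                    W[j][i] -= lr * err * x[i]
    return W

def predict(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Every 3-bit pattern, padded with a final 0: node 3 is never on in training.
train = [(a, b, c, 0) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
W = train_identity(train, 4)

seen = predict(W, (1, 0, 1, 0))
unseen = predict(W, (1, 0, 1, 1))
print([round(v) for v in seen])    # [1, 0, 1, 0]
print([round(v) for v in unseen])  # [1, 0, 1, 0]
```

The net copies the first three nodes perfectly, but the “copy your input” regularity does not transfer to the fourth node: its weights never received a nonzero gradient, so it outputs 0 where a person who had grasped the rule would say 1.  Getting the right answer there requires building something into the wiring or the representation, which is the kind of innate structure the gloss is talking about.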

Now, this was all very sensible and interesting, back in the early 2000s.  It still is.  I agree with it.

What has happened since the early 2000s?  Among other things: an explosion of new neural net architectures with more innate structure than the old “connectionist” models.  CNNs, LSTMs, recursive networks, memory networks, pointer networks, attention, transformers.  Basically all of these advances were made to solve the sorts of problems Marcus was interested in, back in 2001 – to wire up networks so they could natively encode the right kinds of abstractions for human-like generalization, before they saw any data at all.  And they’ve been immensely successful!

What’s more, the successes have patterns.  The success of GPT-2 and BERT was not a matter of plugging more and more data into fundamentally dumb putty.  (I mean, it involved huge amounts of data, but so does human childhood.)  The transformer architecture was a real representational advance: suddenly, by switching from one sort of wiring to another sort of wiring, the wired-up machines did way better at language.

Perhaps – as the Gary Marcus of 2001 said – when we look at which neural net architectures succeed in imitating human behavior, we can learn something about how the human brain actually works.

Back in 2001, when neural nets struggled to model even simple linguistic phenomena in isolation, Marcus surveyed 21 (!) such networks intended as models of the English past tense.  Here is part of his concluding discussion:

The past tense question originally became popular in 1986 when Rumelhart and McClelland (1986a) asked whether we really have mental rules. Unfortunately, as the proper account of the past tense has become increasingly discussed, Rumelhart and McClelland’s straightforward question has become twice corrupted. Their original question was “Does the mind have rules in anything more than a descriptive sense?” From there, the question shifted to the less insightful “Are there two processes or one?” and finally to the very uninformative “Can we build a connectionist model of the past tense?” The “two processes or one?” question is less insightful because the nature of processes—not the sheer number of processes—is important. […] The sheer number tells us little, and it distracts attention from Rumelhart and McClelland’s original question of whether (algebraic) rules are implicated in cognition.

The “Can we build a connectionist model of the past tense?” question is even worse, for it entirely ignores the underlying question about the status of mental rules. The implicit premise is something like “If we can build an empirically adequate connectionist model of the past tense, we won’t need rules.” But as we have seen, this premise is false: many connectionist models implement rules, sometimes inadvertently. […]

The right question is not “Can any connectionist model capture the facts of inflection?” but rather “What design features must a connectionist model that captures the facts of inflection incorporate?” If we take what the models are telling us seriously, what we see is that those connectionist models that come close to implementing the rule-and-memory model far outperform their more radical cousins. For now, as summarized in table 3.4, it appears that the closer the past tense models come to recapitulating the architecture of the symbolic models – by incorporating the capacity to instantiate variables with instances and to manipulate (here, “copy” and “suffix”) the instances of those variables – the better they perform.

Connectionist models can tell us a great deal about cognitive architecture but only if we carefully examine the differences between models. It is not enough to say that some connectionist model will be able to handle the task. Instead, we must ask what architectural properties are required. What we have seen is that models that include machinery for operations over variables succeed and that models that attempt to make do without such machinery do not.

Now, okay, there is no direct comparison between these models and GPT-2 / BERT.  For these models were meant as fine-grained accounts of one specific phenomenon, and what mattered most was how they handled edge cases, even which errors they made when they did err.

By contrast, the popular transformer models are primarily impressive as models of typical-case competence: they sure look like they are following the rules in many realistic cases, but it is less clear whether their edge behavior and their generalizations to very uncommon situations extend the rules in the characteristic ways we do.

And yet.  And yet … 


In 2001, in the era of my teenage psycho-cognitive-linguistics phase, computers couldn’t do syntax, much less semantics, much less style, tone, social nuance, dialect.  Immense effort was poured into simulating comparatively trivial cases like the English past tense in isolation, or making massive brittle systems like Babel Fish, thousands of hours of expert curation leading up to gibberish that gave me a good laugh in 5th grade.

GPT-2 does syntax.  I mean, it really does it.  It is competent.

A conventionally trained psycholinguist might quibble, asking things like “does it pass the wug test?”  I’ve tried it, and the results are … kind of equivocal.  So maybe GPT-2 doesn’t respond to probes of edge case behavior the way human children do.

But if so, then so much the worse for the wug test.  Or rather: if so, we have learned something about which kinds of linguistic competence are possible in isolation, without some others.

What does GPT-2 do?  It fucking writes.  Short pithy sentences, long flowing beautiful sentences, everything in between – and almost always well-formed, nouns and verbs agreeing, irregulars correctly inflected, big compositional stacks of clauses lining up just the way they’re supposed to.  Gary Marcus was right: you can’t do this with a vanilla fully-connected net, or even with one of many more sophisticated architectures.  You need the right architecture.  You need, maybe, just maybe, an architecture that can tell us a thing or two about the human brain.

GPT-2 fucking writes.  Syntax, yes, and style: it knows the way sentences bob and weave, the special rhythms of many kinds of good prose and of many kinds of distinctively bad prose.  Idioms, colloquialisms, self-consistent little worlds of language.

I think maybe the full effect is muted by those services people use that just let you type a prompt and get a continuation back from the base GPT-2 model; with those you’re asking a question that is fundamentally ill-posed (“what is the correct way to finish this paragraph?” – there isn’t one, of course).  What’s more impressive to me is fine-tuning on specific texts in conjunction with unconditional generation, pushing the model in the direction of a specific kind of writing and then letting the model work freestyle.

One day I fed in some Vladimir Nabokov ebooks on a whim, and when I came back from work the damn thing was writing stuff that would be good coming from the real Nabokov.  In another project, I elicited spookily good, often hilarious and/or beautiful imitations of a certain notorious blogger (curated selections here).  More recently I’ve gotten more ambitious, and have used some encoding tricks together with fine-tuning to interactively simulate myself.  Speaking as, well, a sort of expert on what I sound like, I can tell you that – in scientific parlance – the results have been trippy as hell.

Look, I know I’m deviating away from structured academic point-making into fuzzy emotive goopiness, but … I like words, I like reading and writing, and when I look at this thing, I recognize something.

These machines can do scores of different things that, individually, looked like fundamental challenges in 2001.  They don't always do them “the right way,” by the canons of psycholinguistics; in edge cases they might zig where a human child would zag.  But they do things the right way by the canon of me, according to the linguistic competence of a human adult with properly functioning language circuits in his cortex.

What does it mean for psycholinguistics, that a machine exists which can write but not wug, which can run but not walk?  It means a whole lot.  It means it is possible to run without being able to walk.  If the canons of psycholinguistics say this is impossible, so much the worse for them, and so much the better for our understanding of the human brain.


Does the distinctive, oddly simple structure of the transformer bear some functional similarity to the circuit design of, I don’t know, Broca’s area?  I have tried, with my great ignorance of actual neurobiology, to look into this question, and I have not had much success.

But if there’s anyone out there less ignorant than me who agrees with the Gary Marcus of 2001, this question should be burning in their mind.  PhDs should be done on this.  Careers should be made from the question: what do the latest neural nets teach us, not about “AI,” but about the human brain?  We are sitting on a trove of psycholinguistic evidence so wonderful and distinctive, we didn’t even imagine it as a possibility, back in the early 2000s.

This is wonderful!  This is the food that will feed revolutions in your field!  What are you doing with it?


The answer to that question is the real reason this essay exists, and the reason it takes such an oddly irritable tone.

Here is Gary Marcus in 2001:

When I was searching for graduate programs, I attended a brilliant lecture by Steven Pinker in which he compared PDP [i.e. connectionist -nostalgebraist] and symbol-manipulation accounts of the inflection of the English past tense. The lecture convinced me that I needed to work with Pinker at MIT. Soon after I arrived, Pinker and I began collaborating on a study of children’s over-regularization errors (breaked, eated, and the like). Infected by Pinker’s enthusiasm, the minutiae of English irregular verbs came to pervade my every thought.

Among other things, the results we found argued against a particular kind of neural network model. As I began giving lectures on our results, I discovered a communication problem. No matter what I said, people would take me as arguing against all forms of connectionism. No matter how much I stressed the fact that other, more sophisticated kinds of network models [! -nostalgebraist] were left untouched by our research, people always seem to come away thinking, “Marcus is an anti-connectionist.”

But I am not an anti-connectionist; I am opposed only to a particular subset of the possible connectionist models. The problem is that the term connectionism has become synonymous with a single kind of network model, a kind of empiricist model with very little innate structure, a type of model that uses a learning algorithm known as back-propagation. These are not the only kinds of connectionist models that could be built; indeed, they are not even the only kinds of connectionist models that are being built, but because they are so radical, they continue to attract most of the attention.

A major goal of this book is to convince you, the reader, that the type of network that gets so much attention occupies just a small corner in a vast space of possible network models. I suggest that adequate models of cognition most likely lie in a different, less explored part of the space of possible models. Whether or not you agree with my specific proposals, I hope that you will at least see the value of exploring a broader range of possible models. Connectionism need not just be about backpropagation and empiricism. Taken more broadly, it could well help us answer the twin questions of what the mind’s basic building blocks are and how those building blocks can be implemented in the brain.

What is Gary Marcus doing in 2019?  He has become a polemicist against “deep learning.”  He has engaged in long-running wars of words, on Facebook and Twitter and the debate circuit, with a number of “deep learning” pioneers, most notably Yann LeCun – the inventor of the CNN, one of the first big breakthroughs in adding innate structure to move beyond the generalization limits of the bad “connectionist”-style models.

Here is Gary Marcus in September 2019, taking aim at GPT-2 specifically, after citing a specific continuation-from-prompt that flouted common sense:

Current AI systems are largely powered by a statistical technique called deep learning, and deep learning is very effective at learning correlations, such as correlations between images or sounds and labels. But deep learning struggles when it comes to understanding how objects like sentences relate to their parts (like words and phrases).

Why? It’s missing what linguists call compositionality: a way of constructing the meaning of a complex sentence from the meaning of its parts. For example, in the sentence “The moon is 240,000 miles from the Earth,” the word moon means one specific astronomical object, Earth means another, mile means a unit of distance, 240,000 means a number, and then, by virtue of the way that phrases and sentences work compositionally in English, 240,000 miles means a particular length, and the sentence “The moon is 240,000 miles from the Earth” asserts that the distance between the two heavenly bodies is that particular length.

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure. It can learn that dogs have tails and legs, but it doesn’t know how they relate to the life cycle of a dog. Deep learning doesn’t recognize a dog as an animal composed of parts like a head, a tail, and four legs, or even what an animal is, let alone what a head is, and how the concept of head varies across frogs, dogs, and people, different in details yet bearing a common relation to bodies. Nor does deep learning recognize that a sentence like “The moon is 240,000 miles from the Earth” contains phrases that refer to two heavenly bodies and a length.

“Surprisingly, deep learning doesn’t really have any direct way of handling compositionality.”  But the whole point of The Algebraic Mind was that it doesn’t matter whether something implements a symbol-manipulating process transparently or opaquely, directly or indirectly – it just matters whether or not it implements it, full stop.

GPT-2 can fucking write.  (BTW, since we’ve touched on the topic of linguistic nuance, I claim the expletive is crucial to my meaning: it’s one thing to merely put some rule-compliant words down on a page and another to fucking write, if you get my drift, and GPT-2 does both.)

This should count as a large quantity of evidence in favor of the claim that, whatever necessary conditions there are for the ability to fucking write, they are in fact satisfied by GPT-2’s architecture.  If compositionality is necessary, then this sort of “deep learning” implements compositionality, even if this fact is not superficially obvious from its structure.  (The last clause should go without saying to a reader of The Algebraic Mind, but apparently needs explicit spelling out in 2019.)

On the other hand, if “deep learning” cannot do compositionality, then compositionality is not necessary to fucking write.  Now, perhaps that just means you can run without walking.  Perhaps GPT-2 is a bizarre blind alley passing through an extremely virtuosic kind of simulated competence that will, despite appearances, never quite lead into real competence.

But even this would be an important discovery – the discovery that huge swaths of what we consider most essential about language can be done “non-linguistically.”  For every easy test that children pass and GPT-2 fails, there are hard tests GPT-2 passes which the scholars of 2001 would have thought far beyond the reach of any near-future machine.  If this is the conclusion we’re drawing, it would imply a kind of paranoia about true linguistic ability, an insistence that one can do so much of it so well, can learn to write spookily like Nabokov (or like me) given 12 books and 6 hours to chew on them … and yet still not be “the real thing,” not even a little bit.  It would imply that there are language-like behaviors out there in logical space which aren’t language and which are nonetheless so much like it, non-trivially, beautifully, spine-chillingly like it.

There is no reading of the situation I can contrive in which we do not learn at least one very important thing about language and the mind.


Who cares about “language and the mind” anymore, in 2019?

I did, as a teenager in the 2000s.  Gary Marcus and Steven Pinker did, back then.  And I still do, even though – in a characteristically 2019 turn-of-the-tables – I am supposed to be something like an “AI researcher,” and not a psychologist or linguist.

What are the scholars of language and the mind talking about these days?  They are talking about AI.  They are saying GPT-2 isn’t the “right path” to AI, because it has so many gaps, because it doesn’t look like what they imagined the nice, step-by-step, symbol-manipulating, human-childhood-imitating path to AI would look like.

GPT-2 doesn’t know anything.  It doesn’t know that words have referents.  It has no common sense, no intuitive physics or psychology or causal modeling, apart from the simulations of these things cheap enough to build inside of a word-prediction engine that has never seen or heard a dog, only the letters d-o-g (and c-a-n-i-n-e, and R-o-t-t-w-e-i-l-e-r, and so forth).

And yet it can fucking write.

The scholars of language and the mind say: “this isn’t ‘the path to AI’.  Why, it doesn’t know anything!  It runs before it can walk.  It reads without talking, speaks without hearing, opines about Obama without ever having gurgled at the mobile posed over its crib.  Don’t trust the hype machine.  This isn’t ‘intelligence.’”

And I, an “AI researcher,” say: “look, I don’t care about AI.  The thing can fucking write and yet it doesn’t know anything!  We have a model for like 100 different complex linguistic behaviors, at once, integrated correctly and with gusto, and apparently you can do all that without actually knowing anything or having a world-model, as long as you have this one special kind of computational architecture.  Like, holy shit!  Stop the presses at MIT Press!  We have just learned something incredibly cool about language and the mind, and someone should study it!”

And the scholars of language and the mind go off and debate Yann LeCun and Yoshua Bengio on the topic of whether “deep learning” is enough without incorporating components that look explicitly “symbolic.”  Back in 2001, Marcus (correctly) argued that the bad, primitive connectionist architectures of the time often did manipulate symbols, sometimes without their creators realizing it.  Now the successors of the “connectionist” models, having experimented with innate structure just like Marcus said they should, can do things no one in 2001 even dreamed of … and somehow, absurdly, we’ve forgotten the insight that a model can be symbolic without looking symbolic.  We’ve gone from attributing symbol-manipulation powers to vanilla empiricist models that sucked, to denying those powers to much more nativist models that can fucking write.

What happened?  Where did the psycholinguists go, and how can I get them back?

Here is Steven Pinker in 2019, explaining why he is unimpressed with GPT-2’s “superficially plausible gobbledygook”:

Being amnesic for how it began a phrase or sentence, it won’t consistently complete it with the necessary agreement and concord – to say nothing of semantic coherence. And this reveals the second problem: real language does not consist of a running monologue that sounds sort of like English. It’s a way of expressing ideas, a mapping from meaning to sound or text. To put it crudely, speaking or writing is a box whose input is a meaning plus a communicative intent, and whose output is a string of words; comprehension is a box with the opposite information flow. 

“Real language does not consist of a running monologue that sounds sort of like English.”  Excuse me?  Does the English past tense not matter anymore?  Is morphosyntax nothing?  Style, tone, nuances of diction, tics of punctuation?  Have you just given up on studying language qua language the way Chomsky did, just conceded that whole thing to the evil “deep learning” people without saying so?

Aren’t you a scientist?  Aren’t you curious?  Isn’t this fascinating?

Hello?  Hello?  Is there anyone in here who can produce novel thoughts and not just garbled regurgitations of outdated academic discourse?  Or should I just go back to talking to GPT-2?


I’m definitely not an expert, but it does sound like a few people moved the goalposts. It sounds very similar to the "AI can't play chess" argument switching to "well, it's not really playing chess, it's just following rules" and then to *quietly mumbling something about it not having the True representation of chess*.

So, I agree with the core point. GPT-2 is writing! And it's writing pretty damn well. Whatever is left has more to do with the general cognition skill than writing.

Promoted to curated: I found this post quite compelling, and also generally think that understanding how people have historically modelled progress in AI (and to what degree we have beaten benchmarks that we thought previously were quite difficult, and how much goalpost-moving there is) is a pretty important aspect of modeling future developments in the field. 

I wrote this post about a year ago.  It now strikes me as an interesting mixture of

  1. Ideas I still believe are true and important, and which are (still) not talked about enough
  2. Ideas that were plausible at the time, but are much less so now
  3. Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time

In category 1 (true, important, not talked about enough):

  • GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans.
  • Much scholarly ink has been spilled over questions of the form "what would it take, computationally, to do X?" -- where X is something GPT-2 can actually do.  Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports.
  • Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work.

In category 2 (plausible then but not now):

  • "The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried."
    • I now think this is much less likely thanks to the 2 OpenAI scaling papers in 2020.
    • The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data
    • The second paper showed that the things we know about transformers from the text domain generalize very well to image/video/math
    • I now think transformers are just a "good default architecture" for our current compute regime and may not have special linguistic properties
  • I'm finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had similar preconceptions to me but was misreading the current evidence.
    • I now think he's more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and will maintain this idea in the face of arbitrary feats of competence.

In category 3 (misleading):

  • I overstated the similarity between what Marcus wanted in 2001, and what has actually occurred.
    • I think Marcus wanted neural nets to be designed in a very top-down, constrained way, baking in lots of human prior knowledge.
    • ConvNets do bake in (a very simple kind of) prior knowledge.
    • But, though LSTMs and transformers are more "structured" than fully connected nets, the structure is not intended to encode prior knowledge.
    • Nothing in the recently successful architectures looks like the deliberate design, aimed at enforcing known linguistic regularities, that Marcus argued for.
    • I was aware of the vast gap between "more structure than the literal minimum possible" and "the kind of structure Marcus wanted," but conflated the two.  Possibly because I thought the resulting irony was appealing, and/or because it suggested the disagreement was illusory and was thus emotionally appealing.

In sum, I still like the writing and humor in this post, and I think it makes some important observations, but I also think it leaves the reader with some importantly wrong impressions.

I think Gary Marcus wanted AI research to uncover lots of interesting rules like "in English, you make verbs past tense by adding -ed, except ..." because he wants to know what the rules are, and because engineering following psycholinguistic research is much more appealing to him than the other way around. Machine learning (without interpretability) doesn't give us any tools to learn what the rules are. 

Maybe add a disclaimer at the start of the post?

Lots of useful ideas here, thanks.

Did you play AI Dungeon yet, by chance?

Playing it was a bit of a revelation for me. It doesn't have to get much better at all to obsolete the whole lower end of formulaic and derivative entertainment...

If compositionality is necessary, then this sort of “deep learning” implements compositionality, even if this fact is not superficially obvious from its structure. 

But compositionality mostly isn't necessary for the kind of writing GPT-2 does. Try getting it to tell you how many legs a dog has. One? Three? Five? It doesn't know, because people rarely write things like "a dog has four legs" in its input data. Here's GPT-2:

A dog has the same number of legs as  a man, but has fewer legs than a gorilla.  It has a lot of brains, but they are divided equally between the two front legs. 

It's very good at coming up with sentences that are grammatically and stylistically correct, but it has no concept of whether they're true. Now, maybe that's just a result of interacting exclusively with language--people generally learn how many legs a dog has by looking at one, not by hearing about them. But even when it comes to purely linguistic issues, it basically doesn't make true statements. This is typified by its habit of contradicting itself (or repeating itself) within the same sentence:

A convincing argument requires certain objects to exist. Otherwise it's not really science. For instance, according to Stromberg, all the atoms in a dog's body exist, and the image of the dog, even though it does not exist, exists (p. 120).

Hmm... I think you are technically right, since "compositionality" is typically defined as a property of the way phrases/sentences/etc. in a language relate to their meanings. Since language modeling is a task defined in terms of words, without involving their referents at all, GPT-2 indeed does not model/exhibit this property of the way languages mean things.

But the same applies identically to every property of the way languages mean things! So if this is really the argument, there's no reason to focus specifically on "compositionality." On the one hand, we would never expect to get compositionality out of any language model, whether a "deep learning" model or some other kind. On the other hand, the argument would fail for any deep learning model that has to connect words with their referents, like one of those models that writes captions for images.

If we read the passage I quoted from 2019!Marcus in this way, it's a trivially true point about GPT-2 that he immediately generalizes to a trivially false point about deep learning. I think when I originally read the passage, I just assumed he couldn't possibly mean this, and jumped to another interpretation: he's saying that deep learning lacks the capacity for structured representations, which would imply an inability to model compositionality even when needed (e.g. when doing image captioning as opposed to language modeling).

Fittingly, when he goes on to describe the problem, it doesn't sound like he's talking about meaning but about having flat rather than hierarchical representations:

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure.

In The Algebraic Mind, Marcus critiqued some connectionist models on the grounds that they cannot support "structured representations." Chapter 4 of the book is called "Structured Representations" and is all about this, mostly focused on meaning (he talks a lot about "structured knowledge") but not at all tied to meaning specifically. Syntax and semantics are treated as equally in need of hierarchical representations, equally impossible without them, and equally possible with them.

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge. So when 2019!Marcus wrote about "compositionality," I assumed that he was making this argument, again, about GPT-2. In that case, GPT-2's proficiency with syntax alone is a relevant datum, because Marcus and conventional linguists believe that syntax alone requires structured representations (as against some of the connectionists, who didn't).

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge.

For what it's worth, I think you're saying the same thing as my critique about concept modeling, if that's what you're referring to.

This was a great read at the time and still holds up. It's one of the rare artifacts that can only be produced after a decade or two: an account of major shifts in a person's perspective over the course of that time. (In that way it's similar in genre, for me, to Buck's post in the review.)

It's a very excitingly written history, and gives me insight into the different perspectives on the issue of psycholinguistics, and helps me frame the current situation in AI. I expect to vote on this somewhere between +5 and +7.

What does Nostalgebraist do? They fucking write. (Engaging and educational.)

Having spent some time on the GPT2 Subreddit Simulator watching bots talk like the members of those forums, and having spent a summer caring for my grandmother with dementia, I definitely want to draw chilling parallels between their output.

We've given computers the instinct for talking; now I find myself wondering what would happen if we gave GPT2 ten years of The Wall Street Journal back-issues and then hooked it up to one of those marvelous AIs that runs 401(k) stock market picks.

I broadly agree with the sentiment of this post, that GPT-2 and BERT tell us new things about language. I don't think this claim relies on the fact that they're transformers though - and am skeptical when you say that "the transformer architecture was a real representational advance", and that "You need the right architecture". In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

I think it's misleading to view "amount of inductive bias" as a one-dimensional scale, with the transformer somewhere "between" CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once -- weight sharing between positions, and locality -- and these are two very different things, not just two (perhaps differently sized) injections of "more bias" on our hypothetical 1D bias scale.

For example, locality without weight sharing is certainly conceivable (I can't remember if I've seen it before), but I'd imagine it would do very poorly on text data, because it relaxes the CNN constraint that's appropriate for text while keeping the one that's inappropriate. If you compare that to the transformer, you've got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture's representational aptness for a given domain isn't just a function of some 1D "amount of inductive bias" in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.
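To make the two constraints concrete, here is a minimal sketch in plain Python. The function names, the toy kernels, and the "pattern only ever seen at early positions" setup are illustrative assumptions, not taken from any particular library or paper. It contrasts a 1D convolution (locality plus weight sharing) with a locally connected layer (locality only, with a separate kernel per position).

```python
def conv1d(x, kernel):
    """Locality + weight sharing: the same kernel is reused at every position."""
    k = len(kernel)
    return [sum(kernel[d] * x[p + d] for d in range(k))
            for p in range(len(x) - k + 1)]


def locally_connected1d(x, kernels):
    """Locality only: each output position p has its own kernel, kernels[p]."""
    k = len(kernels[0])
    return [sum(kernels[p][d] * x[p + d] for d in range(k))
            for p in range(len(x) - k + 1)]


# A toy "pattern" (a single spike) placed early vs. late in the sequence.
x_early = [0, 1, 0, 0, 0, 0]
x_late = [0, 0, 0, 1, 0, 0]

# Shared detector: whatever it does to the pattern, it does at every position.
shared_kernel = [1, -1]

# Hypothetical unshared weights: useful values only at positions 0 and 1,
# as if the pattern had only ever appeared early during training.
per_position_kernels = [[1, -1], [1, -1], [0, 0], [0, 0], [0, 0]]

conv_early = conv1d(x_early, shared_kernel)
conv_late = conv1d(x_late, shared_kernel)
lc_early = locally_connected1d(x_early, per_position_kernels)
lc_late = locally_connected1d(x_late, per_position_kernels)

print("conv:", conv_early, conv_late)
print("locally connected:", lc_early, lc_late)
print("parameter counts:", len(shared_kernel),
      sum(len(k) for k in per_position_kernels))
```

In this toy setup the convolution's response simply shifts along with the pattern, while the locally connected layer has nothing useful to say about the late occurrence, and it needs five times as many parameters to cover the same six-element input. That is the sense in which dropping weight sharing while keeping locality relaxes the wrong constraint for text.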

As a sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that "moving to a superset" shouldn't be simplified to "reducing some 1D 'bias' variable," I'd also say that "moving to a superset" isn't what happened anyway.

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

Disagree. Not that this seems deeply impossible or anything, but it's exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there's less incentive to do it.

On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.

I found a recent paper that ran a fun little contest on whether three seq2seq models (LSTMS2S, ConvS2S and Transformer) are "compositional", for various definitions of "compositional": The compositionality of neural networks: integrating symbolism and connectionism. (Answer: Basically yes, especially the Transformer.) That was somewhat helpful, but I still feel like I don't really understand what exactly these models are learning and how (notwithstanding your excellent Transformer blog post), or how their "knowledge" compares with the models built by the more directly brain-inspired wing of ML (example), or for that matter to actual brain algorithms. I need to think about it more. Anyway, thanks for writing this, it's a helpful perspective on these issues.

"You need the right architecture.  You need, maybe, just maybe, an architecture that can tell us a thing or two about the human brain."

I liked this article. I don't think GPT-2 can tell us anything about how the human brain works, though.

Regardless of how well GPT-2 writes, it does not understand language, remotely. I've taken an excerpt from :

"GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text"

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

This reality actually drives one of the key distinctions between human neurology and AI - humans often only need one record in their "training set" in order to begin making wildly accurate predictions, because humans turn that record into a rule that can be immediately generalized, while an AI would be positively useless with so little data.

A good example:

Imagine a human being is introduced to a deer for the first time. They are told that the animal they are looking at is called a "deer". From then on, every single time they see a deer, they will know it's a deer, without any other pieces of data.

In contrast, building an AI that could begin correctly identifying images of deer after being exposed to just one training record, without sacrificing the AI's ability to be as discerning as it needs to be (is that a bird in that brush?), is extraordinarily out of reach at the moment.

EDIT: Also thought I'd point out that GPT-2's efficacy declines substantially as the length of its writing increases (I forget the exact number but after 300 words or something it all goes to mumbo jumbo). That, to me, strongly indicates that GPT-2 is not imitating human neurology at all.


What you describe is correct about GPT-2 and a correct response to the article, but be careful to not over-generalize. There are ways of making AIs with "one-shot learning."

I had a sense I was kind of overstepping when I wrote that...

Do those AI frameworks tend to be very discerning though? I imagine they tend to have high recall and low precision on valid test cases too dissimilar from the single training case.


The Wikipedia article is pretty good on this subject:

I should say first that I completely agree with you about the extreme data inefficiency of many systems that get enthusiastically labeled "AI" these days -- it is a big problem which calls into question many claims about these systems and their displays of "intelligence."

Especially a few years ago (the field has been getting better about this over time), there was a tendency to define performance with reference to some set collection of tasks similar to the training task without acknowledging that broader generalization capacity, and generalization speed in terms of "number of data points needed to learn the general rule," are key components of any intuitive/familiar notion of intelligence. I've written about this in a few places, like the last few sections of this post, where I talk about the "strange simpletons."

However, it's not clear to me that this limitation is inherent to neural nets or to "AI" in the way you seem to be saying. You write:

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

If I understand you correctly, you're taking a position that Marcus argued against in The Algebraic Mind. I'm taking Marcus' arguments there largely as a given in this post, because I agree with them and because I was interested specifically in the way Marcus' Algebraic Mind arguments cut against Marcus' own views about deep learning today.

If you want to question the Algebraic Mind stuff itself, that's fine, but if so you're disagreeing with both me and Marcus more fundamentally than (I think) Marcus and I disagree with one another, and you'll need a more fleshed-out argument if you want to bridge a gulf of this size.

I found this a very valuable history of the field, by you telling the evolution and history of your understanding of the field. This helped me understand linguistics and the import of GPT-2 much better.

I think Gary Marcus and his crowd are largely ridiculous in their criticism of GPT-2. Nobody coming at the situation with open-eyes would deny that this is such a massive leap forward in language models that it makes everything else ever tried look embarrassing. The goalposts have moved so far that the players on the field can't see them without a good telescope.

However, I do think their criticism highlights some interesting properties of these systems. They're right that you can pose reasoning problems to Transformer-based language models that these models really struggle with (they seem to have a great deal of difficulty counting characters, for example). The architecture also scales poorly to long samples of text because of the GIGO death spiral problem. However, properly trained on a toy dataset, it can do multi-step logical reasoning with a high degree of accuracy (although, it's worth noting, not infinitely long). So it's certainly not entirely incapable of reproducing symbolic reasoning, but it has several major deficits in this respect.

If anything, Transformer-based language models remind me of some accounts of people with serious but localized brain damage: people who don't have intact mental faculties, but can still speak relatively coherently. I think maybe the best model for the situation is that TBLM architectures are *capable of* logic, but not *well suited to* logic. If you train these things only on logical problems, you can force them to learn to model logic, to some extent, but it's fundamentally kind of an awkward fit. TBLMs are great at sequence problems and just okay at hierarchical reasoning. You can kind of see this in the deficits that remain in the stupidly large 10+ billion parameter models various big tech companies have been training, and their steeply diminishing returns. Some problems don't get solved with just a very big transformer.

It may be that there's some other "brain region" / network architecture needed in order to extend these systems to perform well on text-based problem-solving in general. But if that's where the goal-posts are now, we certainly live in exciting times.

Wait, why is it impossible for a fully-connected network trained by backpropagation to generalize across unseen input and output nodes? Is this supposed to be obvious?

For the full argument from Marcus, read the parts about "training independence" in The Algebraic Mind ch. 2, or in the paper it draws from, "Rethinking Eliminative Connectionism."

The gist is really simple, though. First, note that if some input node is always zero during training, that's equivalent to it not being there at all: its contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t these weights is zero (because you'll always multiply by zero when doing the chain rule), so they'll never get updated from their initial values.
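The zero-input case can be checked directly with a small sketch. Everything here is an illustrative assumption rather than Marcus's own setup: the tiny tanh network, the XOR-style toy data, the learning rate, and the name `train_tiny_net`. The third input node is zero in every training example, so the gradient with respect to its outgoing weights is multiplied by zero at every step.

```python
import math
import random


def train_tiny_net(data, n_in, n_hidden, steps=200, lr=0.1, seed=0):
    """Train a 1-hidden-layer tanh net with plain SGD on squared error.

    Returns (initial W1, final W1) so the two can be compared.
    """
    rng = random.Random(seed)
    # W1[j][i] connects input i to hidden unit j; w2[j] connects hidden j to output.
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = 0.0
    W1_init = [row[:] for row in W1]

    for _ in range(steps):
        for x, y in data:
            # Forward pass.
            h = [math.tanh(sum(W1[j][i] * x[i] for i in range(n_in)) + b1[j])
                 for j in range(n_hidden)]
            out = sum(w2[j] * h[j] for j in range(n_hidden)) + b2
            err = out - y  # derivative of 0.5 * err**2 w.r.t. out

            # Backward pass (chain rule), updating as we go.
            for j in range(n_hidden):
                dpre = err * w2[j] * (1 - h[j] ** 2)  # uses w2[j] before its update
                w2[j] -= lr * err * h[j]
                b1[j] -= lr * dpre
                for i in range(n_in):
                    # Gradient w.r.t. W1[j][i] is dpre * x[i]: zero whenever x[i] == 0.
                    W1[j][i] -= lr * dpre * x[i]
            b2 -= lr * err
    return W1_init, W1


# Toy data: the third input node is zero in every training example.
data = [
    ([0.0, 0.0, 0.0], 0.0),
    ([1.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], 1.0),
    ([1.0, 1.0, 0.0], 0.0),
]

W1_init, W1_final = train_tiny_net(data, n_in=3, n_hidden=4)

# Weights out of the always-zero node never move; weights out of a live node do.
frozen = all(W1_final[j][2] == W1_init[j][2] for j in range(4))
moved = any(W1_final[j][0] != W1_init[j][0] for j in range(4))
print("weights from the always-zero node unchanged:", frozen)
print("weights from a varying node changed:", moved)
```

Whatever the network learns about the first two inputs, the weights leaving the third node should still hold their random initial values at the end, so anything the net does when that node is finally switched on at test time is arbitrary.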

Then observe that, if the nodes are any nonzero constant value during training, the connections add a constant to the first hidden layer inputs instead of zero. But we already have a parameter for an additive constant in a hidden layer input: the "bias." So if the input node is supposed to carry some information, the network still can't learn what it is; it just thinks it's updating the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that's always constant, or as N such nodes.)
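Spelled out in symbols (my own notation, not Marcus's): suppose input node $m$ is held at a constant value $a$ throughout training, and write $z_j$ for the total input to hidden unit $j$, with weights $w_{ji}$ and bias $b_j$. Then

```latex
z_j = \sum_{i \neq m} w_{ji} x_i + w_{jm} a + b_j
    = \sum_{i \neq m} w_{ji} x_i + b_j' ,
\qquad b_j' = b_j + w_{jm} a ,
\qquad
\frac{\partial L}{\partial w_{jm}}
    = a \, \frac{\partial L}{\partial z_j}
    = a \, \frac{\partial L}{\partial b_j} .
```

So on the training data, $w_{jm}$ only ever appears through the effective bias $b_j'$, and its gradient is just a rescaled copy of the bias gradient. Nothing in training distinguishes "node $m$ carries information" from "the bias is a bit bigger."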

The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it'd work to set the weights to zero and the bias to $f^{-1}(c)$, where $f$ is the activation function and $c$ is the constant. If the output has any relationship to the input then this is wrong, but the training data plus the update rule give you no reason to reject it.
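As a quick sanity check of the constant-output case, here is a minimal sketch. It assumes a sigmoid activation, which is my choice for illustration (the argument doesn't depend on it), and the names `output_unit` and `logit` are hypothetical helpers. With zero weights and the bias set to the inverse activation of the target constant, the unit yields that constant for every input.

```python
import math


def sigmoid(z):
    """Assumed activation function for this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def output_unit(x, weights, bias):
    """A single output node: activation of the weighted sum plus bias."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


c = 0.8                      # the constant the output node took during training
weights = [0.0, 0.0, 0.0]    # ignore the inputs entirely
bias = logit(c)              # bias = inverse activation of the constant

inputs = [[0, 0, 1], [1, 0, 0], [5, -3, 2]]
outputs = [output_unit(x, weights, bias) for x in inputs]
print(outputs)
```

Every output comes out at the training constant (up to floating-point rounding), regardless of the input. This solution has zero training error, so gradient descent has no pressure to move away from it, even if the "real" function depends on the input.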

None of this is controversial and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which sets the weights for constant input nodes using patterns learned from non-constant ones rather than randomly/arbitrarily.


GPT-2 is not that interesting, sorry. It is not a model of how humans think, so don’t make the mistake of thinking you can draw accurate inferences from it with respect to how humans think without a heck of a lot of caveats. It also doesn’t solve any of the “hard” problems about semantics and speech production. It basically shows that large data and statistics, taken to ridiculous limits, can potentially be used to automate lower level aspects of speech such as word choice. But that’s it really, and even there we don’t have high understanding of how to couple something like GPT-2 to a higher level thinking machine.

This comment does not deserve to be downvoted; I think it's basically correct. GPT-2 is super-interesting as something that pushes the bounds of ML, but is not replicating what goes on under-the-hood with human language production, as Marcus and Pinker were getting at. Writing styles don't seem to reveal anything deep about cognition to me; it's a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.

Writing styles don't seem to reveal anything deep about cognition to me; it's a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.

But isn't it interesting that the way human linguists thought word/punctuation choice worked in humans failed to produce human-like speech, and yet GPT-2 successfully produces human-like speech? Yes, obviously, it's the babbler instead of the full brain. But that definitely lines up with my internal experience, where I have some 'conceptual realm' that hands concepts off to a babbler, which then generates sentences, in a way that lines up with how GPT-2 seems to operate (where I can confidently start a sentence and not know how I'll finish it, and then it's sensible by the time I get there).


That’s not a novel result though. We’ve basically known those aspects of speech to be associative for decades. Indeed it is pretty hard to explain many frequent errors in human speech models without associative generative models. There are some outliers, like Chomsky, who persist in pushing unrealistic models of human speech, but for the most part the field has assumed something like the Transformer model is how the lower levels of speech production worked.

Now reducing that assumption to practice is a huge engineering accomplishment which I don’t mean to belittle. But the OP is wondering why linguists are not all infatuated with GPT-2. The answer is that there wasn’t that much to be learned from a theorist perspective. They already assigned >90% probability that GPT-2 models something like how speech production works. So having it reduced to practice isn’t that big of an update, in terms of Bayesian reasoning. It’s just the wheel of progress turning forward.

They already assigned >90% probability that GPT-2 models something like how speech production works.

Is that truly the case? I recall reading Corey Washington, a former linguist (who left the field for neuroscience in frustration with its culture and methods), claim that when he was a linguist the general attitude was that there was no way in hell something like GPT-2 would ever work even close to the degree that it does.

Found it:

Steve: Corey’s background is in philosophy of language and linguistics, and also neuroscience, and I have always felt that he’s a little bit more pessimistic than I am about AGI. So I’m curious — and answer honestly, Corey, no revisionist thinking — before the results of this GPT-2 paper were available to you, would you not have bet very strongly against the procedure that they went through working?

Corey: Yes, I would’ve said no way in hell actually, to be honest with you.

Steve: Yes. So it’s an event that caused you to update your priors.

Corey: Absolutely. Just to be honest, when I was coming up, I was at MIT in the mid ’80s in linguistics, and there was this general talk about how machine translation just would never happen and how it was just lunacy, and maybe if they listened to us at MIT and took a little linguistics class they might actually figure out how to get this thing to work, but as it is they’re going off and doing this stuff which is just destined to fail. It’s a complete falsification of that basic outlook, which I think, — looking back, of course — had very little evidence — it had a lot of hubris behind it, but very little evidence behind it.

I was just recently reading a paper in Dutch, and I just simply… First of all, the OCR recognized the Dutch language and it gave me a little text version of the page. I simply copied the page, pasted it into Google Translate, and got a translation that allowed me to basically read this article without much difficulty. That would’ve been thought to be impossible 20, 30 years ago — and it’s not even close to predicting the next word, or writing in the style that is typical of the corpus.

Do you know any promising theories of the higher levels of speech production (i.e., human verbal/symbolic reasoning)? That seems to me to be one of the biggest missing pieces at this point of a theoretical understanding of human intelligence (and of AGI theory), and I wonder if there's actually good theoretical work out there that I'm just not aware of.

for the most part the field has assumed something like the Transformer model is how the lower levels of speech production worked

Can you be more specific about what you mean by "something like the Transformer model"? Or is there a reference you recommend? I don't think anyone believes that there are literally neurons in the brain wired up into a Transformer, or anything like that, right?

As far as I'm aware, there was not (in recent decades at least) any controversy that word/punctuation choice was associative. We even have famous psycholinguistics experiments telling us that thinking of the word "goose" makes us more likely to think of the word "moose" as well as "duck" (linguistic priming is the one type of priming that has held up to the replication crisis as far as I know). Whenever linguists might have bothered to make computational models, I think those would have failed to produce human-like speech because their associative models were not powerful enough.

The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase "word choice."

If "word choice" just means something narrow like "selecting which noun you want to use, given that you are picking the inhabitant of a 'slot' in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey," then perhaps priming and other results about perceptions of "word similarity" might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I'm aware) believe GPT-2 works like this.

If "word choice" means something bigger that encompasses syntax, then priming experiments about single words don't tell us much about it.

I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2's stylistic fluency is less surprising than its syntactic fluency. In fact, I think that's true -- intellectually, I am more impressed by the syntax than the style.

But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam's Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.

It's more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena -- as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style "for free" out of a model that also does good syntax (or, if you prefer, good syntax "for free" out of a model that also does good style) suggests we might be scientifically on the right track.


Neither one is surprising to me at all. In fact I don't think there is a sharp divide between syntax and style--syntax is that word which we assign to culturally shared style. That's why we can define specialized syntaxes for dialectal differences. And as a structural rule, syntax/style is very relevant to word choice since it prohibits certain combinations. A big enough network will have a large enough working memory to "keep in mind" enough contextual information to effectively satisfy the syntax rules describing the styles it learned.

Ability to write is probably independent of other skills. Just look at James Joyce's aphasia as reflected in Finnegans Wake. Who would expect anything of intellectual value to come out of a language generator trained on internet news articles?

I wonder how gpt-2 does if it is trained on the contents of arxiv.

I wonder how gpt-2 does if it is trained on the contents of arxiv.

What do you expect it to be like?

More awesome than my puny mind can imagine.

I'd like the raw model to be trained on raw copies of as many mathematical papers and texts as possible, with 'impact factor' used as weights.

I'd also, while I'm dreaming, like to see it trained on only the math, without the prose, and a second model trained to generate the prose of math papers solely from the math contained within.

I think math papers are a better source than reddit news articles because pure mathematics is systematic, and all concepts, at least in theory, can be derived from first principles covered in other places.

Ideally, the system would generate papers with named authors and citation lists that help guide the operators to humans capable of reviewing the work.

If you believe that one single useful mathematical insight could be found with my proposed approach, it's borderline criminal to not devote effort to getting it built.


I don’t believe any mathematical insights could be gained from that approach. That’s not what GPT-2 does. It does not form original concepts: that would require a reasoning and inference engine it lacks. It merely remixes its training data. What you’ll get out is a nonsensical paper that is formatted and structured to look like an academic paper but with content that only takes the appearance of meaning. Basically just lorem ipsum at a higher level.

In a sense it has already been done. There are papers that have been generated with similar tools and submitted to actual reviewed journals—and accepted, although that was more of a proof to show that these journals didn’t actually do any peer review. (The field was philosophy IIRC, not mathematics, but that should be irrelevant to this point.)

I assert that if gpt-2 can write interesting looking poetry, it can probably do interesting mathematics.

I think that there is a wide space between 'boring and useless' and 'groundbreakingly insightful', and that this particular system can generate things in that space.

I think my view here is 'less than cautious' optimism. I am not sure what it takes to justify the expenditure to openai to test this assertion. It sounds like a fairly expensive project (data collection, training time), so better people than I will have to decide to throw money at it, and that decision will be made using criteria that are opaque to me.


It can write pretty-looking poetry. It can learn to mimic meter and rhyme, and replay constructs it has seen before, which imitates poetry of a certain taste. It cannot write interesting poetry. That would require conceptual modeling of what it is writing and goal-driven thought processes which it lacks entirely.

It's important to have a technical understanding of these things so that you can separate potentially good ideas from what would ultimately be a waste of time. GPT-2 cannot formulate original concepts. It's just architecturally not possible.

My standard for interesting poetry is clearly different from (inferior to?) yours. If I understand you correctly, I predict that you think artwork created with StyleGAN by definition cannot have artistic merit on its own.

So we appear to be at an impasse. I do not see how you can simultaneously dismiss the value of the system for generating things with artistic merit (like poetry, mathematics, or song lyrics) and share the anxieties of the developers about its apparent effectiveness at generating propaganda.

AI systems have recently surprised people by being unusually good at strange things, so I think optimism for a creative profession like pure math is warranted. In short, the potential payoff (contributions to pure math) is massive; the risk is just an amount of money that in this industry is actually fairly small, plus the egos of people who believe that 'their' creative field (math) could not be conquered by ML models that can only do 'derivative' things.

I assert that at some point in the next two years, there will exist an AI engine which when given the total body of human work in mathematics and a small prompt (like the one used in gpt-2), is capable of generating mathematical works that humans in the field find interesting to read, provided of course that someone bothers to try.

If the estimated cost for actually training the model I described above, and thus ending this discussion, drops below $1000, and it has not been done, I will simply do it.

Edit: gwern actually did an arxiv paper from gpt-3, it was kind of interesting but probably not all that useful. I do not think I'll be putting money up on this anytime soon, but do persist in my belief that within the next year, a mathematician would be able to use AI systems to generate mathematical ideas that are interesting to professionals in the field. I feel that someone with more knowledge than me could probably make the argument that the solution to protein folding represents an achievement in this space.


You are looking for originality. GPT-2 offers remixing. These are not the same thing. I don't know how to say that more clearly.

The disagreement is about whether 'remixing' can result in 'originality'.

We are in agreement about the way gpt-2 works, and the types of outputs it produces, just disagreeing about whether they meet our criteria for 'interesting' or 'original'. I believe that our definitions of those two things necessarily include a judgement call about the way we feel about 'originality' and 'insight' as a human phenomenon.

Some attempts to explicate this agreement to see if I understand your position:

I argue that this track, which is nothing but a mashup of other music, stands as an interesting creative work in its own right. I suspect that you disagree, as it is just 'remixing':

I would also believe that gpt-2, properly trained on the whole of the Talmud (and nothing else), with the older stuff prioritized, could probably produce interesting commentary, particularly if specific outputs are seeded with statements like 'today this <thing> happened so therefore'.

I think you would ascribe no value to such commentary, due to the source being a robot remixer, rather than a scholar, regardless of the actual words in the actual output text.

If I remember the gpt-2 reddit thread correctly, most comments were trash, but some of them made reading the rest of it worthwhile to me.

Just like a 'real' reddit thread.


Don't be confused about the reuse of the same word: 'remix' has a different meaning in applying to the YouTube video that you linked. To create that music video a creative, goal-directed general intelligence (the artist) took some source/training material and assembled an original song by making use of the artist's generative concept modeling. That capability does not exist in the case of GPT-2, whose 'remixing' is more along the lines of replay than reimagine.

To use another art example, traditional animation is performed by skilled artists creating keyframes, and then junior artists doing the 'in-between' interpolative drawing. Making the keyframe requires originality. The in-betweening takes a certain amount of skill and there is some artistry involved, but it is fundamentally still just interpolation, and there's a reason the keyframer, not the in-betweener, puts their name on the final product.

GPT-2 is doing verbal in-between'ing. You're asking it to make keyframes. I'm saying it's not going to work, at least not any better than monkeys mashing keyboards will generate Shakespeare.

So in your analogy, would the 'seed text' provided to gpt-2 be analogous to a single keyframe, provided to an artist, and gpt-2's output be essentially what happens when you provide an interpolator (I know nothing about the craft of animation and am probably using this word wrong) a 'start' frame but no 'finish' frame?

I would argue that an approach in animation, where a keyframe artist is not sure exactly where to go with a scene, so he draws the keyframe, hands it to interpolative animators with the request to 'start drawing where you think this is going', and looks at the results for inspiration for the next 'keyframe' will probably result in a lot of wasted effort by the interpolators, and is probably inferior (in terms of cost and time) to plenty of other techniques available to the keyframe artist; but also that it has a moderate to high probability of eventually inspiring something useful if you do it enough times.

In that context, I would view the unguided interpolation artwork as 'original' and 'interesting', even though the majority of it would never be used.

Unlike the time spent by animators interpolating, running trained gpt-2 is essentially free. So, in absolute terms, this approach, even if it produces garbage the overwhelming majority of the time, which it will, is moderate to very likely to find interesting approaches with a low, but reasonable for human reviewers, probability (meaning, the human must review dozens of worthless outputs, not hundreds of millions like the monkeys on typewriters).

I suspect that a mathematician with the tool I proposed could type in a thesis, see what emerges, and have a moderate to high probability of eventually encountering some text that inspires something like the following thought: 'well, this is clearly wrong, but I would not have thought to associate this thesis with that particular technique, let me do some work of my own and see if there is anything to this'.

I view the output in that particular example to be 'encountering something interesting', and the probability of it occurring at least once, if my proposed tool were to be developed, to be moderate to high, and that the cost in terms of time spent reviewing outputs would not be high enough to make the approach have negative value to the proposed user community.
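The 'at least once' reasoning above can be made concrete with a back-of-the-envelope sketch. Everything here is an assumption for illustration: the function names are hypothetical, the per-sample hit rate `p` is a made-up number rather than anything measured, and samples are treated as independent.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is 'interesting',
    given an assumed per-sample hit rate p."""
    return 1.0 - (1.0 - p) ** n


def expected_reviews(p: float) -> float:
    """Expected number of outputs a reviewer reads before the first hit
    (mean of a geometric distribution with success rate p)."""
    return 1.0 / p


if __name__ == "__main__":
    p = 0.02  # hypothetical: 1 in 50 outputs sparks a useful thought
    for n in (10, 50, 200):
        print(n, round(prob_at_least_one(p, n), 3))
    print("expected reviews:", expected_reviews(p))
```

Under these assumptions the whole disagreement reduces to the size of `p`: at a few percent, 'dozens of worthless outputs' is about right, while a `p` close to zero puts the reviewer back in monkeys-on-typewriters territory.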

I price the value of bringing this tool into existence in terms of the resources available to me personally as 'worth a bit less than $1000 usd'.

So in your analogy, would the 'seed text' provided to gpt-2 be analogous to a single keyframe, provided to an artist, and gpt-2's output be essentially what happens when you provide an interpolator (I know nothing about the craft of animation and am probably using this word wrong) a 'start' frame but no 'finish' frame?

Precisely. Also, are you familiar with Google's DeepDream?

GPT-2 is best described IMHO as "DeepDream for text." They use different neural network architectures, but that's because analyzing images and natural language require different architectures. Fundamentally their complete-the-prompt-using-training-data design is the same.

And while DeepDream creates incredibly surreal visual imagery, it is simply not capable of deep insights or originality. It really just morphs an image into a reflection of its training data, but in a fractally complex way that has the surface appearance of meaning. So too does GPT-2 extend a prompt as a reflection of its training data, but without any deep insight or understanding.

I would argue that an approach in animation, where a keyframe artist is not sure exactly where to go with a scene, so he draws the keyframe, hands it to interpolative animators with the request to 'start drawing where you think this is going', and looks at the results for inspiration for the next 'keyframe' will probably result in a lot of wasted effort by the interpolators, and is probably inferior (in terms of cost and time) to plenty of other techniques available to the keyframe artist; but also that it has a moderate to high probability of eventually inspiring something useful if you do it enough times.

Your problem (excuse me) is that you keep imagining these scenarios with a human actor as the inbetweener or the writer. Yes, if even an artist with basic skills is given the opportunity to 'in-between' without limits and invent their own animation, some of them are bound to be good. But if you hand the same keyframe to DeepDream and expect it to 'interpolate' the next frame, and then the frame after, etc., you'd be crazy to expect anything other than fractal replay of its training data. That's all it can do.

All GPT-2 can do is replay remixed versions of its own training data based on the prompts. Originality is not in the architecture.

But by all means, spend your $1000 on it. Maybe you'll learn something in the process.

GPT-2 is best described IMHO as "DeepDream for text." They use different neural network architectures, but that's because analyzing images and natural language require different architectures. Fundamentally their complete-the-prompt-using-training-data design is the same.

If by 'fundamentally the same' you mean 'actually they're completely different and optimize completely different things and give completely different results on completely different modalities', then yeah, sure. (Also, a dog is an octopus.) DeepDream is an iterative optimization process which tries to maximize the class-ness of an image input (usually, dogs); a language model like GPT-2 is predicting the most likely next observation in a natural text dataset which can be fed its own guesses. They bear about as much relation as a propaganda poster and a political science paper.
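For readers who want the 'fed its own guesses' part spelled out, here is a minimal sketch of autoregressive generation. It is only an illustration: it uses a toy bigram count model as a stand-in, not GPT-2 or any Transformer, and the names `train_bigram` and `generate` are hypothetical. The one thing it shares with a real language model is the loop: predict a distribution over the next token, sample from it, append the sample, repeat.

```python
import random
from collections import defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, n_tokens, seed=0):
    """Extend the prompt one token at a time, feeding each guess back in."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(n_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in the training text
        words = list(followers)
        weights = [followers[w] for w in words]
        # Sample the next token in proportion to its observed frequency.
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return out


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the dog sat on the cat".split()
    model = train_bigram(corpus)
    print(" ".join(generate(model, ["the"], 8)))
```

Nothing here is optimized toward a target the way a DeepDream image is nudged toward 'dog-ness'. The loop just samples from a learned next-token distribution, and a bigram table can only ever emit word pairs it has already seen, which is a far weaker model than the one under discussion.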


Gwern, I respect you but sometimes you miss the mark. I was describing a particular application of deep dream in which the output is fed in as input, which doesn’t strike me as any different from your own description of GPT-2.

A little less hostility in your comment and it would be received better.

Feeding in output as input is exactly what is iterative about DeepDream, and the scenario does not change the fact that GPT-2 and DeepDream are fundamentally different in many important ways and there is no sense in which they are 'fundamentally the same', not even close.

And let's consider the chutzpah of complaining about tone when you ended your own highly misleading comment with the snide

But by all means, spend your $1000 on it. Maybe you’ll learn something in the process.


There was no snide there. I honestly think he’ll learn something of value. I don’t think he’ll get the result he wanted, but he will learn something in the process.

I predict that you think artwork created with StyleGAN by definition cannot have artistic merit on its own.

Which is amusing because when people look at StyleGAN artwork and they don't realize it, like my anime faces, they often quite like it. Perhaps they just haven't seen anime faces drawn by a true Scotsman yet.

When TWDNE went up, I asked 'how long will I have to read and mash refresh before I see a cute face with a plot I would probably be willing to watch while bored at 2am?' The answer was 'less than 10 minutes', and this is either commentary on the effectiveness of the tool, or on my (lack of?) taste.

I have a few pieces of artwork I've made using StyleGAN that I absolutely love, and absolutely could not have made without the tool.

When I noticed a reply from 'gwern', I admit I was mildly concerned that there would be a link to a working webpage and a paypal link. I'm pretty enthusiastic about the idea but have not done anything at all to pursue it.

Do you think training a language model, whether it is GPT-2 or a near-term successor, entirely on math papers could have value?

When I noticed a reply from ‘gwern’, I admit I was mildly concerned that there would be a link to a working webpage and a paypal link

Oh, well, if you want to pay for StyleGAN artwork, that can be arranged.

Do you think training a language model, whether it is GPT-2 or a near-term successor, entirely on math papers could have value?

No, but mostly because there are so many more direct approaches to using NNs in math, like (to cite just the NN math papers I happened to read yesterday) planning in latent space or seq2seq rewriting. (Just because you can solve math problems in natural language input/output format with Transformers doesn't mean you should try to solve it that way.)

Thank you for this!

It seems that my ignorance is on display here; the fact that these papers are new to me shows just how out of touch with the field I am. I am unsurprised that 'yes it works, mostly, but other approaches are better' is the answer, and should not be surprised that someone went and did it.

It looks like the successful Facebook AI approach is several steps farther down the road than my proposal, so my offer is unlikely to provide any value outside of the intellectual exercise for me, so I'm probably not actually going to go through with it--by the time the price drops that far, I will want to play with the newer tools.

Waifulabs is adorable and awesome. I've mostly been using style transfers on still life photos and paintings; I have 'human waifu selfie to anime art' on my to-do list, but it has been sitting there for a while.

Are you planning integration with DeepAnime and maybe WaveNet so your perfect waifus can talk? Though you would know if that's a desirable feature for your userbase better than I would...

On the topic, it looks like someone could, today, convert a selfie of a partner into an anime face, train wavenet on a collection of voicemails, and train a generator using an archive of text message conversations, so that they could have inane conversations with a robot, with an anime face reading the messages to them with believable mouth movements.

I guess the next step after that would be to analyze the text for inferred emotional content (simple approaches with NLP might get really close to the target here, pretty sure they're already built), and warp the voice/eyes for emotional expression (I think WaveNet can do this for voice, if I remember correctly?).

Maybe a deepfake-type approach that transforms the anime girls using a palette of a set of representative emotion faces? I'd be unsurprised if this has already been done, though maybe it's niche enough that it has not been.

This brings to mind an awful idea: In the future I could potentially make a model of myself and provide it as 'consolation' to someone I am breaking up with. Or worse, announce that the model has already been running for two weeks.

I suspect that today older-style, still-image-heavy anime could probably be crafted entirely using generators (limited editing of the writing, no animators or voice actors). Is there a large archive of anime scripts somewhere that a generator could train on, or is that data all scattered across privately held archives?

What do you think?

Thank you for the post! You raise some great questions. I’m not an expert, but I’d like to check something. Is it possible that Pinker and Marcus are trying to distinguish blabbering from well-reasoned text? If we take that perspective for a moment, BERT or GPT-2 would appear more reasoned when their training text is more focused. The broader their training text, the less reasoned their output. It’s just a hypothesis.

"If it talks like a duck, walks like a duck, it is functionally a duck."

I believe behaviorist/functionalist definitions of mental capacity (incl. language) are important because they represent how language is used in practise. It seems like the linguists you reference have elevated meaning into the test of linguistic competence maybe because they are trying to preserve human exceptionalism.

However, it could also be that linguists are unimpressed with recent ML-driven models of language not only because the models fail on certain probes but also because we've all become habituated to twitter bots and automated phishing scams which both rely on crude (by today's standards) language automation. I've worked at a publisher before which received dozens of manuscripts which were linguistically well formed but void of meaning.

Maybe we're all overly obsessed with essence and authenticity in the same way that a forged painting has zero value even if it is physically (99%) indistinguishable from the original.

I liked how you put this. I've just posted my (approving) response to this on Less Wrong under the title "Does GPT-2 Understand Anything?"

so do babblers actually have less moral worth?