human psycholinguists: a critical appraisal

[-]Alexei6y*221

I’m definitely not an expert, but it does sound like a few people moved the goalposts. It sounds very similar to the "AI can't play chess" argument switching to "well, it's not really playing chess, it's just following rules" and then to *quietly mumbling something about it not having the True representation of chess*.

So, I agree with the core point. GPT-2 is writing! And it's writing pretty damn well. Whatever is left has more to do with the general cognition skill than writing.

[-]nostalgebraist5y200Review for 2019 Review

I wrote this post about a year ago. It now strikes me as an interesting mixture of

Ideas I still believe are true and important, and which are (still) not talked about enough
Ideas that were plausible at the time, but are much less so now
Claims I made for their aesthetic/emotional appeal, even though I did not fully believe them at the time

In category 1 (true, important, not talked about enough):

GPT-2 is a source of valuable evidence about linguistics, because it demonstrates various forms of linguistic competence that previously were only demonstrated by humans.
Much scholarly ink has been spilled over questions of the form "what would it take, computationally, to do X?" -- where X is something GPT-2 can actually do. Since we now have a positive example, we should revisit these debates and determine which claims GPT-2 disproves, and which it supports.
Some of the key participants in those debates are not revisiting them in this way, and appear to think GPT-2 is entirely irrelevant to their work.

In category 2 (plausible then but not now):

"The structure of the transformer is somehow specially apt for language, relative to other architectures that were tried."
- I now think this is much less likely thanks to the 2 OpenAI scaling papers in 2020.
- The first paper made it seem more plausible that LSTMs would behave like GPT-2 if given a much larger quantity of compute/data
- The second paper showed that the things we know about transformers from the text domain generalize very well to image/video/math
- I now think transformers are just a "good default architecture" for our current compute regime and may not have special linguistic properties
I'm finding this difficult to phrase, but in 2019 I think I believed Gary Marcus had similar preconceptions to me but was misreading the current evidence.
- I now think he's more committed to the idea that GPT-2-like approaches are fundamentally barking up the wrong tree, and will maintain this idea in the face of arbitrary feats of competence.

In category 3 (misleading):

I overstated the similarity between what Marcus wanted in 2001, and what has actually occurred.
- I think Marcus wanted neural nets to be designed in a very top-down, constrained way, baking in lots of human prior knowledge.
- ConvNets do bake in (a very simple kind of) prior knowledge.
- But, though LSTMs and transformers are more "structured" than fully connected nets, the structure is not intended to encode prior knowledge.
- Nothing in the recently successful architectures looks like the deliberate design, aimed at enforcing known linguistic regularities, that Marcus argued for.
- I was aware of the vast gap between "more structure than the literal minimum possible" and "the kind of structure Marcus wanted," but conflated the two. Possibly because I thought the resulting irony was appealing, and/or because it was suggested the disagreement was illusory and was thus emotionally appealing.

In sum, I still like the writing and humor in this post, and I think it makes some important observations, but I also think it leaves the reader with some importantly wrong impressions.

[-]Lone Pine4y40

I think Gary Marcus wanted AI research to uncover lots of interesting rules like "in English, you make verbs past tense by adding -ed, except ..." because he wants to know what the rules are, and because engineering following psycholinguistic research is much more appealing to him than the other way around. Machine learning (without interpretability) doesn't give us any tools to learn what the rules are.

[-]Vivek Hebbar4y20

Maybe add a disclaimer at the start of the post?

[-]habryka6y201

Promoted to curated: I found this post quite compelling, and also generally think that understanding how people have historically modelled progress in AI (and to what degree we have beaten benchmarks that we thought previously were quite difficult, and how much goalpost-moving there is) is a pretty important aspect of modeling future developments in the field.

[-]lionhearted (Sebastian Marshall)6y110

Lots of useful ideas here, thanks.

Did you play AI Dungeon yet, by chance?

https://www.aidungeon.io/

Playing it was a bit of a revelation for me. It doesn't have to get much better at all to obsolete the whole lower end of formulaic and derivative entertainment...

[-]ErickBall6y80

If compositionality is necessary, then this sort of “deep learning” implements compositionality, even if this fact is not superficially obvious from its structure.

But compositionality mostly isn't necessary for the kind of writing GPT-2 does. Try getting it to tell you how many legs a dog has. One? Three? Five? It doesn't know, because people rarely write things like "a dog has four legs" in its input data. Here's GPT-2:

A dog has the same number of legs as a man, but has fewer legs than a gorilla. It has a lot of brains, but they are divided equally between the two front legs.

It's very good at coming up with sentences that are grammatically and stylistically correct, but it has no concept of whether they're true. Now, maybe that's just a result of interacting exclusively with language--people generally learn how many legs a dog has by looking at one, not by hearing about them. But even when it comes to purely linguistic issues, it basically doesn't make true statements. This is typified by its habit of contradicting itself (or repeating itself) within the same sentence:

A convincing argument requires certain objects to exist. Otherwise it's not really science. For instance, according to Stromberg, all the atoms in a dog's body exist, and the image of the dog, even though it does not exist, exists (p. 120).

[-]nostalgebraist6y30

Hmm... I think you are technically right, since "compositionality" is typically defined as a property of the way phrases/sentences/etc. in a language relate to their meanings. Since language modeling is a task defined in terms of words, without involving their referents at all, GPT-2 indeed does not model/exhibit this property of the way languages mean things.

But the same applies identically to every property of the way languages mean things! So if this is really the argument, there's no reason to focus specifically on "compositionality." On the one hand, we would never expect to get compositionality out of any language model, whether a "deep learning" model or some other kind. On the other hand, the argument would fail for any deep learning model that has to connect words with their referents, like one of those models that writes captions for images.

If we read the passage I quoted from 2019!Marcus in this way, it's a trivially true point about GPT-2 that he immediately generalizes to a trivially false point about deep learning. I think when I originally read the passage, I just assumed he couldn't possibly mean this, and jumped to another interpretation: he's saying that deep learning lacks the capacity for structured representations, which would imply an inability to model compositionality even when needed (e.g. when doing image captioning as opposed to language modeling).

Fittingly, when he goes on to describe the problem, it doesn't sound like he's talking about meaning but about having flat rather than hierarchical representations:

Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure.

In The Algebraic Mind, Marcus critiqued some connectionist models on the grounds that they cannot support "structured representations." Chapter 4 of the book is called "Structured Representations" and is all about this, mostly focused on meaning (he talks a lot about "structured knowledge") but not at all tied to meaning specifically. Syntax and semantics are treated as equally in need of hierarchical representations, equally impossible without them, and equally possible with them.

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge. So when 2019!Marcus wrote about "compositionality," I assumed that he was making this argument, again, about GPT-2. In that case, GPT-2's proficiency with syntax alone is a relevant datum, because Marcus and conventional linguists believe that syntax alone requires structured representations (as against some of the connectionists, who didn't).

[-][anonymous]6y20

Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge.

For what it's worth, I think you're saying the same thing as my critique about concept modeling, if that's what you're referring to.

[-]Ben Pace5y70Review for 2019 Review

This was a great read at the time and still holds up. It's one of the rare artifacts that can only produced after a decade or two, which is an account of major shifts in a person's perspective over the course of a decade or two. (In that way it's similar in genre for me as Buck's post in the review.)

It's a very excitingly written history, and gives me insight into the different perspectives on the issue of psycholinguistics, and helps me frame the current situation in AI. I expect to vote on this somewhere between +5 and +7.

[-]Zack_M_Davis5y60Nomination for 2019 Review

What does Nostalgebraist do? They fucking write. (Engaging and educational.)

[-]Luke Allen6y60

Having spent some time on the GPT2 Subreddit Simulator watching bots talk like the members of those forums, and having spent a summer caring for my grandmother with dementia, I definitely want to draw chilling parallels between their output.

We've given computers the instinct for talking; now I find myself wondering what would happen if we gave GPT2 ten years of The Wall Street Journal back-issues and then hooked it up to one of those marvelous AIs that runs 401(k) stock market picks.

[-]Richard_Ngo6y60

I broadly agree with the sentiment of this post, that GPT-2 and BERT tell us new things about language. I don't think this claim relies on the fact that they're transformers though - and am skeptical when you say that "the transformer architecture was a real representational advance", and that "You need the right architecture". In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

[-]nostalgebraist6y60

In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).

I think it's misleading to view "amount of inductive bias" as a one-dimensional scale, with the transformer somewhere "between" CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once -- weight sharing between positions, and locality -- and these are two very different things, not just two (perhaps differently sized) injections of "more bias" on our hypothetical 1D bias scale.

For example, locality without weight sharing is certainly conceivable (I can't remember if I've seen it before), but I'd imagine it would do very poorly on text data, because it relaxes the CNN constraint that's appropriate for text while keeping the one that's inappropriate. If you compare that to the transformer, you've got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture's representational aptness for a given domain isn't just a function of some 1D "amount of inductive bias" in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.

As as sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that "moving to a superset" shouldn't be simplified to "reducing some 1D 'bias' variable," I'd also say that "moving to a superset" isn't what happened anyway.

Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?

Disagree. Not that this seems deeply impossible or anything, but it's exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there's less incentive to do it.

On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.

[-]Steven Byrnes6y50

I found a recent paper that ran a fun little contest on whether three seq2seq models (LSTMS2S, ConvS2S and Transformer) are "compositional", for various definitions of "compositional": The compositionality of neural networks: integrating symbolism and connectionism. (Answer: Basically yes, especially the Transformer.) That was somewhat helpful, but I still feel like I don't really understand what exactly these models are learning and how (notwithstanding your excellent Transformer blog post), or how their "knowledge" compares with the models built by the more directly brain-inspired wing of ML (example), or for that matter to actual brain algorithms. I need to think about it more. Anyway, thanks for writing this, it's a helpful perspective on these issues.

[-]Jeff Otieno6y*40

"You need the right architecture. You need, maybe, just maybe, an architecture that can tell us a thing or two about the human brain."

I liked this article. I don't think GPT-2 can tell us anything about how the human brain works, though.

Regardless of how well GPT-2 writes, it does not understand language, remotely. I've taken an excerpt from https://openai.com/blog/better-language-models/ :

"GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset[1] of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text"

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

This reality actually drives one of the key distinctions between human neurology and AI - humans often only need one record in their "training set" in order to begin making wildly accurate predictions, because humans turn that record into a rule that can be immediately generalized, while an AI would be positively useless with so little data.

A good example:

Imagine a human being is introduced to a deer for the first time. They are told that the animal they are looking at is called a "deer". From then on, every single time they see a deer, they will know it's a deer, without any other pieces of data.

In contrast, building an AI that could begin correctly identifying images of deer after being exposed to just one training record, without sacrificing the AIs ability to be as discerning as it needs to be (is that a bird in that brush?), is extraordinarily out of reach at the moment.

EDIT: Also thought I'd point out that GPT-2's efficacy declines substantially as the length of its writing increases (I forget the exact number but after 300 words or something it all goes to mumbo jumbo). That, to me, strongly indicates that GPT-2 is not imitating human neurology at all.

[-][anonymous]6y40

What you describe is correct about GPT-2 and a correct response to the article, but be careful to not over-generalize. There are ways of making AIs with "one-shot learning."

[-]Jeff Otieno6y20

I had a sense I was kind of overstepping when I wrote that...

Do those AI frameworks tend to be very discerning though? I imagine they tend to have high recall and low precision on valid test cases too dissimilar from the single training case.

[-][anonymous]6y30

The Wikipedia article is pretty good on this subject:

https://en.wikipedia.org/wiki/One-shot_learning

[-]nostalgebraist6y30

I should say first that I completely agree with you about the extreme data inefficiency of many systems that get enthusiastically labeled "AI" these days -- it is a big problem which calls into question many claims about these systems and their displays of "intelligence."

Especially a few years ago (the field has been getting better about this over time), there was a tendency to define performance with reference to some set collection of tasks similar to the training task without acknowledging that broader generalization capacity, and generalization speed in terms of "number of data points needed to learn the general rule," are key components of any intuitive/familiar notion of intelligence. I've written about this in a few places, like the last few sections of this post, where I talk about the "strange simpletons."

However, it's not clear to me that this limitation is inherent to neural nets or to "AI" in the way you seem to be saying. You write:

Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.

If I understand you correctly, you're taking a position that Marcus argued against in The Algebraic Mind. I'm taking Marcus' arguments there largely as a given in this post, because I agree with them and because I was interested specifically in the way Marcus' Algebraic Mind arguments cut against Marcus' own views about deep learning today.

If you want to question the Algebraic Mind stuff itself, that's fine, but if so you're disagreeing with both me and Marcus more fundamentally than (I think) Marcus and I disagree with one another, and you'll need a more fleshed-out argument if you want to bridge a gulf of this size.

[-]Ben Pace5y*20Nomination for 2019 Review

I found this a very valuable history of the field, by you telling the evolution and history of your understanding of the field. This helped me understand linguistics and the import of GPT-2 much better.

[-]Andre Infante6y20

I think Gary Marcus and his crowd are largely ridiculous in their criticism of GPT-2. Nobody coming at the situation with open-eyes would deny that this is such a massive leap forward in language models that it makes everything else ever tried look embarrassing. The goalposts have moved so far that the players on the field can't see them without a good telescope.

However, I do think their criticism does highlight some interesting properties of these systems. They're right that you can pose reasoning problems to Transform-based language models that these problems really struggle with (it seems to have a great deal of difficulty counting characters). The architecture also scales poorly to long samples of text because of the GIGO death spiral problem. However, properly trained on a toy dataset, it can do multi-step logical reasoning with a high degree of accuracy (although, it's worth noting, not infinitely long). So it's certainly not entirely incapable of reproducing symbolic reasoning, but it has several major deficits in this respect.

If anything, Transformer-based language models reminds me of some accounts of people with serious but localized brain damage: people who don't have intact mental faculties, but can still speak relatively coherently. I think maybe the best model for the situation is that TBLM architectures are *capable of* logic, but not *well suited to* logic. If you train these things only on logical problems, you can force them to learn to model logic, to some extent, but it's fundamentally kind of an awkward fit. TBLMs are great at sequence problems and just okay at hierarchical reasoning. You can kind of see this in the deficits that remain in the stupidly large 10+ billion parameter models various big tech companies have been training, and their steeply diminishing returns. Some problems don't get solved with just a very big transformer.

It may be that there's some other "brain region" / network architecture needed in order to extend these systems to perform well on text-based problem-solving in general. But if that's where the goal-posts are now, we certainly live in exciting times.

[-]ErickBall6y20

Wait, why is it impossible for a fully-connected network trained by backpropagation to generalize across unseen input and output nodes? Is this supposed to be obvious?

[-]nostalgebraist6y40

For the full argument from Marcus, read the parts about "training independence" in The Algebraic Mind ch. 2, or in the paper it draws from, "Rethinking Eliminative Connectionism."

The gist is really simple, though. First, note that if some input node is always zero during training, that's equivalent to it not being there at all: their contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t these weights is zero (because you'll always multiply by zero when doing the chain rule), so they'll never get updated from their initial values.

Then observe that, if the nodes are any nonzero constant value during training, the connections add a constant to the first hidden layer inputs instead of zero. But we already have a parameter for an additive constant in a hidden layer input: the "bias." So if the input node is supposed to carry some information, the network still can't learn what it is; it just thinks it's updating the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that's always constant, or as N such nodes.)

The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it'd work to set the weights to zero and the bias to $f^{- 1} (c)$ where $f$ is the activation function and $c$ is the constant. If the output has any relationship to the input then this is wrong, but the training data plus the update rule give you no reason to reject it.

None of this is controversial and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which sets the weights for constant input nodes using patterns learned from non-constant ones rather than randomly/arbitrarily.

[-][anonymous]6y20

GPT-2 is not that interesting, sorry. It is not a model of how humans think, so don’t make the mistake of thinking you can draw accurate inferences from it with respect to how humans think without a heck of a lot of caveats. It also doesn’t solve any of the “hard” problems about semantics and speech production. It basically shows that large data and statistics, taken to ridiculous limits, can potentially be used to automate lower level aspects of speech such as word choice. But that’s it really, and even there we don’t have high understanding of how to couple something like GPT-2 to a higher level thinking machine.

[-]JamesFaville6y30

This comment does not deserve to be downvoted; I think it's basically correct. GPT-2 is super-interesting as something that pushes the bounds of ML, but is not replicating what goes on under-the-hood with human language production, as Marcus and Pinker were getting at. Writing styles don't seem to reveal anything deep about cognition to me; it's a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.

[-]Vaniver6y240

Writing styles don't seem to reveal anything deep about cognition to me; it's a question of word/punctuation choice, length of sentences, and other quirks that people probably learn associatively as well.

But isn't it interesting that the way human linguists thought word/punctuation choice worked in humans failed to produce human-like speech, and yet GPT-2 successfully produces human-like speech? Yes, obviously, it's the babbler instead of the full brain. But that definitely lines up with my internal experience, where I have some 'conceptual realm' that hands concepts off to a babbler, which then generates sentences, in a way that lines up with how GPT-2 seems to operate (where I can confidently start a sentence and not know how I'll finish it, and then it's sensible by the time I get there).

[-][anonymous]6y*30

That’s not a novel result though. We’ve basically known those aspects of speech to be associative for decades. Indeed it is pretty hard to explain many frequent errors in human speech models without associative generative models. There are some outliers, like Chomsky, who persist in pushing unrealistic models of human speech, but for the most part the field has assumed something like the Transformer model is how the lower levels of speech production worked.

Now reducing that assumption to practice is a huge engineering accomplishment which I don’t mean to belittle. But the OP is wondering why linguists are not all infatuated with GPT-2. The answer is that there wasn’t that much to be learned from a theorist perspective. They already assigned >90% probability that GPT-2 models something like how speech production works. So having it reduced to practice isn’t that big of an update, in terms of Bayesian reasoning. It’s just the wheel of progress turning forward.

[-]Tomás B.6y*270

They already assigned >90% probability that GPT-2 models something like how speech production works.

Is that truly the case? I recall reading Corey Washington a former linguist (who left the field for neuroscience in frustration with its culture and methods) claim that when he was a linguist the general attitude was there was no way in hell something like GPT-2 would ever work even close to the degree that it does.

Found it:

Steve: Corey’s background is in philosophy of language and linguistics, and also neuroscience, and I have always felt that he’s a little bit more pessimistic than I am about AGI. So I’m curious — and answer honestly, Corey, no revisionist thinking — before the results of this GPT-2 paper were available to you, would you not have bet very strongly against the procedure that they went through working?

Corey: Yes, I would’ve said no way in hell actually, to be honest with you.

Steve: Yes. So it’s an event that caused you to update your priors.

Corey: Absolutely. Just to be honest, when I was coming up, I was at MIT in the mid ’80s in linguistics, and there was this general talk about how machine translation just would never happen and how it was just lunacy, and maybe if they listened to us at MIT and took a little linguistics class they might actually figure out how to get this thing to work, but as it is they’re going off and doing this stuff which is just destined to fail. It’s a complete falsification of that basic outlook, which I think, — looking back, of course — had very little evidence — it had a lot of hubris behind it, but very little evidence behind it.

I was just recently reading a paper in Dutch, and I just simply… First of all, the OCR recognized the Dutch language and it gave me a little text version of the page. I simply copied the page, pasted it into Google Translate, and got a translation that allowed me to basically read this article without much difficulty. That would’ve been thought to be impossible 20, 30 years ago — and it’s not even close to predicting the next word, or writing in the style that is typical of the corpus.

[-]Wei Dai6y80

Do you know any promising theories of the higher levels of speech production (i.e., human verbal/symbolic reasoning)? That seems to me to be one of the biggest missing pieces at this point of a theoretical understanding of human intelligence (and of AGI theory), and I wonder if there's actually good theoretical work out there that I'm just not aware of.

[-]Steven Byrnes6y30

for the most part the field has assumed something like the Transformer model is how the lower levels of speech production worked

Can you be more specific about what you mean by "something like the Transformer model"? Or is there a reference you recommend? I don't think anyone believes that there are literally neurons in the brain wired up into a Transformer, or anything like that, right?

[-]JamesFaville6y10

As far as I'm aware, there was not (in recent decades at least) any controversy that word/punctuation choice was associative. We even have famous psycholinguistics experiments telling us that thinking of the word "goose" makes us more likely to think of the word "moose" as well as "duck" (linguistic priming is the one type of priming that has held up to the replication crisis as far as I know). Whenever linguists might have bothered to make computational models, I think those would have failed to produce human-like speech because their associative models were not powerful enough.

[-]nostalgebraist6y110

The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase "word choice."

If "word choice" just means something narrow like "selecting which noun you want to use, given that you are picking the inhabitant of a 'slot' in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey," then perhaps priming and other results about perceptions of "word similarity" might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I'm aware) believe GPT-2 works like this.

If "word choice" means something bigger that encompasses syntax, then priming experiments about single words don't tell us much about it.

I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2's stylistic fluency is less surprising than its syntactic fluency. In fact, I think that's true -- intellectually, I am more impressed by the syntax than the style.

But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam's Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.

It's more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena -- as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style "for free" out of a model that also does good syntax (or, if you prefer, good syntax "for free" out of a model that also does good style) suggests we might be scientifically on the right track.

[-][anonymous]6y20

Neither one is surprising to me at all. In fact I don't think there is a sharp divide between syntax and style--syntax is that word which we assign to culturally shared style. That's why we can define specialized syntaxes for dialectal differences. And as a structural rule, syntax/style is very relevant to word choice since it prohibits certain combinations. A big enough network will have a large enough working memory to "keep in mind" enough contextual information to effectively satisfy the syntax rules describing the styles it learned.

[-]RedMan6y10

Ability to write is probably independent of other skills. Just look at James Joyce's aphasia as reflected in Finnegan's wake. Who would expect anything of intellectual value to come out of a language generator trained on internet news articles?

I wonder how gpt-2 does if it is trained on the contents of arxiv.

[-][anonymous]6y30

I wonder how gpt-2 does if it is trained on the contents of arxiv.

What do you expect it to be like?

[-]RedMan6y10

More awesome than my puny mind can imagine.

I'd like the raw model to be trained on raw copies of as many mathematical papers and texts as possible, with 'impact factor' used as weights.

I'd also, while I'm dreaming, like to see it trained on only the math, without the prose, and a second model trained to generate the prose of math papers solely from the math contained within.

I think math papers are a better source than reddit news articles because pure mathematics is systematic, and all concepts, at least in theory, can be derived from first principles covered in other places.

Ideally, the system would generate papers with named authors and citation lists that help guide the operators to humans capable of reviewing the work.

If you believe that one single useful mathematical insight could be found with my proposed approach, it's borderline criminal to not devote effort to getting it built.

[-][anonymous]6y50

I don’t believe any mathematical insights could be gained from that approach. That’s not what GPT-2 does. It does not form original concepts: that would require a reasoning and inference engine it lacks. It merely remixes its training data. What you’ll get out is a nonsensical paper that is formatted and structured to look like an academic paper but with content that only takes the appearance of meaning. Basically just lorum ipsum at a higher level.

In a sense it has already been done. There are papers that have been generated with similar tools and submitted to actual reviewed journals—and accepted, although that was more of a proof to show that these journals didn’t actually do any peer review. (The field was philosophy IIRC, not mathematics, but that should be irrelevant to this point.)

[-]RedMan6y10

I assert that if gpt-2 can write interesting looking poetry, it can probably do interesting mathematics.

I think that there is a wide space between 'boring and useless' and 'groundbreakingly insightful', and that this particular system can generate things in that space.

I think my view here is 'less than cautious' optimism. I am not sure what it takes to justify the expenditure to openai to test this assertion. It sounds like a fairly expensive project (data collection, training time), so better people than I will have to decide to throw money at it, and that decision will be made using criteria that are opaque to me.

[-][anonymous]6y*40

It can write pretty looking poetry. It can learn to mimc metric and rhyme, and replay constructs it has seen before, which imitates poetry of a certain taste. It cannot write interesting poetry. That would require conceptual modeling of what it is writing and goal-driven thought processes which it lacks entirely.

It's important to have a technical understanding of these things so that you can separate potentially good ideas from what would ultimately be a waste of time. GPT-2 cannot formulate original concepts. It's just architecturally not possible.

[-]RedMan6y*30

My standard for interesting poetry is clearly different from (inferior to?) yours. If I understand you correctly, I predict that you think artwork created with StyleGAN by definition cannot have artistic merit on its own.

So we appear to be at an impasse. I do not see how you can simultaneously dismiss the value of the system for generating things with artistic merit (like poetry, mathematics, or song lyrics), and simultaneously share the anxieties of the developers about its' apparent effectiveness at generating propaganda.

AI systems have recently surprised people by being unusually good at strange things, I think opimism for a creative profession like pure math is warranted. In short, the potential payoff (contributions to pure math) is massive, the risk is just an amount of money that in this industry is actually fairly small and the egos of people who believe that 'their' creative field (math) could not be conquered by ML models that can only do 'derivative' things.

I assert that at some point in the next two years, there will exist an AI engine which when given the total body of human work in mathematics and a small prompt (like the one used in gpt-2), is capable of generating mathematical works that humans in the field find interesting to read, provided of course that someone bothers to try.

If the estimated cost for actually training the model I described above, and thus ending this discussion, drops below $1000, and it has not been done, I will simply do it.

Edit: gwern actually did an arxiv paper from gpt-3, it was kind of interesting but probably not all that useful. I do not think I'll be putting money up on this anytime soon, but do persist in my belief that within the next year, a mathematician would be able to use AI systems to generate mathematical ideas that are interesting to professionals in the field. I feel that someone with more knowledge than me could probably make the argument that the solution to protein folding represents an achievement in this space.

[-][anonymous]6y40

You are looking for originality. GPT-2 offers remixing. These are not the same thing. I don't know how to say that more clearly.

[-]RedMan6y10

The disagreement is about whether 'remixing' can result in 'originality'.

We are in agreement about the way gpt-2 works, and the types of outputs it produces, just disagreeing about whether they meet our criteria for 'interesting' or 'original'. I believe that our definitions of those two things necessarily include a judgement call about the way we feel about 'orginality' and 'insight' as a human phenomenon.

Some attempts to explicate this agreement to see if I understand your position:

I argue that this track, which is nothing but a mashup of other music, stands as an interesting creative work in its' own right. I suspect that you disagree, as it is just 'remixing': https://www.youtube.com/watch?v=YFg5q2hSl2E&app=desktop

I would also believe that gpt-2, properly trained on the whole of the Talmud (and nothing else), with the older stuff prioritized, could probably produce interesting commentary, particular if specific outputs are seeded with statements like 'today this <thing> happened so therefore'.

I think you would ascribe no value to such commentary, due to the source being a robot remixer, rather than a scholar, regardless of the actual words in the actual output text.

If I remember the gpt-2 reddit thread correctly, most comments were trash, some of them made reading the rest of it worthwhile to me.

Just like a 'real' reddit thread.

[-][anonymous]6y30

Don't be confused about the reuse of the same word: 'remix' has a different meaning in applying to the YouTube video that you linked. To create that music video a creative, goal-directed general intelligence (the artist) took some source/training material and assembled an original song by making use of the artist's generative concept modeling. That capability does not exist in the case of GPT-2, whose 'remixing' is more along the lines of replay than reimagine.

To use another art example, traditional animation is performed by skilled artists creating keyframe, and then junior artists doing the 'in-between' interpolative drawing. Making the keyframe requires originality. The in-betweening takes a certain amount of skill and there is some artistry involved, but it is fundamentally still just interpolation and there's a reason the keyframer, not the in-betweener puts their name on the final product.

GPT-2 is doing verbal in-between'ing. You're asking it to make keyframes. I'm saying it's not going to work, at least not any better than monkeys mashing keyboards will generate Shakespeare.

[-]RedMan6y10

So in your analogy, would the 'seed text' provided to gpt-2 be analogous to a single keyframe, provided to an artist, and gpt-2s output be essentially what happens when you provide an interpolator (I know nothing about the craft of animation and am probably using this word wrong) a 'start' frame but no 'finish' frame?

I would argue that an approach in animation, where a keyframe artist is not sure exactly where to go with a scene, so he draws the keyframe, hands it to interpolative animators with the request to 'start drawing where you think this is going', and looks at the results for inspiration for the next 'keyframe' will probably result in a lot of wasted effort by the interpolators, and is probably inferior (in terms of cost and time) to plenty of other techniques available to the keyframe artist; but also that it has a moderate to high probability of eventually inspiring something useful if you do it enough times.

In that context, I would view the unguided interpolation artwork as 'original' and 'interesting', even though the majority of it would never be used.

Unlike the time spent by animators interpolating, running trained gpt-2 is essentially free. So, in absolute terms, this approach, even if it produces garbage the overwhelming majority of the time, which it will, is moderate to very likely to find interesting approaches with a low, but reasonable for human reviewers, probability (meaning, the human must review dozens of worthless outputs, not hundreds of millions like the monkeys on typewriters).

I suspect that a mathematician with the tool I proposed could type in a thesis, see what emerges, and have a moderate to high probability of eventually encountering some text that inspires something like the following thought: 'well, this is clearly wrong, but I would not have thought to associate this thesis with that particular technique, let me do some work of my own and see if there is anything to this'.

I view the output in that particular example to be 'encountering something interesting', and the probability if it occurring at least once if my proposed tool were to be developed to be moderate to high, and that the cost in terms of time spent reviewing outputs would not be high enough to make the approach have negative value to the proposed user community.

I price the value of bringing this tool into existence in terms of the resources available to me personally as 'worth a bit less than $1000 usd'.

[-][anonymous]6y00

So in your analogy, would the 'seed text' provided to gpt-2 be analogous to a single keyframe, provided to an artist, and gpt-2s output be essentially what happens when you provide an interpolator (I know nothing about the craft of animation and am probably using this word wrong) a 'start' frame but no 'finish' frame?

Precisely. Also, are you familiar with Google's DeepDream?

https://en.wikipedia.org/wiki/DeepDream

GPT-2 is best described IMHO as "DeepDream for text." They use different neural network architectures, but that's because analyzing images and natural language require different architectures. Fundamentally their complete-the-prompt-using-training-data design is the same.

And while DeepDream creates incredibly surreal visual imagery, it is simply not capable of deep insights or originality. It really just morphs an image into a reflection of its training data, but in a fractal complex way that has the surface appearance of meaning. So too does GPT-2 extend a prompt as a reflection of its training data, but without any deep insight or understanding.

I would argue that an approach in animation, where a keyframe artist is not sure exactly where to go with a scene, so he draws the keyframe, hands it to interpolative animators with the request to 'start drawing where you think this is going', and looks at the results for inspiration for the next 'keyframe' will probably result in a lot of wasted effort by the interpolators, and is probably inferior (in terms of cost and time) to plenty of other techniques available to the keyframe artist; but also that it has a moderate to high probability of eventually inspiring something useful if you do it enough times.

Your problem (excuse me) is that you keep imagining these scenarios with a human actor as the inbetweener or the writer. Yes, if even an artist with basic skills is given the opportunity to 'in-between' without limits and invent their own animation, some of them are bound to be good. But if you hand the same keyframe to DeepDream and expect it to 'interpolate' the next fame, and then the frame after, etc., you'd be crazy to expect anything other than fractal replay of its training data. That's all it can do.

All GPT-2 can do is replay remixed versions of its own training data based on the prompts. Originality is not in the architecture.

But by all means, spend your $1000 on it. Maybe you'll learn something in the process.

[-]gwern6y30

GPT-2 is best described IMHO as "DeepDream for text." They use different neural network architectures, but that's because analyzing images and natural language require different architectures. Fundamentally their complete-the-prompt-using-training-data design is the same.

If by 'fundamentally the same' you mean 'actually they're completely different and optimize completely different things and give completely different results on completely different modalities', then yeah, sure. (Also, a dog is an octopus.) DeepDream is a iterative optimization process which tries to maximize the class-ness of an image input (usually, dogs); a language model like GPT-2 is predicting the most likely next observation in a natural text dataset which can be fed its own guesses. They bear about as much relation as a propaganda poster and a political science paper.

[-][anonymous]6y20

Gwern, I respect you but sometimes you miss the mark. I was describing a particular application of deep dream in which the output is fed in as input, which doesn’t strike me as any different from your own description of GPT-2.

A little less hostility in your comment and it would be received better.

[-]gwern6y20

Feeding in output as input is exactly what is iterative about DeepDream, and the scenario does not change the fact that GPT-2 and DeepDream are fundamentally different in many important ways and there is no sense in which they are 'fundamentally the same', not even close.

And let's consider the chutzpah of complaining about tone when you ended your own highly misleading comment with the snide

But by all means, spend your $1000 on it. Maybe you’ll learn something in the process.

[-][anonymous]6y20

There was no snide there. I honestly think he’ll learn something of value. I don’t think he’ll get the result he wanted, but he will learn something in the process.

[-]RedMan2y10

https://deepmind.google/discover/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/

I'm gonna say I won this one.

[-]RedMan2mo10

I've seen suggestions that GPT-5 is doing original math, though people in the comments are disagreeing: https://www.reddit.com/r/artificial/comments/1mw5512/gpt5_just_casually_did_new_mathematics_it_wasnt/

6 years on, I'm still happy with my comments, but the person who was more upvoted than me appears much less happy, as they have gone anonymous.

[-]gwern6y30

I predict that you think artwork created with StyleGAN by definition cannot have artistic merit on its own.

Which is amusing because when people look at StyleGAN artwork and they don't realize it, like my anime faces, they often quite like it. Perhaps they just haven't seen anime faces drawn by a true Scotsman yet.

[-]RedMan6y10

When TWDNE went up, I asked 'how long will I have to read and mash refresh before I see a cute face with a plot I would probably be willing to watch while bored at 2am' The answer was 'less than 10minutes', and this is either commentary on the effectiveness of the tool, or on my (lack of?) taste.

I have a few pieces of artwork I've made using StyleGAN that I absolutely love, and absolutely could not have made without the tool.

When I noticed a reply from 'gwern', I admit was mildly concerned that there would be a link to a working webpage and a paypal link, I'm pretty enthusiastic about the idea but have not done anything at all to pursue it.

Do you think training a language model, whether it is GPT-2 or a near term successor entirely on math papers could have value?

[-]gwern6y30

When I noticed a reply from ‘gwern’, I admit was mildly concerned that there would be a link to a working webpage and a paypal link

Oh, well, if you want to pay for StyleGAN artwork, that can be arranged.

Do you think training a language model, whether it is GPT-2 or a near term successor entirely on math papers could have value?

No, but mostly because there are so many more direct approaches to using NNs in math, like (to cite just the NN math papers I happened to read yesterday) planning in latent space or seq2seq rewriting. (Just because you can solve math problems in natural language input/output format with Transformers doesn't mean you should try to solve it that way.)

[-]RedMan6y10

Thank you for this!

It seems that my ignorance is on display here, the fact that these papers are new to me shows just how out of touch with the field I am. I am unsurprised that 'yes it works, mostly, but other approaches are better' is the answer, and should not be surprised that someone went and did it.

It looks like the successful Facebook AI approach is several steps farther down the road than my proposal, so my offer is unlikely to provide any value outside of the intellectual exercise for me, so I'm probably not actually going to go through with it--by the time the price drops that far, I will want to play with the newer tools.

Waifulabs is adorable and awesome. I've mostly been using style transfers on still life photos and paintings, I have human waifu selfie to anime art on my to do list but it has been sitting there for a while.

Are you planning integration with DeepAnime and maybe WaveNet so your perfect waifus can talk? Though you would know if that's a desirable feature for your userbase better than I would...

On the topic, it looks like someone could, today, convert a selfie of a partner into an anime face, train wavenet on a collection of voicemails, and train a generator using an archive of text message conversations, so that they could have inane conversations with a robot, with an anime face reading the messages to them with believable mouth movements.

I guess the next step after that would be to analyze the text for inferred emotional content (simple approaches with NLP might get really close to the target here, pretty sure they're already built), and warp the voice/eyes for emotional expression (I think WaveNet can do this for voice, if I remember correctly?

Maybe a deepfake type approach that transforms the anime girls using a palatte of a set of representative emotion faces? I'd be unsurprised if this has already been done, though maybe it's niche enough that it has not been.

This brings to mind an awful idea: In the future I could potentially make a model of myself and provide it as 'consolation' to someone I am breaking up with. Or worse, announce that the model has already been running for two weeks.

I suspect that today older style, still image heavy anime could probably be crafted entirely using generators (limited editing of the writing, no animators or voice actors), is there a large archive of anime scripts somewhere that a generator could train on, or is that data all scattered across privately held archives?

What do you think?

[-]rkarmani6y10

Thank you for the post! You raise some great questions. I’m not an expert, but I’d like to check something. Is it possible that Pinker and Marcus are trying to distinguish blabbering from well-reasoned text? If we take that perspective for a moment, Bert or GPT-2 would appear more reasoned, when their training text is more focused. The broader their training text, the less reasoned their output. It’s just a hypothesis.

[-]humanatbest6y10

"If it talks likes a duck, walks like a duck, it is functionally a duck."

I believe behaviorist/functionalist definitions of mental capacity (incl. language) are important because they represent how language is used in practise. It seems like the linguists you reference have elevated meaning into the test of linguistic competence maybe because they are trying to preserve human exceptionalism.

However, it could also be that linguists are unimpressed with recent ML-driven models of language not only because the models fail on certain probes but also because we've all become habituated to twitter bots and automated phishing scams which both rely on crude (by today's standards) language automation. I've worked at a publisher before which received dozens of manuscripts which were linguistically well formed but void of meaning.

Maybe we're all overly obsessed with essence and authenticity in the same way that a forged painting has zero value even if it is physically (99%) indistinguishable to the original.

[-]Douglas Summers-Stay6y10

I liked how you put this. I've just posted my (approving) response to this on Less Wrong under the title "Does GPT-2 Understand Anything?"

[-]romeostevensit6y10

so do babblers actually have less moral worth?

LESSWRONG
LW

LESSWRONG
LW

185

human psycholinguists: a critical appraisal

185

185