The Stochastic Parrot Hypothesis is debatable for the last generation of LLMs

Pierre Peigné

One concern I have is that there are many claims here about what was or was not present in the training data. We don't know what training data GPT-4 used, and it's very plausible that, for instance, lots of things that GPT-3 and GPT-3.5 were asked were used in training, perhaps even with custom, human-written answers. (You did mention that you don't know exactly what it was trained on, but there's still an implicit reliance. So mostly I'm just annoyed that OpenAI isn't even open about the things that don't pose any plausible risks, such as what they train on.)

And this is not to say I disagree - I think the post is correct. I just worry that many of the claims aren't necessarily possible to justify.

Noosphere89 (9mo)

More and more, I'm updating towards the view that a non-trivial fraction (perhaps almost all) of GPT-4's capability is downstream of being able to store the ~entire internet in its mind, which trivializes the problem.

That doesn't mean it's a stochastic parrot, or not intelligent, but it does have implications for a potential future in which AI pretraining alone is scaled:

https://x.com/DimitrisPapail/status/1888325914603516214

https://www.lesswrong.com/posts/i7JSL5awGFcSRhyGF/shortform-2#s6xSyKkDLgpcD9wPw

Quentin FEUILLADE--MONTIXI (2y)

I agree. However, I doubt that the examples from argument 4 are in the training data; I think this is the strongest argument. The different scenarios came out of my own head, and I didn't find any study or similar research with the same criteria as in the appendix (I didn't search a lot, though).

Davidmanheim (2y)

I agree that, tautologically, there is some implicit model that enables the LLM to infer what will happen in the case of the ball. I also think that there is a reasonably strong argument that whatever this model is, it in some way maps to "understanding of causes" - but I also think that there's an argument the other way, that any map between the implicit associations and reality is so convoluted that almost all of the complexity is contained within our understanding of how language maps to the world. This is a direct analog of Aaronson's "Waterfall Argument": there's certainly lots of complexity in the model, but we don't know how complex the map between the model and reality is. Because that map routes through human language, the stochastic parrot argument is, I think, that the understanding is mostly contained in the way humans perceive language.

Fabien Roger (2y)

I think the links to the playground are broken due to the new OAI playground update.

Quentin FEUILLADE--MONTIXI (2y)

Thanks for the catch!

lenivchick (2y)

True, but you can always wriggle out by saying that all of that doesn't count as "truly understanding". Yes, LLMs' capabilities are impressive, but does drawing SVG change the fact that somewhere inside the model all of these capabilities are represented by "mere" number relations?

Do LLMs "merely" repeat the training data? They do, but do they do it "merely"? There is no answer, unless somebody gives a commonly accepted criterion of "mereness".

The core issue with that is of course that since no one has a more or less formal and comprehensive definition of "truly understanding" that everyone agrees with - you can play with words however you like to rationalize whatever prior you had about LLMs.

Substituting one vaguely defined concept of "truly understanding" with another vaguely defined concept of a "world model" doesn't help much. For example, does "this token is often followed by that token" constitute a world model? If not - why not? It is really primitive, but who said a world model has to be complex and have something to do with 3D space or theory of mind to be a world model? Isn't our manifest image of reality also a shadow on the wall, since it lacks "true understanding" of underlying quantum fields or superstrings or whatever, in the same way that a long list of correlations between tokens is a shadow of our world?

The "stochastic parrot" argument has been armchair philosophizing from the start, so no amount of evidence like that will convince people who take it seriously. Even if LLM-based AGI takes over the world - the last words of such a person are going to be "but that's not true thinking". And I'm not using that as a strawman - there's nothing wrong with a priori reasoning as such, unless you're doing it wrong.
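To make the "this token is often followed by that token" case concrete, here is a minimal sketch in Python of such a model: a bigram table that only counts which token follows which. The toy corpus and the names `train_bigram` and `most_likely_next` are made up for illustration, and nothing here is meant to reflect how any actual LLM works.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every token, which tokens follow it and how often."""
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def most_likely_next(follows, token):
    """Return the most frequent successor of `token`, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Hypothetical toy corpus, purely for illustration.
corpus = "the ball falls down . the ball rolls away . the ball falls again .".split()
model = train_bigram(corpus)

print(most_likely_next(model, "ball"))   # "falls": seen twice, versus "rolls" once
print(most_likely_next(model, "piano"))  # None: "piano" never appears in the corpus
```

Whether a table like this already counts as a (very primitive) world model is exactly the question being asked here.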

I think the best response to "stochastic parrot" is asking three questions:

1. What is your criterion of "truly understanding"? Answer concretely in terms of the structure or behavior of the model itself and without circular definitions like "having a world model" which is defined as "conscious experience" and that is defined as "feeling redness of red" etc. Otherwise the whole argument becomes completely orthogonal to any reality at all.

2. Why do you think LLMs do not satisfy that criterion and the human brain does?

3. Why do you think it is relevant for any practical intents and purposes, for example to the question "will it kill you if you turn it on"?

VeritableCB (2y)

I don't think this line of argumentation is actually challenging the concept of stochastic parroting on a fundamental level. The abilities of generative ML to create images, solve math problems, engage in speculation about stories, etc., were all known to the researchers who coined the term; these things you point to, far from challenging the concept of stochastic parrots, are assumed to be true by these researchers.

When you point to these models not understanding how reciprocal relationships between objects work, but apologize for it by reference to its ability to explain who Tom Cruise's mother is, I think you miss an opportunity to unpack that. If we imagine LLMs as stochastic parrots, this is a textbook example: the LLM cannot make a very basic inference when presented with novel information. It only gets this "right" when you ask it about something that's already been written about in its training data many times: a celebrity's mother.

The model is excellent at reproducing reasoning that it has been shown examples of: Tom Cruise has a mother, so we can reason that his mother has a son named Tom Cruise. For your sound example, there is information about how sound propagation works on the internet for the model to draw on. But could the LLM speculate on some entirely new type of physics problem that hasn't been written about before and fed into its model? How far can the model move laterally into entirely new types of reasoning before it starts spewing gibberish or repeating known facts?

You could fix a lot of these problems. I have no doubt that at some point they'll work out how to get ChatGPT to understand these reciprocal relationships. But the point of that critique isn't to celebrate a failure of the model and say it can never be fixed; the point is to look at these edge cases to help understand what's going on under the hood: the model is replicating reasoning it's seen before, and yes, that's impressive, but it cannot reliably apply reasoning to truly novel problem types because it is not reasoning. You may not find that troubling, and that's your prerogative, truly, but I do think it would be useful for you to grapple with the idea that your arguments are compatible with the stochastic parrots concept, not a challenge to it.

the gears to ascension (2y)

the new OAI update has deployed a GPT4 version which was trained with vision, GPT4-turbo. not sure if that changes anything you're saying.

eggsyntax (2y)

Like you I thought this argument had faded into oblivion, but I'm certainly seeing it a lot on twitter currently as people talk about Claude 3 seeming conscious to some people. So I've been thinking about it, and it doesn't seem clear to me that it makes any falsifiable claims. If anyone would find it useful, I can add a list of the relevant claims I see being made in the paper and in the Wikipedia entry on stochastic parrots, and some analysis of whether each is falsifiable.

gwern (2y)

'Stochastic parrots' 2020 actually does make many falsifiable claims. Like the original stochastic parrots paper even included a number of samples of specific prompts that they claimed LLMs could never do. Likewise, their 'superintelligent octopus' example of eavesdropping on (chess, IIRC) game transcripts is the claim that imitation or offline RL for chess is impossible. Lack of falsifiable claims was not the problem with the claims made by eg. Gary Marcus.

The problem is that those claims have generally all been falsified, quite rapidly: the original prompts were entirely soluble by LLMs back in 2020, and it is difficult to accept the octopus claims in the light of results like https://arxiv.org/abs/2402.04494#deepmind . (Which is probably why you no longer hear much about the specific falsifiable claims made by the stochastic parrots paper, even by people still citing it favorably.) But then the goalposts moved.

ryan_greenblatt (2y)

> 'Stochastic parrots' 2020 actually does make many falsifiable claims. [...] The problem is that those claims have generally all been falsified, quite rapidly.

The paper seems quite wrong to me, but I actually don't think any of the specific claims have been falsified other than the specific "three plus five" claim in the appendix.

The specific claims:

1. An octopus trained on just "trivial notes" wouldn't be able to generalize to thoughts on coconut catapults. It doesn't seem clear that this has been falsified, depending on how you define "trivial notes" (which is key). (Let's suppose these notes don't involve any device construction?) Separately, it's not as though human children would generalize...
2. The same octopus, but asked about defending from bears. I claim the same is true as with the prior example.
3. If you train an LLM on just Java code, but with all references to input/output behavior stripped out, it won't generalize to predicting outputs. (Seems likely true to me, but uninteresting?)
4. If you train a model on text and images separately, it won't generalize to answering text questions about images. (Seems clearly true to me, but also uninteresting? More interesting would be to train on text and images, then train it to answer questions only about dogs and see if it generalizes to cats. I think this could work with current models and is likely to work if you expand the question training set to be more general, but still excluding cats. E.g. GPT-4 clearly can generalize to identifying novel objects which are described.)
5. An LLM will never be able to answer "three plus five equals". Clearly falsified, and obviously so. Likely they intended additional caveats about the training data??? (Otherwise memorization clearly works...)

For each of these specific cases, it seems pretty silly because clearly you can just train your LLM on a wide variety of stuff. (Similar to humans.) Also, I think you can train humans purely on text and they'd do perfectly fine... (Though I'm not aware of clear experiments here because even blind and deaf people have touch. You'd want to take a blind+deaf person who then only acquires semantics via braille.)

I think you can do experiments which very compellingly argue against this paper, but I don't really see specific claims being falsified.

gwern (2y)

> An octopus trained on just "trivial notes" wouldn't be able to generalize to thoughts on coconut catapults.

I don't believe they say "just". They describe the two humans as talking about lots of things, including but not limited to daily gossip: https://aclanthology.org/2020.acl-main.463.pdf#page=4 The 'trivial notes' part is simply acknowledging that in very densely-sampled 'simple' areas of text (like the sort of trivial notes one might pass back and forth in SMS chat), the superintelligent octopus may well succeed in producing totally convincing text samples. But if you continue on to the next page, you see that they continue giving hostages to fortune - for example, their claims about 'rope'/'coconut'/'nail' are falsified by the entire research area of vision-language models like Flamingo, as well as by the reuse of frozen LLMs for control, as in SayCan. Turns out text-only LLMs already have plenty of visual grounding hidden in them, and their textual latent spaces already align to far above chance levels. So much for that.

> The same octopus, but asked about defending from bears. I claim the same is true as with the prior example.

It's not, because the bear example is again like the coconut catapult - the cast-away islanders are not being chased around by bears constantly and exchanging 'trivial notes' about how to deal with bear attacks! Their point is that this is a sort of causal model and novel utterance that a mere imitation of 'form' cannot grant any 'understanding' of. (As it happens, they are embarrassingly wrong here, because their bear example is not even wrong. They do not give what they think would be the 'right' answer, but whatever answer they gave, it would be wrong - because you are actually supposed to do the exact opposite things for the two major kinds of bears you would be attacked by in North America. Therefore, there is no answer to the question of how to use sticks when 'a bear' chases you. IIRC, if you check bear attack safety guidelines, the actual answer is that if one type attacks you, you should use the sticks to try to defend yourself and appear bigger; but if the other type attacks you, this is the worst thing you can possibly do and you need to instead play dead. And if you fix their question to specify the bear type so there is a correct answer, then the LLMs get it right.) You can gauge the robustness & non-falsification of their examples by noting that after I rebutted them back in 2020, they refused to respond, dropped those examples silently without explanation from their later papers, and started calling me a eugenicist.

> If you train an model on text and images separately, it won't generalize to answering questions about both images. (Seems clearly true to me

I assume you mean 'won't generalize to answering questions about both modalities', and that's false.

> If you train an LLM on just Java code, but with all references to input/output behavior stripped out, it won't generalize to predicting outputs. (Seems likely true to me, but uninteresting?)

I don't know if there's anything on this exact scenario, but I wouldn't be surprised if it could 'generalize'. Although you would need to nail this down a lot more precisely to avoid them wriggling out of it: does this include stripping out all comments, which will often include input/output examples? Is pretraining on natural language text forbidden? What exactly is an 'LLM' and does this rule out all offline RL or model-based RL approaches which try to simulate environments? etc.

ryan_greenblatt (2y)

> I assume you mean 'won't generalize to answering questions about both modalities', and that's false.

Oops, my wording was confusing. I was imagining something like having a transformer which can take in both text tokens and image tokens (patches), but each training sequence is either only images or only text. (Let's also suppose we strip text out of images for simplicity.)

Then, we generalize to a context which has both images and text and ask the model "How many dogs are in the image?"
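As a rough illustration of the setup described here, the following Python sketch only models the data split: every training sequence contains a single modality, while the evaluation context mixes image patches with a text question. The `Token` class, the `patch_*` placeholder ids, and the example sequences are hypothetical; no actual model or tokenizer is involved.

```python
from dataclasses import dataclass
from typing import List, Literal

Modality = Literal["text", "image"]


@dataclass(frozen=True)
class Token:
    modality: Modality
    value: str  # a word, or a placeholder id standing in for an image patch


def is_single_modality(sequence: List[Token]) -> bool:
    """True if every token in the sequence comes from the same modality."""
    return len({token.modality for token in sequence}) <= 1


def text(words: str) -> List[Token]:
    """Helper: turn a whitespace-separated string into text tokens."""
    return [Token("text", word) for word in words.split()]


def image(n_patches: int) -> List[Token]:
    """Helper: build a sequence of placeholder image-patch tokens."""
    return [Token("image", f"patch_{i}") for i in range(n_patches)]


# Training: each sequence is either only text or only image patches.
train_sequences = [text("a dog barks at the mailman"), image(4)]

# Evaluation: image patches followed by a text question about them.
eval_context = image(4) + text("How many dogs are in the image?")

print(all(is_single_modality(seq) for seq in train_sequences))  # True
print(is_single_modality(eval_context))                         # False
```

The open question in the thread is whether a transformer trained only on sequences like `train_sequences` would answer correctly on something like `eval_context`.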

eggsyntax (2y)

> 'Stochastic parrots' 2020 actually does make many falsifiable claims. Like the original stochastic parrots paper even included a number of samples of specific prompts that they claimed LLMs could never do.

The Bender et al paper? "On the Dangers of Stochastic Parrots"? Other sources like Wikipedia cite that paper as the origin of the term.

I'll confess I skipped parts of it (eg the section on environmental costs) when rereading it before posting the above, but that paper doesn't contain 'octopus' or 'game' or 'transcript', and I'm not seeing claims about specific prompts.

eggsyntax (2y)

Oh, no, I see, I think you're referring to Bender and Koller, "Climbing Toward NLU"? I haven't read that one, I'll ~~read~~ skim it now.

eggsyntax (2y)

OK, yeah, Bender & Koller is much more bullet-biting, up to and including denying that any understanding happens anywhere in a Chinese Room. In particular they argue that completing "three plus five equals" is beyond the ability of any pure LM, which is pretty wince-inducing in retrospect.

I really appreciate that in that case they did make falsifiable claims; I wonder whether either author has at any point acknowledged that they were falsified. [Update: Bender seems to have clearly held the same positions as of September 23, based on the slides from this talk.]

ryan_greenblatt (2y)

> I really appreciate that in that case they did make falsifiable claims; I wonder whether either author has at any point acknowledged that they were falsified

AFAICT, the only falsified claim in the paper is the "three plus five equals" claim you mentioned. This is in the appendix, and it doesn't seem that clear to me what they mean by "pure LLM". (Like surely they agree that you can memorize this?)

The other claims are relatively weak and not falsified. See here

M Ls (2y)

I agree with the other comments here suggesting that LLMs working hard enough on an animal's language patterns will develop models of the animal's world based on that language use, and so develop better-contextualized answers to these reading comprehension questions, with no direct experience of the world.

The SVG stuff is an excellent example of there being explicit shortcuts available in the data set. Much of that language use by humans and their embodied world/worldview/worldmaking is not that explicit. To arrive at that tacit knowledge is interesting.

If we are beyond the stochastic parrot, now or soon, are we at the stage of a stochastic maker of organ-grinders and their monkeys (who can churn out explicit lyrics about the language/grammar that animals and their avatars use to build their worlds/markets)?

If so there may be a point where we are left asking, Who is master, the monkey or the organ? And thus we miss the entire point?

Poof. The singularity has left us behind wondering what that noise was.

Are we there yet?

Quentin FEUILLADE--MONTIXI (2y)

I partially agree. I think stochastic parrot-ness is a spectrum. Even humans behave as stochastic parrots sometimes (for me it's when I am tired). I think, though, that we don't really know what an experience of the world really is, and so the only way to talk about it is through an agent's behaviors. The point of this post is that SOTA LLMs are probably farther along the spectrum than most people expect (my impression from experience is that GPT-4 is ~75% of the way between total stochastic parrot and human). It is better than humans at some tasks (some specific ToM experiments like the example in argument 2), but still worse at others (like applying nuance: it can understand nuances, but when you want it to actually be nuanced when it acts, you only see the difference when you ask for different stuff). I think it is important to build a measure of stochastic parrot-ness, as this might be a useful metric for governance and a better proxy for "does it understand the world it is in?" (which I think is important for most of the realistic doom scenarios). Also, these experiments are a way to give a taste of what LLM psychology looks like.

lenivchick (2y)

Given that in the limit (infinite data and infinite parameters in the model) LLMs are world simulators with tiny simulated humans inside writing text on the internet, the pressure applied to that simulated human is not towards understanding our world, but towards understanding that simulated world and being an agent inside it. Which I think gives some hope.

Of course real-world LLMs are far from that limit, and we have no idea which path to that limit gradient descent takes. Eliezer famously argued about the whole "simulator vs predictor" stuff, which I think is relevant to that intermediate state far from the limit.

Also, RLHF applies additional weird pressures, for example a pressure to be aware that it's an AI (or at least pretend that it's aware, whatever that might mean), which makes fine-tuned LLMs actually less safe than raw ones.

^{^}

I didn’t cherry-pick the examples. I tried each of the examples at least 10 times with a similar setup, and they all worked for GPT-4, although I selected only a portion of the most interesting scenarios for this post.

^{^}

On the day I did this demo, OpenAI rolled out ChatGPT-4's image-reading capability. So I decided to do those examples on the playground with gpt-4-0613 to show that it can do this even without ever having seen anything, afaik.

^{^}

On a previous version of GPT-4 (around early September 2023), it did guess correctly on the first try, but I can’t reproduce this with any current version in the playground.

^{^}

The objects were generated with GPT-4, and I made manual edits to try to reduce the chances that these images were in the training data. I tested around 15 simple objects, which all worked. I also tried 4 other complex objects, which kind of worked but not perfectly (like the articulated lamp guessed as a street lamp).

^{^}

It could be interesting to investigate this ability further. What is learned by heart? What kind of algorithms do they build internally? What is the limit? …

^{^}

If it cannot make good predictions, this would indeed imply a weaker world model (or Theory of Mind), but bad predictions alone do not refute its existence.

^{^}

I left this example in because it is the first one I made and I used it quite a lot during debates.

^{^}

Actually, clues about the non-bidirectional encoding of knowledge were discussed by Jacques in his critique of the ROME/MEMIT papers.
