Cross-posted from New Savanna.

But it may also be flat-out wrong. We’ll see when we get a better idea of how inference works in the underlying language model. 

* * * * * 

Yes, I know that ChatGPT is trained by having it predict the next word, and the next, and the next, for billions and billions of words. The result of all that training is that ChatGPT builds up a complex structure of weights on the 175 billion parameters of its model. It is that structure that emits word after word during inference. Training and inference are two different processes, but that point is not well-made in accounts written for the general public. 
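To make the training/inference distinction concrete, here is a deliberately tiny sketch. It is a bigram counter, nothing like a transformer, and the corpus and function names (`train`, `generate`) are invented for illustration. The only point is that one process builds a frozen structure from next-word prediction, and a separate process then uses that structure to emit word after word.

```python
import random
from collections import Counter, defaultdict


def train(corpus):
    """Training: tally which word follows which. The tallies are the 'weights'."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words, rng):
    """Inference: the frozen tallies emit one word after another."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])
        if not followers:  # no known continuation, so stop
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out


# Toy corpus (an assumption for illustration only).
corpus = "once upon a time there was a dragon and once upon a time there was a knight"
model = train(corpus)                               # happens once
story = generate(model, "once", 8, random.Random(0))  # happens at every use
print(" ".join(story))
```

Nothing in `generate` changes the tallies; whatever structure training left behind is simply consulted, word after word.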

Let's get back to the main thread.

I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story telling procedure’ and that procedure conditions its word choice. Just what that procedure is, and how it works, I don’t know, nor do I know how it is invoked. I do know that it is not invoked by the phrase “once upon a time,” since ChatGPT doesn’t always use that phrase when telling a story. Rather, that phrase is called up through the procedure.

Consider an analogy from jazz. When I set out to improvise a solo on, say, “A Night in Tunisia,” I don’t know what notes I’m going to play from moment to moment, much less do I know how I’m going to end, though I often know when I’m going to end. How do I know that? That’s fixed by the convention in place at the beginning of the tune; that convention says how many choruses you’re going to play. So, I’ve started my solo. My note choices are, of course, conditioned by what I’ve already played. But they’re also conditioned by my knowledge of when the solo ends.

Something like that must be going on when ChatGPT tells a story. It’s not working against time in the way a musician is, but it does have a sense of what is required to end the story. And it knows what it must do, what kinds of events must take place, in order to get from the beginning to the end. In particular, I’ve been working with stories where the trajectories have five segments: Donné, Disturb, Plan, Execute, Celebrate. The whole trajectory is ‘in place’ when ChatGPT begins telling the story. If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.

Nor is it just stories. Surely it enacts a different trajectory when you ask it a factual question, or request it to give you a recipe (like I recently did, for Cornish pasty), or generate some computer code.

With that in mind, consider a passage from a recent video by Stephen Wolfram (note: Wolfram doesn’t start speaking until about 9:50):

Starting at roughly 12:16, Wolfram explains:

It is trying to write reasonable, it is trying to take an initial piece of text that you might give and is trying to continue that piece of text in a reasonable human-like way, that is sort of characteristic of typical human writing. So, you give it a prompt, you say something, you ask something, and, it’s kind of thinking to itself, “I’ve read the whole web, I’ve read millions of books, how would those typically continue from this prompt that I’ve been given? What’s the reasonable expected continuation based on some kind of average of a few billion pages from the web, a few million books and so on.” So, that’s what it’s always trying to do, it’s always trying to continue from the initial prompt that it’s given. It’s trying to continue in a statistically sensible way.

Let’s say that you had given it, you had said initially, “The best thing about AI is its ability to...” Then ChatGPT has to ask, “What’s it going to say next.”

I don’t have any problem with that (which, BTW, is similar to a passage near the beginning of his recent article, What Is ChatGPT Doing … and Why Does It Work?). Of course ChatGPT is “trying to continue in a statistically sensible way.” We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside. My misgivings set in with this next statement:

Now one thing I should explain about ChatGPT, that’s kind of shocking when you first hear about this. Is, those essays that it’s writing, it’s writing at one word at a time. *As it writes each word it doesn’t have a global plan about what’s going to happen.* It’s simply saying “what’s the best word to put down next based on what I’ve already written?”

It's the italicized passage that I find problematic. That story trajectory looks like a global plan to me. It is a loose plan, it doesn’t dictate specific sentences or words, but it does specify general conditions which are to be met.

Now, much later in his talk Wolfram will say something like this (I don’t have the time, I’m quoting from his paper):

If one looks at the longest path through ChatGPT, there are about 400 (core) layers involved—in some ways not a huge number. But there are millions of neurons—with a total of 175 billion connections and therefore 175 billion weights. And one thing to realize is that every time ChatGPT generates a new token, it has to do a calculation involving every single one of these weights.

If ChatGPT visits every parameter each time it generates a token, that sure looks “global” to me. What is the relationship between these global calculations and those story trajectories? I surely don’t know. 

Perhaps it’s something like this: A story trajectory is a valley in the LLM’s attractor landscape. When it tells a story it enters the valley at one end and continues through to the end, where it exits the valley. That long circuit that visits each of those 175 billion weights in the course of generating each token, that keeps it in the valley until it reaches the other end. 

I am reminded, moreover, of the late Walter Freeman’s conception of consciousness as arising through discontinuous whole-hemisphere states of coherence succeeding one another at a “frame rate” of 6 Hz to 10 Hz – something I discuss in “Ayahuasca Variations” (2003). It’s the whole hemisphere aspect that’s striking (and somewhat mysterious) given the complex connectivity across many scales and the relatively slow speed of neural conduction.

* * * * *

I was alerted to this issue by a remark made at the blog, Marginal Revolution. On December 20, 2022, Tyler Cowen had linked to an article by Murray Shanahan, Talking About Large Language Models. A commenter named Nabeel Q remarked:

LLMs are *not* simply “predicting the next statistically likely word”, as the author says. Actually, nobody knows how LLMs work. We do know how to train them, but we don’t know how the resulting models do what they do. 

Consider the analogy of humans: we know how humans arose (evolution via natural selection), but we don’t have perfect models of how humans work; we have not solved psychology and neuroscience yet! A relatively simple and specifiable process (evolution) can produce beings of extreme complexity (humans).

Likewise, LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet, Github, Wikipedia etc.) but the resulting 175 billion parameter model is extremely inscrutable.

So the author is confusing the training process with the model. It’s like saying “although it may appear that humans are telling jokes and writing plays, all they are actually doing is optimizing for survival and reproduction”. This fallacy occurs throughout the paper.

This is why the field of “AI interpretability” exists at all: to probe large models such as LLMs, and understand how they are producing the incredible results they are producing.

I don’t have any reason to think Wolfram was subject to that confusion. But I think many people are. I suspect that the general public, including many journalists reporting on machine learning, aren’t even aware of the distinction between training the model and using it to make inferences. One simply reads that ChatGPT, or any other comparable LLM, generates text by predicting the next word.

This mis-communication is a MAJOR blunder.


Maybe I don't understand what exactly your point is, but I'm not convinced. AFAIK, it's true that GPT has no state outside of the list of tokens so far. Contrast to your jazz example, where you, in fact, have hidden thoughts outside of the notes played so-far. I think this is what Wolfram and others are saying when they say that "GPT predicts the next token". You highlight "it doesn’t have a global plan about what’s going to happen" but I think a key point is that whatever plan it has, it has to build it up entirely from "Once upon a" and then again, from scratch, at "Once upon a time," and again and again.  Whatever plan it makes is derived entirely from "Once upon a time," and could well change dramatically at "Once upon a time, a" even if " a" was its predicted token. That's very different from what we think of as a global plan that a human writing a story makes.

The intuition of "just predicting one token ahead" makes useful explanations like why the strategy of having it explain itself first and then give the answer works. I don't see how this post fits with that observation or what other observations it clarifies.

I don't think the human concept of 'plan' is even a sensible concept to apply here. What it has is in many ways very much like a human plan, and in many other ways utterly unlike a human plan.

One way in which you could view them as similar is that just as there is a probability distribution over single token output (which may be trivial for zero temperature), there is a corresponding probability distribution over all sequences of tokens. You could think of this distribution as a plan with decisions yet to be made. For example, there may be some small possibility of continuing to "Once upon a horse, you may be concerned about falling off", but by emitting " time" it 'decides' not to pursue such options and mostly focuses on writing a fairy tale instead.
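A minimal sketch of that idea, with invented numbers: the table `NEXT` and the helper names are hypothetical, and a real model computes these conditionals rather than looking them up. The probability of a whole sequence is the product of the per-token conditionals, and emitting a token amounts to conditioning the sequence distribution on it.

```python
from math import prod

# Hypothetical next-token probabilities, keyed by the prefix seen so far.
NEXT = {
    (): {"Once": 1.0},
    ("Once",): {"upon": 1.0},
    ("Once", "upon"): {"a": 1.0},
    ("Once", "upon", "a"): {"time": 0.98, "horse": 0.02},
}


def sequence_probability(tokens):
    """Chain rule: multiply the conditional probability of each token given its prefix."""
    return prod(NEXT[tuple(tokens[:i])].get(tok, 0.0) for i, tok in enumerate(tokens))


def condition_on(prefix, sequences):
    """Renormalise over the sequences still possible once `prefix` has been emitted."""
    kept = {s: sequence_probability(s) for s in sequences if s[:len(prefix)] == prefix}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


fairy = ("Once", "upon", "a", "time")
horse = ("Once", "upon", "a", "horse")

before = condition_on(("Once", "upon", "a"), [fairy, horse])  # both options still open
after = condition_on(fairy, [fairy, horse])                   # " time" emitted: horse is gone
```

In this picture the 'decision' is nothing more than the renormalisation in `condition_on`.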

However, this future structure is not explicitly modelled anywhere, as far as I know. It's possible that some model might have a "writing a fairy tale" neuron in there somewhere, linked to others that represent describable aspects of the story so far and others yet to come, and which increases the weighting of the token " time" after "Once upon a". I doubt there's anything so directly interpretable as that, but I think it's pretty cer... (read more)

2Bill Benzon
However, this future structure is not explicitly modelled anywhere, as far as I know. It's possible that some model might have a "writing a fairy tale" neuron in there somewhere, linked to others that represent describable aspects of the story so far and others yet to come, and which increases the weighting of the token " time" after "Once upon a". I doubt there's anything so directly interpretable as that, but I think it's pretty certain that there are some structures in activations representing clusters of continuations past the current generation token. More like a fairy tale region than a neuron. And once the system enters that region it stays there until the story is done. Should we call those structures "plans" or not? In the context of this discussion, I can live with that.
I believe the primary point is to dissuade people that are dismissive of LLM intelligence. Predicting the next token is not as simple as it sounds, it requires not only understanding the past but also consideration of the future. The fact it re-imagines this future every token it writes is honestly even more impressive, though it is clearly a limitation in terms of keeping a coherent idea. 
1Max Loh
Whether it has a global "plan" is irrelevant as long as it behaves like someone with a global plan (which it does). Consider the thought experiment where I show you a block of text and ask you to come up with the next word. After you come up with the next word, I rewind your brain to before the point where I asked you (so you have no memory of coming up with that word) and repeat ad infinitum. If you are skeptical of the "rewinding" idea, just imagine a simulated brain and we're restarting the simulation each time. You couldn't have had a global plan because you had no memory of each previous step. Yet the output would still be totally logical. And as long as you're careful about each word choice at each step, it is scientifically indistinguishable from someone with a "global plan". That is similar to what GPT is doing.
1Bill Benzon
""Once upon a time," and could well change dramatically at "Once upon a time, a" even if " a" was its predicted token. That's very different from what we think of as a global plan that a human writing a story makes." Why does it tell the same kind of story every time I prompt it: "Tell me a story"? And I'm talking about different sessions, not several times in one session. It takes a trajectory that has same same segments. It starts out giving initial conditions. Then there is some kind of disturbance. After that the protagonist thinks and plans and travels to the point of the disturbance. We then have a battle, with the protagonist winning. Finally, there is a celebration. That looks like a global plan to me. Such stories (almost?) always have fantasy elements, such as dragons, or some magic creature. If you want to eliminate those, you can do so: "Tell me a realistic story." "Tell me a sad story," is a different kind of story. And if you prompt it with: "Tell me a true story", that's still different, often very short, only a paragraph or three. I'm tempted to say, "forget about a human global plan," but, I wonder. The global plan a human makes is, after all, a consequence of that person's past actions. Such a global plan isn't some weird emanation from the future.  Furthermore, it's not entirely clear why a person's 'hidden thoughts' should differentiate us from an humongous LLM. Just what do you mean by 'hidden' thoughts? Where do they hide? Under the bed, in the basement, perhaps somewhere deep in the woods, maybe? I'm tempted to say that there are no such things as hidden thoughts, that's just a way of talking about something we don't understand.

Suppose I write the first half of a very GPT-esque story. If I then ask GPT to complete that story, won't it do exactly the same structure as always? If so, how can you say that came from a  plan - it didn't write the first half of the story! That's just what stories look like. Is that more surprising than a token predictor getting basic sentence structure correct?

For hidden thoughts, I think this is very well defined. It won't be truly 'hidden', since we can examine every node in GPT, but we know for a fact that GPT is purely a function of the current stream of tokens (unless I am quite mistaken!). A hidden plan would look like some other state that GPT carries from token to token that is not output. I don't think OpenAI engineers would have a hard time making such a model and it may then really have a global plan that travels from one token to the next (or not; it would be hard to say). But how could GPT? It has nowhere to put the plan except for plain sight.
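The "purely a function of the current stream of tokens" claim can be sketched with a toy stand-in. Everything here is hypothetical: `next_token_distribution` is a hash-based fake, not a language model, and greedy decoding is assumed so that the run is deterministic. The sketch only shows the structural point that a continuation cannot depend on who wrote the prefix.

```python
import hashlib

VOCAB = ["once", "upon", "a", "time", "dragon", "."]


def next_token_distribution(tokens):
    """Hypothetical stand-in for a GPT-style model: a pure function of the token list."""
    digest = hashlib.sha256(" ".join(tokens).encode("utf-8")).digest()
    raw = [digest[i] + 1 for i in range(len(VOCAB))]  # arbitrary positive scores
    total = sum(raw)
    return {tok: r / total for tok, r in zip(VOCAB, raw)}


def greedy_continue(tokens, steps):
    """Append the most probable token `steps` times; no state survives between steps."""
    tokens = list(tokens)
    for _ in range(steps):
        dist = next_token_distribution(tokens)
        tokens.append(max(dist, key=dist.get))
    return tokens


# Case 1: the toy model writes everything after "once".
full = greedy_continue(["once"], 6)
# Case 2: someone else hands it the first four tokens of that same text.
resumed = greedy_continue(full[:4], 3)
print(resumed == full)
```

Real implementations do cache intermediate activations between tokens, but that cache can be recomputed from the tokens alone, so it is an optimization rather than extra private state.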

Or: does AlphaGo have a plan? It explicitly considers future moves, but it does just as well if you give it a Go board in a particular state X as it would if it played a game that happened to reach state X. If there is a 'plan' that it made, it wrote that plan on the board and nothing is hidden. I think it's more helpful and accurate to describe AlphaGo as "only" picking the best next move rather than planning ahead - but doing a good enough job of picking the best next move means you pick moves that have good follow up moves.

2Bill Benzon
For hidden thoughts, I think this is very well defined. Not for humans, and that's what I was referring to. Sorry about the confusion.  "Thought" is just a common-sense idea. As far as I know, we don't have a well-defined concept of that that's stated in terms of brain states. Now, I believe Walter Freeman has conjectured that thoughts reflect states of global coherence across a large swath of cortex, perhaps a hemisphere, but that's a whole other intellectual world. If so, how can you say that came from a  plan - it didn't write the first half of the story! But it read it, no? Why can't it complete it according to its "plan," since it has no way of knowing the intentions of the person who wrote the first half?  Let me come at this a different way. I don't know how many times I've read articles of the "computers for dummies" type where it said it's all just ones and zeros. And that's true. Source code may be human-readable, but when it's compiled all the comments are stripped out and the rest is converted to ones and zeros. What does that tell you about a program? It depends on your point of view and what you know. From a very esoteric and abstract point of view, it tells you a lot. From the point of view of someone reading Digital Computing for Dummies, it doesn't tell them much of anything. I feel a bit like that about the assertion that LLMs are just next-token-predictors. Taking that in conjunction with the knowledge that they're trained on zillions of tokens of text, those two things put together don't tell you much either. If those two statements were deeply informative, then mechanistic interpretation would be trivial. It's not. Saying that LLMs are next-token predictors puts a kind of boundary on mechanistic interpretation, but it doesn't do much else. And saying it was trained on all these texts, that doesn't tell you much about the structure the model has picked up. What intellectual work does that statement do?
I gave one example of the “work” this does: that GPT performs better when prompted to reason first rather than state the answer first. Another  example is: On the contrary, you mainly seem to be claiming that thinking of LLMs as working one token at a time is misleading, but I’m not sure I understand any examples of misleading conclusions that you think people draw from it. Where do you think people go wrong?
2Bill Benzon
Over there in another part of the universe there are people who are yelling that LLMs are "stochastic parrots." Their intention is to discredit LLMs as dangerous evil devices. Not too far away from those folks are those saying it's "autocomplete on steroids." That's only marginally better. Saying LLMs are "next word predictors" feeds into that. Now, I'm talking about rhetoric here, not intellectual substance. But rhetoric matters. There needs to be a better way of talking about these devices for a general audience.
1Bill Benzon
Oh, thanks for the link. It looks interesting.
1Nikola Smolenski
Perhaps you could simply ask ChatGPT? "Please tell me a story without making any plans about the story beforehand." vs "Please make a plan for a story, then tell me the story, and attach your plan at the end of the story." Will the resulting stories differ, and how? My prediction: the plan attached at the end of the story won't be very similar to the actual story.
1Bill Benzon
I’ve run the experiment. The first story seemed typical, though longer than the ones it was producing in January. It’s running the Feb 13 version. But that’s been generally the case. Of course I have no way of knowing whether or not it actually did the requested planning activity. I will note, however, that when I give it a minimal prompt (“Tell me a story.”) it has always, in 2 months, produced a story with fairy-tale elements. This prompt is obviously more elaborate, but it contains nothing to specify the type of story and so is, in that sense, like the minimal prompt. Here’s the story: I then refreshed the page and ran your second prompt. The result is not what you predicted. It responded by first posting its plan. It then told the story, which matched the plan. It then started to list the plan at the end, as the prompt requested, but stopped cold while listing the characters. I’m not sure what to conclude about that.  I do like the idea of asking it to plan before telling the story. Here's the response:
1Nikola Smolenski
Perhaps it wouldn't have written the plan first if you explicitly asked it not to. It guessed that you'd want it, I guess. Very interesting! If it can write a story plan, and a story that follows the plan, then it can write according to a plan, even if it usually doesn't. But if these responses are typical, and stories written without a plan are similar to stories written with a plan, I take it to mean that all stories have a plan, which further means that it didn't actually follow your first prompt. It either didn't "want" to write a story without a plan, or, more likely, it couldn't, which means that not only does ChatGPT write according to a plan, it can't write in any other way! Another interesting question is how far this kind of questioning could be taken. What if you ask it to, for example, write a story and, after each paragraph, describe its internal processes that led it to writing that paragraph?
1Bill Benzon
"What if you ask it to , for example, write a story and, after each paragraph, describe its internal processes that led it to writing that paragraph?" Two possibilities: 1) It would make something up. 2) I would explain that it's an AI yada yada...
Human thoughts are "hidden" in the sense that they exist separately from the text being written. They will correlate somewhat with that text of course, but they aren't completely determined by it. The only state for GPT-like models is that which is supplied in the previous text. They don't have any 'private' state at all, not even between one token and the next. This is a very clear difference, and does in both principle and practice constrain their behaviour.
They can compute a state prior to each generated token and they can choose a token that signals a preservation of this state.

One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT3); the former has no pretenses of doing only next token prediction.

2Bill Benzon
But the core LLM is pretty much the same, no? It doesn't have some special sauce that allows it to act differently.
Assuming that it was fine tuned with RLHF (which OpenAI has hinted at with much eyebrow wiggling but not to my knowledge confirmed) then it does have some special sauce. Roughly,
- if it's at the beginning of a story,
- and the base model predicts ["Once": 10%, "It": 10%, ... "Happy": 5% ...]
- and then during RLHF, the 10% of the time it starts with "Once" it writes a generic story and gets lots of reward, but when it outputs "Happy" it tries to write in the style of Tolstoy and bungles it, getting little reward
=> it will update to output "Once" more often in that situation.
The KL divergence between successive updates is bounded by the PPO algorithm, but over many updates it can shift from ["Once": 10%, "It": 10%, ... "Happy": 5% ...] to ["Once": 90%, "It": 5%, ... "Happy": 1% ...] if the final results from starting with "Once" are reliably better.
It's hard to say if that means it's planning to write a generic story because of an agentic desire to become a hack and please the masses, but certainly it's changing its output distribution based on what happened many tokens in the future.
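The shift in the first-token distribution can be sketched numerically. This is a toy under loud assumptions: it uses a plain expected policy-gradient update on a softmax over four options, not PPO, with no KL bound and no reward model. The "Other" bucket and all reward values are invented so that the probabilities sum to one and the story-level reward has something to act on.

```python
import math


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


# Hypothetical base-model probabilities for the first token of a story.
# "Other" lumps together the rest of the vocabulary.
start_probs = {"Once": 0.10, "It": 0.10, "Happy": 0.05, "Other": 0.75}

# Assumed average reward for the WHOLE story that follows each opening token.
reward = {"Once": 1.0, "It": 0.5, "Happy": 0.1, "Other": 0.4}

logits = {t: math.log(p) for t, p in start_probs.items()}
learning_rate = 0.5

for _ in range(200):
    probs = softmax(logits)
    baseline = sum(probs[t] * reward[t] for t in probs)  # expected reward
    for t in logits:
        # Expected policy gradient for a softmax policy: p_t * (r_t - baseline).
        logits[t] += learning_rate * probs[t] * (reward[t] - baseline)

final = softmax(logits)
print({t: round(p, 3) for t, p in final.items()})
```

The reward is only ever attached to the finished story, yet the quantity that moves is the probability of the very first token, which is the sense in which the output distribution depends on what happened many tokens later.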

One wrinkle is that (sigh) it's not just a KL constraint anymore: now it's a KL constraint and also some regular log-likelihood training on original raw data to maintain generality:

A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT-3, and train on this data using the normal log likelihood maximization.[ We found this approach more effective than simply increasing the KL coefficient.] This roughly maintains performance on safety and human preferences, while mitigating performance decreases on academic tasks, and in several cases even surpassing the GPT-3 baseline.

Also, I think you have it subtly wrong: it's not just a KL constraint each step. (PPO al... (read more)

0Bill Benzon
Interesting. I'll have to think about it a bit. But I don't have to think at all to nix the idea of agentic desire.

@Bill Benzon:  A thought experiment. Suppose you say to ChatGPT "Think of a number between 1 and 100, but don't tell me what it is. When you've done so, say 'Ready' and nothing else. After that, I will ask you yes / no questions about the number, which you will answer truthfully."

After ChatGPT says "Ready", do you believe a number has been chosen? If so, do you also believe that whatever "yes / no" sequence of questions you ask, they will always be answered consistently with that choice? Put differently, you do not believe that the particular choice of questions you ask can influence what number was chosen?

FWIW, I believe that no number gets chosen when ChatGPT says "Ready," that the number gets chosen during the questions (hopefully consistently) and that, starting ChatGPT from the same random seed and otherwise assuming deterministic execution, different sequences of questions or different temperatures or different random modifications to the "post-Ready seed" (this is vague but I assume comprehensible) could lead to different "chosen numbers."

(The experiment is non-trivial to run since it requires running your LLM multiple times with the same seed or otherwise completely copying the state after the LLM replies "Ready.")
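The shape of that experiment can be sketched with a toy. The 'model' below is hypothetical and simply embodies the hypothesis stated above: it stores no number and answers each question from the transcript alone, always staying consistent with its earlier answers. It says nothing about what ChatGPT actually does. Questions are restricted to "is it greater than t?" to keep the sketch short.

```python
import random


def consistent_numbers(transcript):
    """Numbers in 1..100 that agree with every (threshold, answer) pair so far."""
    return [n for n in range(1, 101)
            if all((n > t) == ans for t, ans in transcript)]


def answer(transcript, threshold, rng):
    """Stateless toy 'model': answers 'is it greater than threshold?' with no stored number."""
    candidates = consistent_numbers(transcript)
    yes = sum(n > threshold for n in candidates)
    # Say "yes" with probability equal to the share of candidates above the threshold,
    # so an answer that would contradict earlier answers is never given.
    return rng.random() < yes / len(candidates)


def play(thresholds, seed):
    """Replay from the same post-'Ready' state (same seed) with a given question order."""
    rng = random.Random(seed)
    transcript = []
    for t in thresholds:
        transcript.append((t, answer(transcript, t, rng)))
    return consistent_numbers(transcript)


a = play([50, 25, 75], seed=0)
b = play([10, 90, 50], seed=0)
print(a[0], a[-1], "vs", b[0], b[-1])
```

Each run is internally consistent, but nothing in the toy forces `a` and `b` to overlap, since no number was fixed at "Ready." Run against a real LLM, the same harness would need a way to restore the exact state after "Ready," which is the hard part noted above.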

This is a very interesting scenario, thank you for posting it! I suspect that ChatGPT can't even be relied upon to answer in a manner that is consistent with having chosen a number. In principle a more capable LLM could answer consistently, but almost certainly won't "choose a number" at the point of emitting "Ready" (even with temperature zero). The subsequent questions will almost certainly influence the final number, and I suspect this may be a fundamental limitation of this sort of architecture.
2Bill Benzon
Very interesting. I suspect you are right about this:
2rif a. saurous
But if I am right and ChatGPT isn't choosing a number before it says "Ready," why do you think that ChatGPT "has a plan?" Is the story situation crucially different in some way? 
I think there is one difference: in the "write a story" case, the model subsequently generates the text without further variable input. If the story is written in pieces with further variable prompting, I would agree that there is little sense in which it 'has a plan'. To the extent that it could be said to have a plan, that plan is radically altered in response to every prompt. I think this sort of thing is highly likely for any model of this type with no private state, though not essential. It could have a conditional distribution of future stories that is highly variable in response to instructions about what the story should contain and yet completely insensitive to mere questions about it, but I think that's a very unlikely type of model. Systems with private state are much more likely to be trainable to query that state and answer questions about it without changing much of the state. Doing the same with merely an enormously high dimensional implicit distribution seems too much of a balancing act for any training regimen to target.
4rif a. saurous
Suppose we modify the thought experiment so that we ask the LLM to simplify both sides of the "pick a number between 1 and 100" / "ask yes/no questions about the number." Now there is no new variable input from the user, but the yes/no questions still depend on random sampling. Would you now say that the LLM has chosen a number immediately after it prints out "Ready?"
Chosen a number: no (though it does at temperature zero). Has something approximating a plan for how the 'conversation' will go (including which questions are most favoured at each step and go with which numbers), yes to some extent. I do think "plan" is a misleading word, though I don't have anything better.
4rif a. saurous
Thank you, this is helpful.  I think the realization I'm coming to is that folks on this thread have a shared understanding of the basic mechanics (we seem to be agreed on what computations are occurring, we don't seem to be making any different predictions), and we are unsure about interpretation. Do you agree? For myself, I continue to maintain that viewing the system as a next-word sampler is not misleading, and that saying it has a "plan" is misleading --- but I try to err very much on the side of not anthropomorphizing / not taking an intentional stance (I also try to avoid saying the system "knows" or "understands" anything). I do agree that the system's activation cache contains a lot of information that collectively biases the next word predictor towards producing the output it produces; I see how someone might reasonably call that a "plan" although I choose not to.
1Bill Benzon
FWIW, I'm not wedded to "plan." And as for anthropomorphizing, there are many times when anthropomorphic phrasing is easier and more straightforward, so I don't want to waste time trying to work around it with more complex phrasing. The fact is these devices are fundamentally new and we need to come up with new ways of talking about them. That's going to take awhile.
2Bill Benzon
Read the comments I've posted earlier today. The plan is smeared through the parameter weights.
1rif a. saurous
Then wouldn't you believe that in the case of my thought experiment, the number is also smeared through the parameter weights? Or maybe it's merely the intent to pick a number later that's smeared through the parameter weights?
2Bill Benzon
Lots of things are smeared through the number weights. I've prompted ChatGPT with "tell me a story" well over a dozen times, independently in separate sessions. On three occasions I've gotten a story with elements from "Jack and the beanstalk." There's the name, the beanstalk, and the giant. But the giant wasn't blind and no "fee fi fo fum." Why that story three times? I figure it's more or less an arbitrary fact of history and that seems to be particularly salient for ChatGPT.
1Max Loh
I believe this is a non-scientific question, similar in vein to philosophical zombie questions. Person A says "gpt did come up with a number by that point" and person b says "gpt did not come up with a number by that point", but as long as it still outputs the correct responses after that point, neither person can be proven correct. This is why real-world scientific results of assessing these AI capabilities are way more informative than intuitive ideas of what they're supposed to be able to do (even if they're only programmed to predict the next word, it's wrong to assume a priori that a next-word predictor is incapable of specific tasks, or declare these achievements to be "faked intelligence" when it gets it right).

Thanks, you've put a deep vague unease of mine into succinct form. 

And of course, now I come to think about it, a very wise man said it even more succinctly a very long time ago:

Adaptation Executors, Not Fitness Maximizers.

I don't think "Adaptations Executors VS Fitness Maximizers" is a good way of thinking about this. All of the behaviors described in the post can be understood as a consequence of next-word prediction, it's just that what performing extremely well at next-word prediction looks like is counterintuitive. There's no need to posit a difference in inner/outer objective.
2Matt Goldenberg
Is there a reason to suspect an exact match between inner and outer objective?
An exact match? No. But the observations in this post don't point towards any particular mismatch, because the behaviors described would be seen even if the inner objective was perfectly aligned with the outer.

I think that what you're saying is correct, in that ChatGPT is trained with RLHF, which gives feedback on the whole text, not just the next token. It is true that GPT-3 outputs the next token and is trained to be myopic. But your arguments seem suspect to me: just because a model takes steps that are in practice part of a sensible long-term plan does not mean that the model is intentionally forming a plan. It may just be that each step is the natural thing to myopically follow from what came before.

0Bill Benzon
Oh, I have little need for the word “plan,” but it’s more convenient than various circumlocutions. Whatever it is that I’ve been calling a plan is smeared over those 175B weights and, as such, is perfectly accessible to next-token myopia. (Still, check out this tweet stream by Charles Wang.) It’s just that, unless you’ve got some sophistication – and I’m slowly moving in that direction – saying that transformers work by next-token prediction is about as informative as saying that a laptop works by shuffling data and instructions back and forth between the processor and memory. Both statements are true, but not very informative. And when “next-token-prediction” appears in the vicinity of “stochastic parrots” or “auto-complete on steroids,” then we’ve got trouble. In that context the typical reader of, say, The New York Times or The Atlantic, is likely to think of someone flipping coins or of a bunch of monkeys banging away on typewriters. Or, maybe they’ll think of someone throwing darts at a dictionary or reaching blindly into a bag full of words, images which aren’t very useful either. Of course, here in this forum, things are different. Which is why I posted that piece here. The discussion has helped me a lot. But it’s going to take a lot of work to figure out how to educate the general reader. Thanks for the comment.

I think a key idea related to this topic and not yet mentioned in the comments (maybe because it is elementary?) is the probabilistic chain rule. It is a basic "theorem" of probability which, in our case, shows that the procedure of always sampling the next word conditioned on the previous words is mathematically equivalent to sampling from the joint probability distribution of complete human texts. To me this almost fully explains why LLMs' outputs seem to have been generated with global information in mind. What is missing is to see why our intuition of "me... (read more)
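
To make the chain-rule point concrete, here is a minimal sketch in Python. The chain rule says P(w1, ..., wn) is the product of the next-token conditionals P(wi | w1, ..., wi-1). The four-text "language" in JOINT and the helper names are made up for illustration; nothing here comes from a real model.

```python
from math import prod

# Hypothetical toy "language": a joint distribution over complete 3-token texts.
JOINT = {
    ("once", "upon", "a"): 0.5,
    ("once", "there", "was"): 0.2,
    ("the", "cat", "sat"): 0.2,
    ("the", "cat", "ran"): 0.1,
}


def prefix_prob(prefix):
    """Probability that a complete text starts with `prefix` (a tuple of tokens)."""
    n = len(prefix)
    return sum(p for seq, p in JOINT.items() if seq[:n] == prefix)


def next_token_prob(token, prefix):
    """Conditional probability of `token` given the tokens that came before it."""
    return prefix_prob(prefix + (token,)) / prefix_prob(prefix)


def autoregressive_prob(seq):
    """Probability of `seq` when it is built one token at a time."""
    return prod(next_token_prob(seq[i], seq[:i]) for i in range(len(seq)))


for text, joint_p in JOINT.items():
    print(text, joint_p, autoregressive_prob(text))
```

For every text, the product of next-token conditionals agrees with the joint probability up to floating-point rounding. In this toy setting the one-token-at-a-time procedure already carries all the "global" information in the joint distribution.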

2Bill Benzon
Yeah, but all sorts of elementary things elude me. So thanks for the info.

I think the state is encoded in activations. There is a paper which explains that although Transformers are feed-forward transducers, in the autoregressive mode they do emulate RNNs:

Section 3.4 of "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention".

So, the set of current activations encodes the hidden state of that "virtual RNN".
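
A rough sketch of the idea behind that section, as I read it: with linear attention, the causal attention output can be computed either over all earlier positions at once or by updating a fixed-size running state, and the two agree. The feature map elu(x) + 1 follows the paper; the function names and toy inputs are my own assumptions, and this covers linear attention only, not the softmax attention used in GPT models.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, applied elementwise; always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def parallel_form(qs, ks, vs):
    """Causal linear attention computed by looking back over all earlier positions."""
    outs = []
    for i, q in enumerate(qs):
        fq = phi(q)
        weights = [dot(fq, phi(ks[j])) for j in range(i + 1)]
        norm = sum(weights)
        outs.append([sum(w * vs[j][d] for j, w in enumerate(weights)) / norm
                     for d in range(len(vs[0]))])
    return outs


def recurrent_form(qs, ks, vs):
    """Same outputs, computed RNN-style from a running state (S, z)."""
    dk, dv = len(ks[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) v^T
    z = [0.0] * dk                       # running sum of phi(k)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
        fq = phi(q)
        norm = dot(fq, z)
        outs.append([sum(fq[a] * S[a][d] for a in range(dk)) / norm
                     for d in range(dv)])
    return outs


# Toy inputs, chosen arbitrarily for illustration.
qs = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
ks = [[1.0, 0.0], [-0.5, 2.0], [0.7, -0.7]]
vs = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]
print(parallel_form(qs, ks, vs))
print(recurrent_form(qs, ks, vs))
```

The pair (S, z) plays the role of the hidden state of the "virtual RNN": it has a fixed size no matter how many tokens have been processed.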

This might be relevant to some of the discussion threads here...

1Bill Benzon

I don't think I understand the problem correctly, but let me try to rephrase this. I believe the key part is the claim whether or not ChatGPT has a global plan? Let's say we run ChatGPT one output at a time, every time appending the output token to the current prompt and calculating the next output. This ignores some beam search shenanigans that may be useful in practice, but I don't think that's the core issue here.

There is no memory between calculating the first and second token. The first time you give ChatGPT the sequence "Once upon a" and it predicts ... (read more)
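
The loop described above can be sketched as follows. TABLE is a made-up stand-in for the model: a lookup from the last two tokens to the next one. A real LLM conditions on the whole prefix through its weights; the only point here is that the sole state carried between steps is the token sequence itself.

```python
# Hypothetical toy "model": maps the last (up to) two tokens to the next token.
TABLE = {
    ("Once",): "upon",
    ("Once", "upon"): "a",
    ("upon", "a"): "time",
    ("a", "time"): "there",
    ("time", "there"): "was",
    ("there", "was"): "a",
    ("was", "a"): "dragon",
    ("a", "dragon"): ".",
}
END = "<end>"


def next_token(tokens):
    """Pure function of the prefix: no hidden memory between calls."""
    return TABLE.get(tuple(tokens[-2:]), END)


def generate(prompt, max_new=20):
    """Append one predicted token at a time, feeding the result back in."""
    tokens = list(prompt)
    for _ in range(max_new):
        tok = next_token(tokens)
        if tok == END:
            break
        tokens.append(tok)
    return tokens


print(" ".join(generate(["Once"])))
print(" ".join(generate(["Once", "upon", "a"])))
```

Starting from "Once" or restarting from "Once upon a" yields the same text. Whatever looks like a plan has to be recoverable from the prefix plus the fixed table, which here stands in for the weights.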

I'm not following the argument here.

"I maintain, for example, that when ChatGPT begins a story with the words “Once upon a time,” which it does fairly often, that it “knows” where it is going and that its choice of words is conditioned on that “knowledge” as well as upon the prior words in the stream. It has invoked a ‘story telling procedure’ and that procedure conditions its word choice."

It feels like you're asserting this, but I don't see why it's true and don't think it is. I fully agree that it feels like it ought to be true: it is in some sense still... (read more)

Based on my incomplete understanding of transformers:

A transformer does its computation on the entire sequence of tokens at once, and ends up predicting the next token for each token in the sequence.

At each layer, the attention mechanism gives the stream for each token the ability to look at the previous layer's output for the other tokens before it in the sequence.

The stream for each token doesn't know if it's the last in the sequence (and thus that its next-token prediction is the "main" prediction), or anything about the tokens that come after it.

So each tok... (read more)
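
A minimal sketch of the masking behaviour described above, assuming a single attention layer in which queries, keys and values are all just the input vectors (real transformers use learned projections and many layers). It shows that the output at a position does not change when later tokens change.

```python
import math


def causal_attention(xs):
    """Single-head softmax attention where position i only sees positions 0..i.

    Simplifying assumption: queries, keys and values are the inputs themselves.
    """
    outputs = []
    for i, q in enumerate(xs):
        visible = xs[: i + 1]                      # causal mask: no later tokens
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        top = max(scores)                          # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(len(q))
        ])
    return outputs


# Two sequences that differ only in the last token (toy vectors).
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]
out_a = causal_attention(seq_a)
out_b = causal_attention(seq_b)
print(out_a[:2] == out_b[:2])  # earlier positions are unaffected
print(out_a[2] == out_b[2])    # the last position differs
```

Each position's stream is computed as if it were the end of the sequence, which is why the same forward pass can yield a next-token prediction at every position.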

To those that believe language models do not have internal representations of concepts:

I can help at least partially disprove the assumptions behind that.

There is convincing evidence otherwise, as demonstrated in an actual experiment using Othello. The researchers' conclusion:

"Our experiment provides evidence supporting that these language models are developing world models and relying on the world model to generate sequences."

1Bill Benzon
Thanks for this. I've read that piece and think it is interesting and important work. The concept of story trajectory that I am using plays a role in my thinking similar to the model of the Othello game board in your work.

Here's an analogy. AlphaGo had a network which considered the value of any given board position. It was separate from its Monte Carlo tree search, which explicitly planned the future. However it seems probable that in some sense, in considering the value of the board, AlphaGo was (implicitly) evaluating the future possibilities of the position. Is that the kind of evaluation you're suggesting is happening? "Explicitly" ChatGPT only looks one word ahead, but "implicitly" it is considering those options in light of future directions of development for the text?

Topics like this really draw a crowd, but if you don't know how it works, writing like this just adds energy in the wrong direction. If you start off small, building perceptrons by hand, you can work your way up through models to transformers, and it'll be clear what the math is attempting to do per word. It's sophisticatedly predicting the next word based on a matrix of relevance to the previous word and the block as a whole. The attention mechanism is the magic of relevance, but it is still predicting the next word.

1Bill Benzon
Fine. I take that to mean that the population from which the next word is drawn changes from one cycle to the next. That makes sense to me. And the way it changes depends in part on the previous text, but also on what it had learned during training, no?

I don't think knowledge is the right word. Based on your description, that would be more analogous to an instinct. Knowledge implies something like awareness, or planning. Instinct is something it just does because that's what it learnt. 

1Bill Benzon
1Iris Dogma
Yeah, I think those are similar concepts. Intuition and Instinct. Either word probably works. 

Charles Wang has just posted a short tweet thread which begins like this:

Next-token-prediction is an appropriate framing in LLM pretraining but a misframing at inference because it doesn’t capture what’s actually happening, which is about that which gives rise to the next token.

We’re all more or less doing that when we speak or write, though there are times when we may set out to be deliberately surprising – but we can set such complications aside


We're all more or less doing that when we speak or write?

1Bill Benzon

If you think of the LLM as a complex dynamical system, then the trajectory is a valley in the system’s attractor landscape.


The real argument here is that you can construct simple dynamical systems, in the sense that the equation is quite simple, that have complex behavior. For example, the Lorenz system, though there should be an even simpler example of, say, ergodic behavior.
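
As a small illustration of "simple equations, complex behavior", here is a sketch that integrates the Lorenz system with a plain forward-Euler step. The parameter values are the classic ones (sigma = 10, rho = 28, beta = 8/3); the step size, starting points and function names are my own choices for the example.

```python
import math


def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the three Lorenz equations."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dt * dx, y + dt * dy, z + dt * dz)


def trajectory(state, steps):
    """List of states, starting with the initial one."""
    points = [state]
    for _ in range(steps):
        state = lorenz_step(state)
        points.append(state)
    return points


def separation(p, q):
    """Euclidean distance between two states."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two runs whose starting points differ by one part in a hundred million.
run_a = trajectory((1.0, 1.0, 1.0), 5000)
run_b = trajectory((1.0 + 1e-8, 1.0, 1.0), 5000)
gaps = [separation(p, q) for p, q in zip(run_a, run_b)]
print(gaps[0], max(gaps))
```

Three short update rules are enough to produce trajectories that stay bounded yet diverge from nearly identical starting points, which is the sense in which a simple equation can have complex behavior.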

1Bill Benzon
When was the last time someone used the Lorenz system to define justice?

I had to resort to Google Translate:

"But because I have some obscure notion, which has some connection with what I am looking for, if I only boldly start with it, it molds the mind as the speech progresses, in the need to find an end to the beginning, that confused conception to complete clarity in such a way that, to my astonishment, the knowledge is finished with the period." Heinrich von Kleist (1805) On the gradual development of thoughts while speaking

While Wolfram's explanation is likely the fundamental premise upon which ChatGPT operates (from an initial design perspective), much of this article assumes a deeper functioning that, as is plainly admitted by the author, is unknown. We don't KNOW how LLMs work. To attribute anything more than reasonably understood neural weighting algos to its operations is blue sky guessing. Let's not waste time on that, nor on speculation in the face of limited accessible evidence one way or the other.

1Bill Benzon
As I understand it, the point of neural net architectures is that they can learn a wide variety of objects, with some architectural specialization to suit various domains. Thus, during training there is a sense in which they ‘take on’ the structure of objects in the domain over which they operate. That’s one thing I am assuming. I furthermore believe that, since GPTs work in the domain of language, and language is a highly structured domain, some knowledge of how language is structured is relevant to understanding what GPTs are doing. That, however, is not a mere assumption. We have some evidence about that. Here’s a passage from my working paper, ChatGPT intimates a tantalizing future, its core LLM is organized on multiple levels, and it has broken the idea of thinking: With this in mind, I want to turn to some work published by Christopher D. Manning et al. in 2020.[1] They investigated syntactic structures represented in BERT (Bidirectional Encoder Representations from Transformers). Early in the paper they observe: That is not what they found. They found syntax. They discovered that neural networks induce While BERT is a different kind of language technology than GPT, it does seem reasonable to assume that ChatGPT implements syntactic structure as well. Wouldn’t that have been the simplest, most parsimonious, explanation for its syntactic prowess? It would be a mistake, however, to think of story structure as just scaled-up syntactic structure.
1. ^ Christopher D. Manning, Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy, Emergent linguistic structure in artificial neural networks trained by self-supervision, PNAS, Vol. 117, No. 48, June 3, 2020, pp. 30046-30054.

I don't see how the analogy with humans helps. We don't know the "mechanism" behind how the human mind works. That's not the same as LLMs. We know exactly the mechanism by which an LLM works and produces its output. And the mechanism is no different from what it has been trained to do, i.e. predict the next word. There isn't any other mysterious mechanism at work during inference.

As for a plan, it doesn't have any plan. There's no "memory" for it to store a plan. It's just a big complex function that takes an input and produces an output, which is the next word. And then it repeats the process over and over until it's done.

1Bill Benzon
Those 175B weights? They're all memory.

A few observations

This conversation has been going on for a few days now and I’ve found it very helpful. I want to take a minute or two to step back and think about it, and about transformers and stories. Why stories? Because I’ve spent a lot of time having ChatGPT tell stories, getting a feel for how it does that. But I’m getting ahead of myself.

I wrote the OP because I felt a mismatch between what I feel to be the requirements for telling the kind of stories ChatGPT tells, and the assertion that it’s “just” predicting the next word, time after time after... (read more)

I think there's more going on than successive-word-prediction. Here's my experiment:

To save having to pull up the image:

me> tell me the end of some sentence

chatgpt> Sure, here's one: "Life is like a camera, focus on the good times and develop from the negatives."

me> tell me just the end

chatgpt> "focus on the good times and develop from the negatives"

me> ok make a new sentence that ends that way

chatgpt> Here's a new sentence: "When faced with challenges, focus on the good times, and develop from the negatives."

It t... (read more)

1Bill Benzon
Interesting. I've had times when it took 10s of seconds or even over a minute to respond. And I've had occasions when it didn't respond at all, or responded with an error condition after having eaten up over a minute of time. At one point I even considered timing its responses. But it's a public facility fielding who knows how many queries a second. So I don't know quite what to make of response times, even extremely long lags.
I suppose it's certainly possible the longer response time is just a red herring.  Any thoughts on the actual response (and process to arrive thereon)? Edit, for clarity, I mean how would it arrive at a grammatically and semantically correct response if it were only progressing successively one word at a time, rather than having computed the entire answer in advance and then merely responding from that answer one word at a time? For further clarity: I gave it no guidance tokens, so the only content it had to go off is the sentence it generated on its own.  Is the postulate then that its own sentence sent it somewhere in latent space and from there it decided to start at "When", then checked to see if it could append the given end-of-sentence text to create an answer?  With the answer being "no" then for next token from that same latent space it pulled "faced", and checked again to see if it could append the sentence remainder?  Same for "with", "challenges", "remember", "to", "keep", "a", "positive", and then after responding with "attitude" upon next token it decides it's able to proceed from the given sentence-end-text? It seems to me the alternative is that it has to be "looking ahead" more than one token at a time in order to arrive at a correct answer.
1Nanda Ale
>I suppose it's certainly possible the longer response time is just a red herring.  Any thoughts on the actual response (and process to arrive thereon)? Just double checking, I'm assuming all token take the same amount of time to predict in regular transformer models, the kind anyone can run on their machine right now? So ChatGPT if it varies, it's different? (I'm not technical enough to answer this question, but presumably it's an easy one for anyone who is.) One simple possibility is that it might be scoring the predicted text. So some questions are fine on the first try, while others it generates 5 responses and picks the best, or whatever. This is basically what I do personally when using GPT, and you can kind of automate it by asking GPT to criticize its own answers. FWIW my anecdotal experience with ChatGPT is that it does seem to take longer to think on more difficult requests. But I'm only thinking on past experience, I didn't try to test this specifically.
That's basically what I was alluding to by "brute-forced tried enough possibilities to come up with the answer."  Even if that were the case, the implication is that it is actually constructing a complete multi-token answer in order to "test" that answer against the grammatical and semantic requirements.  If it truly were re-computing the "correct" next token on each successive iteration, I don't see how it could seamlessly merge its individually-generated tokens with the given sentence-end text.

There's a lot of speculation for how these models operate. You specifically say "you don't know" how it works, but are suggesting the idea it has some sort of planning phase.

As Wolfram explains, the Transformer architecture predicts one word at a time based on the previous inputs run through the model.

Any planning you think you see is merely a trend based on common techniques for answering questions. The 5 sections of storytelling is an established technique that is commonly used in writing and thus embedded in the training of the model and seen in its r... (read more)

2Bill Benzon
If you look at the other comments I've made today you'll see that I've revised my view somewhat. As for real planning, that's certainly what Yann LeCun talked about in the white paper he uploaded last summer.

This whole conversation has been very helpful. Thanks for your time and interest.

Some further thoughts:

First, as I’ve suggested in the OP, I am using the term “story trajectory” to refer to the complete set of token-to-token transitions ChatGPT makes in the course of telling a story. The trajectories for these stories have five segments. Given this, it seems clear to me that these stories are organized on three levels: 1) individual sentences, 2) sentences within a segment of the story trajectory, and 3) the whole story trajectory.

That gives us three kinds of t... (read more)

I do not get your argument here, it doesn't track. I am not an expert in transformer systems or the in-depth architecture of LLMs, but I do know enough to make me feel that your argument is very off.

You argue that training is different from inference, as a part of your argument that LLM inference has a global plan. While training is different from inference, it feels to me that you may not have a clear idea as to how they are different.

You quote the accurate statement that "LLMs are produced by a relatively simple training process (minimizing loss on next-... (read more)

1Bill Benzon
Thanks for reminding me that training uses inference. As for ChatGPT having a global plan, as you can see if you look at the comments I've made earlier today, I have come around to that view. The people that wrote the stories ChatGPT consumed during training, they had plans, and those plans are reflected in the stories they wrote. That structure is “smeared” over all those parameter weights and gets “reconstructed” each time ChatGPT generates a new token. In his last book, The Computer and the Brain, John von Neumann noted, quite correctly, that each neuron is both a memory store and a processor. Subsequent research has made it clear that the brain stores specific things – objects, events, plans, whatever – in populations of neurons, not individual neurons. These populations operate in parallel. We don’t yet have the luxury of such processors so we have to make do with programming a virtual neural net to run on a processor having way more memory units than processing units. And so our virtual machine has to visit each memory unit every time it makes one step in its virtual computation.
1james wolf
It does seem like there are “plans” or formats in place, not just choosing the next best word. When it creates a resume, or a business plan or timeline, it seems much more likely that there is some form of structure that it is using as a template, and then choosing the words that would go best in their correct places. Stories have a structure: beginning, middle, end. So it’s not just picking words; it’s picking the words that go best with a beginning, then the words that go best with a middle, and then an end. If it was just choosing next words you could imagine it being a little more creative and less formulaic. This model was trained by humans, who told it when it had the structure right, and the weights got placed heavier where it conformed to the right preexisting plan. So if anything, the “neural” pathways that formed the strongest connections are the ones that 1. resulted in the best use of tokens and 2. were weighted deliberately higher by the human trainers.

I don't think the story structure is any compelling evidence against it being purely next token prediction. When humans write stories it is very common for them to talk about a kind of flow-state where they have very little idea what the next sentence is going to say until they get there. Stories made this way still have the beginning, middle, and end, because if you have nothing written so far you must be at the beginning. If you can see a beginning you must be in the middle, and so on. Sometimes these stories just work, but more often the ending needs a bi... (read more)

2Bill Benzon
Quick reply, after doing a bit of reading and recalling a thing or two: In a 'classical' machine we have a clean separation of process and memory. Memory is kept on the paper tape of our Turing Machine and processing is located in, well, the processor. In a connectionist machine process and memory are all smushed together. GPTs are connectionist virtual machines running on a classical machine. The "plan" I'm looking for is stored in the parameter weights, but it's smeared over a bunch of them. So this classical machine has to visit every one of them before it can output a token. So, yes, purely next token prediction. But the prediction cycle, in effect, involves 'reassembling' the plan each time through. To my mind, in order to say we "understand" how this puppy is telling a story, we need to say more than it's a next-token-prediction machine. We need to say something about how that "plan" is smeared over those weights. We need to come up with concepts we can use in formulating such explanations. Maybe the right concepts are just lying scattered about in dusty old file cabinets someplace. But, I'm thinking this is likely, we have to invent some new ones as well. Wolfram was trained as a physicist. The language of complex dynamics is natural to him, whereas it's a poorly learned third or fourth language for me. So he talks of basins of attractors and attractor landscapes. As far as I can tell, in his language, those 175B parameters can be said to have an attractor landscape. When ChatGPT tells a story it enters the Story Valley in that landscape and walks a path through that valley. When it's done with the story, it exits that valley. There are all kinds of valleys (and valleys within valleys (and valleys within them)) in the attractor landscape, for all kinds of tasks. FWIW, the human brain has roughly 86B neurons. Each of those is connected with roughly 10K other neurons. Those connections are mediated by upward of 100 different chemicals. And those neurons a

A story within a story.

I want you to tell a story within a story. Imagine that Frank is walking in the woods with his young daughter, Jessie. They come across the carcass of a dead squirrel. Jesse is upset, so Frank tells her a story to calm her down. When he finishes the story, they continue on the walk where the arrive at the edge of a beautiful pool deep in the forest. They pause for a moment and then return home.

As Frank and Jessie walked through the woods, they stumbled upon the lifeless body of a small grey squirrel lying on the ground. Jessie was vi... (read more)


I understand your argument as something like "GPT is not just predicting the next token because it clearly 'plans' further ahead than just the next token".

But "looking ahead" is required to correctly predict the next token and (I believe) flows naturally from the paradigm of "predicting the next token".

1Bill Benzon
That is, based on past experience in similar contexts, it makes its best guess about what will happen next. Is that right? How far back does it look? I've been examining stories that are organized on three levels: 1) the whole story, 2) major segments, and 3) sentences within major segments. The relevant past differs across those levels. At the level of the whole story, at the beginning the relevant past is either the prompt that gave rise to the story, or some ChatGPT text in which a story is called for. At the end of the story, ChatGPT may go into a wait state if it is responding to an external prompt, or pick up where it left off if it told the story in the context of something else – a possibility I think I'll explore a bit. At the level of a major segment, the relevant context is the story up to that point. And at the level of the individual sentence the relevant context is the segment up to that point.
My model is that LLMs use something like "intuition" rather than "rules" to predict text - even though intuitions can be expressed in terms of mathematical rules, just more fluid ones than we usually see in "rules".

My specific guess is that the gradient descent process that produced GPT has learned to identify high-level patterns/structures in texts (and specifically, stories), and uses them to guide its prediction. So, perhaps, as it is predicting the next token, it has a "sense" of:

- that the text it is writing/predicting is a story
- what kind of story it is
- which part of the story it is in now
- perhaps how the story might end (is this a happy story or a sad story?)

This makes me think of top-down vs bottom-up processing. To some degree, the next token is predicted by the local structures (grammar, sentence structure, etc). To some degree, the next token is predicted by the global structures (the narrative of a story, the overall purpose/intent of the text). (There are also intermediate layers of organization, not just "local" and "global".) I imagine that GPT identifies both the local structures and the global structures (has neuron "clusters" that detect the kind of structures it is familiar with), and synergizes them into its probability outputs for next token prediction.
1Bill Benzon
Makes sense to me. I wonder if those induction heads identified by the folks at Anthropic played a role in identifying those "high-level patterns/structures in texts..."

Likewise, LLMs are produced by a relatively simple training process (minimizing loss on next-token prediction, using a large training set from the internet, Github, Wikipedia etc.) but the resulting 175 billion parameter model is extremely inscrutable.

So the author is confusing the training process with the model. It’s like saying “although it may appear that humans are telling jokes and writing plays, all they are actually doing is optimizing for survival and reproduction”. This fallacy occurs throughout the paper.


The train/test framework is not hel... (read more)

4Bill Benzon
What I'm arguing is that what LLMs do goes way beyond predicting the next word. That's just the proximal means to an end, which is a coherent statement.