Is the central argumentative line of this post that high-quality & informative text in the training distribution rarely corrects itself, post-training locates the high-quality part of the distribution, and so LLMs rarely correct themselves?
Or is it the more specific claim that post-training is locating parts of the distribution where the text is generated by someone in a context that highlights their prestige from their competence, and such text rarely corrects itself?
I don't see yet why the latter would be true, so my guess is you meant the former. (Though I do think the latter prompt would more strongly imply non-self-correction.)
Papers that I think make some relevant points:
I'm not sure whether this is important to the main thrust of the post, but I disagree with most of this paragraph:
Again, they're an expert in the field -- and this is the sort of claim that would be fairly easy to check even if you're not an expert yourself, just by Googling around and skimming recent papers. It's also not the sort of claim where there's any obvious incentive for deception. It's hard to think of a plausible scenario in which this person writes this sentence, and yet the sentence is false or even controversial.
In my experience, it's quite hard to check what "the gold standard" of something is, particularly in cutting-edge research fields. There are lots of different metrics on which methods compete, and it's hard to know their importance as an outsider.
And the obvious incentive for deception is that the physics prof works on NPsM, and so is talking it up (or has developed a method that beats NPsM on some benchmark, and so is talking it up to impress people with their new method ...)
Yeah, you're totally right -- actually, I was reading over that section just now and thinking of adding a caveat about just this point, but I worried it would be distracting.
But this is just a flaw in my example, not in the point it's meant to illustrate (?), which is hopefully clear enough despite the flaw.
[Note: this began life as a "Quick Takes" comment, but it got pretty long, so I figured I might as well convert it to a regular post.]
In LM training, every token provides new information about "the world beyond the LM" that can be used/"learned" in-context to better predict future tokens in the same window.
But when text is produced by autoregressive sampling from the same LM, it is not informative in the same way, at least not to the same extent[1]. Thus, sampling inevitably produces a distribution shift.
I think this is one of the reasons why it's (apparently) difficult to get instruction-tuned / HH-tuned models to report their uncertainty and level of competence accurately, rather than being overconfident.
(I doubt this is a novel point, I just haven't seen it spelled out explicitly before, and felt like doing so.)
Imagine that you read the following (as the beginning of some longer document), and you trust that the author is accurately describing themselves:
I made up all those buzzwords, but imagine that this is a real field, albeit one you know virtually nothing about. So you've never heard of "NPsM" or any other competing method.
Nonetheless, you can confidently draw some conclusions just from reading this snippet and trusting the author's self-description:
During training, LLMs are constantly presented with experiences resembling this one.
The LLM is shown texts about topics of which it has incomplete knowledge. It has to predict each token from the preceding ones.
Whatever new information the text conveys about the topic may make it into the LLM's weights, through gradient updates on this example. But even before that happens, the LLM can also use the kind of reasoning shown in the bulleted list above to improve its predictions on the text right now (before any gradient updates).
That is, the LLM can do in-context learning, under the assumption that the text was produced by an entity outside itself -- so that each part of the text (potentially) provides new information about the real world, not yet present in the LLM's weights, that has useful implications for the later parts of the same text.
So, all else being equal, LLMs will learn to apply this kind of reasoning to all text, always, ubiquitously.
But autoregressive sampling produces text that is not informative about "the world outside" in the same way that all the training texts were.
During training, when an LLM sees information it doesn't know yet, it's incentivized to think: "ooh, new info! I should leverage this to predict the rest of the text!" But during sampling, any information in the sampled text which the LLM "doesn't know" is (by definition) confabulated, and updating on it as though it's real will only make the LLM more confused about reality.
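A toy sketch of this dynamic, under strong simplifying assumptions: the "LM" here is just a Beta-Bernoulli learner estimating the bias of a coin, and "in-context learning" is a Bayesian update on each observed token. Nothing below comes from a real LM; the names `update`, `run`, and `mean_abs_error` are hypothetical and exist only for this illustration. The point is to compare updating on tokens drawn from the world against updating on tokens drawn from the learner's own predictions.

```python
import random

def update(a, b, x):
    """Bayesian update of Beta(a, b) pseudo-counts on one binary observation."""
    return (a + 1, b) if x else (a, b + 1)

def run(true_p, steps, self_sample, rng, a=1.0, b=1.0):
    """Observe `steps` tokens and update on each one.

    self_sample=False: tokens come from the world (a coin with bias true_p).
    self_sample=True:  tokens are drawn from the learner's own current
                       prediction, then treated as if they were evidence.
    """
    for _ in range(steps):
        p = a / (a + b) if self_sample else true_p
        x = rng.random() < p
        a, b = update(a, b, x)
    return a, b

def mean_abs_error(true_p, steps, self_sample, trials=2000, seed=0):
    """Average distance between the final estimate and the true bias."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a, b = run(true_p, steps, self_sample, rng)
        total += abs(a / (a + b) - true_p)
    return total / trials

if __name__ == "__main__":
    print("error after real tokens:   ", mean_abs_error(0.9, 200, False))
    print("error after sampled tokens:", mean_abs_error(0.9, 200, True))
```

In both conditions the learner ends up with the same total pseudo-count, so by its own lights it is equally "confident" either way. Only the run that saw real tokens should end up close to the true bias; the self-sampled run becomes sure of whatever value its own early samples happened to suggest.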
In some sense, all instruction/chat tuning (including SFT, RLHF, etc.) is simply a less crude version of the popular style of LLM prompt that starts off like:
That is, instruction/chat tuning is trying to steer the outputs of the model so that it looks like the model is conditioning on "the output is high-quality."
(I often think about these techniques as a form of "ecological evaluation" as I defined it here, just in a weird "meta" way that I hadn't imagined as a possibility when I wrote that post.
Rather than fixing a specific task and then giving the model direct incentives to do a good job at that one task, these methods show the model a bunch of pairs like (task description X, text that does a good job at X), and give the model a direct incentive to produce the latter from the former. The model's generalization and language-understanding abilities are leveraged to learn the general rule "given task description X, what follows is text that does a good job at X," and this works even for X that were never seen in training.)
Unfortunately, there is an inherent tension between this kind of thing and the dynamic described above.
We want the model to give the task ("X") its best shot -- to produce something that aligns with its own internal representation of "actual high-quality X performance in the real world," as opposed to "entity Y's typical attempt at X" or "what entity Z thinks it means to do X well."
For instance, with declarative knowledge, we may want the model to report its sense of the latent actual truth that implicitly guides all training texts in one way or another, as opposed to its sense of what some particular person writing this or that particular text would probably say.
So, we (explicitly or implicitly) condition the model on quality. We convey to the model that it's generating the kind of text which is correlated with actual-truth, as opposed to just being what some guy said: text by "an expert," in a sort of idealized sense of the term.
Effectively, we make the model act as though it's always generating texts similar to my example from the Princeton physics professor, with the exact nature of the (implicit) expertise always precisely selected to be ideal for the task at hand, for correlation with actual-truth and actual-doing-a-good-job.
But then -- all else being equal, i.e. if post-training isn't set up to provide a clear signal that this is bad -- we are effectively maximizing the extent to which the LLM will exhibit the in-context learning dynamic I described earlier, with the LLM viewing its own confabulations as valuable evidence about reality, provided by a "reliable source" from the world beyond its weights!
Hence, I think, the extreme confidence of instruct/chat-tuned models, and their extreme reluctance to revise their opinions (unless directly asked to do so, and sometimes even then), or to say anything amounting to "I notice that I am confused."
Why would it say "whoops, I was wrong, the answer's actually Q (not P like I said before)"? It's an expert, it would know this sort of thing already. (What sort of expert? Why, exactly the sort who would know this sort of thing, whatever "this sort of thing" happens to be.)
Why would it notice its own confusion? To do so (and be right), it has to first say something confused. But the ideal expert is never confused in the first place. The surest way to be correlated with actual-truth is to only say true things, and never say anything else.
I don't think this is the only reason that it's difficult to get such models to accurately report their own confidence and capability level.
It's also relatively difficult to produce training data / annotations for this kind of behavior.
To produce data that trains the model to always act like an "ideal expert" (even in cases where the model doesn't have the knowledge to back up this facade), the annotator only needs to determine what's actually-true. This will train the model to do the right thing in cases where it does have the knowledge, and to bullshit in all other cases.
But, to get the model to (e.g.) say "I don't know" instead of bullshitting, the annotator needs to additionally know what the model knows, as distinct from what's actually-true. And that's hard to determine! I don't think this is difficult in some deep, fundamental sense[2], but it is at least strictly harder than just providing high-quality demonstrations.
The dynamic described earlier is an additional factor that means the behavior we want never happens by default. Therefore, we have to explicitly train for it if we want it. But as just noted, training for it is not easy.
[1] I include the caveat "at least not to the same extent" because of nuances involving LMs doing CoT-style reasoning, LMs "reminding themselves" of things that they in-some-sense "know" yet sometimes "forget," etc.
[2] For instance, one obvious approach would be to start off with HH chat tuning (producing an "expert" that bullshits when it doesn't know the answer), and then do a second tuning phase on text generated by this "expert" that encourages it to be more cautious in cases where the originally generated text wasn't actually-true (and/or where its content was inconsistent across multiple sampling runs, or something).
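A minimal sketch of how targets for that second tuning phase might be chosen, assuming we already have several sampled answers to the same question from the first-phase "expert." Everything here is hypothetical: the function names, the 0.8 agreement threshold, the fallback string, and the use of exact string matching (after normalization) as a stand-in for "is this actually-true" and "are these samples consistent." A real pipeline would need something much less crude.

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def cautious_target(samples, reference=None, min_agreement=0.8,
                    fallback="I don't know."):
    """Pick a training target from several sampled answers to one question.

    Returns the majority answer only if the samples mostly agree with each
    other and (when a reference answer is supplied) the majority answer
    matches it. Otherwise returns the cautious fallback.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(s) for s in samples)
    top_answer, top_count = counts.most_common(1)[0]
    if top_count / len(samples) < min_agreement:
        return fallback  # inconsistent across sampling runs
    if reference is not None and top_answer != normalize(reference):
        return fallback  # consistent, but not actually-true
    return top_answer
```

Note that the annotator still only supplies the reference answer; the sampling-consistency check is what stands in for "what the model knows."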