instruction tuning and autoregressive distribution shift

[-]kave1y40

Is the central argumentative line of this post that high-quality & informative text in the training distribution rarely corrects itself, post-training locates the high-quality part of the distribution, and so LLMs rarely correct themselves?

Or is it the more specific claim that post-training is locating parts of the distribution where the text is generated by someone in a context that highlights their prestige from their competence, and such text rarely corrects itself?

I don't see yet why the latter would be true, so my guess is you meant the former. (Though I do think the latter prompt would more strongly imply non-self-correction).

[-]Nathan Helm-Burger1y40

Papers that I think make some relevant points:

LLMs, without calibration fine-tuning, tend to be quite overconfident.
https://arxiv.org/abs/2306.13063 Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
https://fse.studenttheses.ub.rug.nl/32044/ Confidence is Key: Uncertainty Estimation in Large Language Models and Vision Language Models
There is a variety of ongoing work on post-training modifications to make LLMs more calibrated. It seems hard to say, without comparison studies, which of the many proposed techniques (or combinations thereof) might work best for this. It does seem likely to me that this problem will lessen over time.
Here are a few examples, but there are many more out there:
https://arxiv.org/abs/2403.05973 Calibrating Large Language Models Using Their Generations Only
https://arxiv.org/abs/2402.06544 Calibrating Long-form Generations from Large Language Models
https://arxiv.org/html/2404.09127v3 Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation
https://arxiv.org/abs/2404.19318 Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores
https://openreview.net/forum?id=jH67LHVOIO LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses
https://arxiv.org/abs/2404.02655 Calibrating the Confidence of Large Language Models by Eliciting Fidelity
https://arxiv.org/abs/2404.04689 Multicalibration for Confidence Scoring in LLMs
https://arxiv.org/abs/2402.04957 Reconfidencing LLMs from the Grouping Loss Perspective
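
For anyone who wants a concrete handle on what "overconfident" means here: a common way to quantify it is expected calibration error (ECE), which compares stated confidence against observed accuracy within confidence bins. The sketch below is only illustrative. The function name, the equal-width binning, and the toy numbers are my assumptions, not taken from any of the papers above, and individual papers may use different metrics or binning schemes.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: stated probabilities in [0, 1], one per answer
    correct:     booleans, True where the answer was actually right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many answers fall in it.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: a model that always says "90% sure" but is right one time in four.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
```

A well-calibrated model would score near 0 on this metric; the toy model above has a large gap between its stated confidence and its accuracy.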

[-]Thane Ruthenis1y20

o1 seems to make progress on this problem. Consider the following part of the CoT from the Math section here:

Similarly, since $s(x)$ is of degree…
Let me compute the degree of $s(x)$

It starts a thought that's supposed to complete in some statement of fact. The relevant fact happens to be something the model didn't explicitly infer yet. Instead of inventing something on the fly to fill in the blank, as it'd do if it were mimicking a confidently-written document, it realizes it doesn't know that fact yet, backpedals, and proceeds to infer it.

Thoughts?

[-]kave1y22

I'm not sure whether this is important to the main thrust of the post, but I disagree with most of this paragraph:

Again, they're an expert in the field -- and this is the sort of claim that would be fairly easy to check even if you're not an expert yourself, just by Googling around and skimming recent papers. It's also not the sort of claim where there's any obvious incentive for deception. It's hard to think of a plausible scenario in which this person writes this sentence, and yet the sentence is false or even controversial.

In my experience, it's quite hard to check what "the gold standard" of something is, particularly in cutting-edge research fields. There are lots of different metrics on which methods compete, and it's hard to know their importance as an outsider.

And the obvious incentive for deception is that the physics prof works on NPsM, and so is talking it up (or has developed a method that beats NPsM on some benchmark, and so is talking NPsM up to impress people with their new method...).

[-]nostalgebraist1y42

Yeah, you're totally right -- actually, I was reading over that section just now and thinking of adding a caveat about just this point, but I worried it would be distracting.

But this is just a flaw in my example, not in the point it's meant to illustrate (?), which is hopefully clear enough despite the flaw.

^{^}

I include the caveat "at least not to the same extent" because of nuances involving LMs doing CoT-style reasoning, LMs "reminding themselves" of things that they in-some-sense "know" yet sometimes "forget," etc.

^{^}

For instance, one obvious approach would be to start off with HH chat tuning (producing an "expert" that bullshits when it doesn't know the answer), and then do a second tuning phase on text generated by this "expert" that encourages it to be more cautious in cases where the originally generated text wasn't actually-true (and/or where its content was inconsistent across multiple sampling runs, or something).
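
To make the "inconsistent across multiple sampling runs" signal a bit more concrete, here is a minimal sketch of how one might flag prompts for the second, more cautious tuning phase. Everything in it is hypothetical: the function names, the exact-match comparison after lowercasing, and the 0.6 threshold are illustrative assumptions, and a real pipeline would need a much better notion of "same answer" than string equality.

```python
from collections import Counter


def consistency_score(samples):
    """Fraction of sampled answers that agree with the most common answer."""
    normalized = [s.strip().lower() for s in samples]
    if not normalized:
        raise ValueError("need at least one sample")
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def select_for_cautious_tuning(prompt_to_samples, threshold=0.6):
    """Return prompts whose sampled answers disagree too often.

    prompt_to_samples maps each prompt to a list of answers sampled
    from the first-phase "expert" model (hypothetical data layout).
    """
    return [
        prompt
        for prompt, samples in prompt_to_samples.items()
        if consistency_score(samples) < threshold
    ]


# Toy example with made-up samples.
flagged = select_for_cautious_tuning({
    "capital of France?": ["Paris", "paris", "Paris "],
    "obscure question?": ["A", "B", "C", "A"],
})
```

Prompts flagged this way are the ones where the second phase would encourage more hedged output; checking whether the text was actually true would need a separate source of ground truth.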
