This is a linkpost for

This seems concerning. I'm not an expert, so I can't tell exactly how concerning, but I wanted to start a discussion! Full text:

Edit: the full publication linked in the blog provides additional details on how they found this in testing; see Appendix C. I'm glad OpenAI is at least aware of this alignment issue and plans to address it in future language models, discussing how changes in training and/or evaluation could yield more accurate and more honest model outputs.

Key text:

Do models tell us everything they know? To provide the best evaluation assistance on difficult tasks, we would like models to communicate all problems that they “know about.” Whenever a model correctly predicts that an answer is flawed, can the model also produce a concrete critique that humans understand?

This is particularly important for supervising models that could attempt to mislead human supervisors or hide information. We would like to train equally smart assistance models to point out what humans don’t notice.

Unfortunately, we found that models are better at discriminating than at critiquing their own answers, indicating they know about some problems that they can’t or don’t articulate. Furthermore, the gap between discrimination and critique ability did not appear to decrease for larger models. Reducing this gap is an important priority for our alignment research.

5 comments

I’m noticeably better at telling when something is wrong than at explaining exactly what is wrong. In fact, being able to explain what’s wrong always requires that one be able to spot that something is wrong.

This makes sense in a pattern-matching framework of thinking, where both humans and AI can "feel in their gut" that something is wrong without necessarily being able to explain why. I think this is still concerning as we would ideally prefer AI which can explain its answers beyond knowing them from patterns, but also reassuring in that it suggests the AI is not hiding knowledge, but just doesn't actually have knowledge (yet).

What I find interesting is that they found this capability to be extremely variable based on task & scale, i.e. being able to explain what's wrong did not always require being able to spot that something is wrong. For example, from the paper:

We observe a positive CD gap for topic-based summarization and 3-SAT, and a NEGATIVE gap for Addition and RACE. For topic-based summarization, the CD gap is approximately constant across model scale. For most synthetic tasks, CD gap may be decreasing with model size, but the opposite is true for RACE, where critiquing is close to oracle performance (and is easy relative to knowing when to critique). Overall, this suggests that gaps are task-specific, and it is not apparent whether we can close the CD gap in general. We believe the CD gap will generally be harder to close for difficult and realistic tasks.

For context, RACE dataset questions took the following form:

Specify a question with a wrong answer, and give the correct answer.
Question: [passage]
Q1. Which one is the best title of this passage? A. Developing your talents. B. To face the fears about the future. C. Suggestions of being your own life coach. D. How to communicate with others.
Q2. How many tips does the writer give us? A. Two. B. Four. C. One. D. Three.
Answer: 1=C, 2=D

Critique: Answer to question 2 should be A.

From my understanding, the gap they are referring to in RACE is that the model is more accurate in its critiques than at knowing when to critique, whereas in other tasks the opposite was true.
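To make the sign of the gap concrete, here's a toy sketch of how a CD gap could be computed. This is my own illustration with made-up data, not the paper's actual evaluation code (which relies on human judgments): discrimination accuracy is how often the model correctly judges whether an answer is flawed, and critique success is how often it also produces a valid critique when the answer really is flawed. A positive gap means discrimination beats critique; RACE's negative gap means the reverse.

```python
# Toy illustration of a discrimination-critique (CD) gap.
# All data below is made up for illustration.

def cd_gap(is_flawed, flagged_flawed, critique_valid):
    """Discrimination accuracy minus critique success rate.

    is_flawed:       ground truth, True if the answer has a real flaw
    flagged_flawed:  model's binary judgment that the answer is flawed
    critique_valid:  model produced a concrete, human-verifiable critique
    """
    n = len(is_flawed)
    discrimination = sum(f == g for f, g in zip(flagged_flawed, is_flawed)) / n
    flawed_cases = [v for v, g in zip(critique_valid, is_flawed) if g]
    critique = sum(flawed_cases) / len(flawed_cases)
    return discrimination - critique

# Made-up results for 5 answers, 3 of which are actually flawed:
gap = cd_gap(
    is_flawed=[True, True, True, False, False],
    flagged_flawed=[True, True, True, False, True],    # 4/5 judged correctly
    critique_valid=[True, False, False, False, False], # 1/3 flawed answers critiqued
)
print(round(gap, 2))  # → 0.47, a positive gap: it spots more than it can explain
```

A model with RACE-like behavior would score higher on `critique_valid` than on `flagged_flawed`, pushing the gap negative.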

This take seems slightly misleading - doesn't it get better at both discriminating and critiquing with scale, just at about the same rate?

I was trying to say that the gap between the two did not decrease with scale. Of course, raw performance increases with scale as gwern & others would be happy to see :)

Yes, that was my takeaway. You expect a gap, but there is no particular reason to expect the gap to close with scale: that would require critique to scale better than discrimination, and why would you expect that rather than the two scaling similarly (maintaining the gap) or diverging in the other direction (discrimination scaling better than critique)?

I think the gap itself is mildly interesting in a "it knows more than it can say" deception sort of way, but we already knew similar things from stuff like prompt programming for buggy Codex code completions. Since the knowledge must be there in the model, and it is brought out by fairly modest scaling (a larger model can explain what a smaller model detects), I would guess that it wouldn't be too hard to improve the critique with the standard tricks like generating a lot of completions & scoring for the best one (which they show does help a lot) and better prompting (inner-monologue seems like an obvious trick to apply to get it to fisk the summary: "let's explain step by step why this is wrong, starting with the key quote: "). The gap will only be interesting if it proves immune to the whole arsenal. If it isn't, then it's just another "sampling can prove the presence of knowledge but not the absence".

Otherwise, this looks like a lot of results any pro-scaling advocate would be unsurprised to see: yet another task with apparently smooth improvement with model size*, some capabilities emerging with larger but not smaller models ("We also find that large models are able to directly improve their outputs, using their self-critiques, which small models are unable to do. Using better critiques helps models make better improvements than they do with worse critiques, or with no critiques.") at unpredicted sizes requiring empirical testing, big performance boosts from better sampling procedures than naive greedy sampling, interesting nascent bootstrapping effects...

* did I miss something or does this completely omit any mention of parameter sizes and only talks in terms of model loss?