I have been thinking a lot recently about the problems of hallucination and sycophancy in LLMs, having developed a couple of hypotheses on the matter. This post will discuss those hypotheses, and a proposal for how to mitigate these problems if my ideas hold. As with my last proposal, I make the disclaimer that it's pretty much impossible to know whether a deep learning idea works until you've tried it, so this is largely just thinking aloud.
My ultimate proposal is a scalable oversight method in the vein of Constitutional AI, where LLMs are used to continually reprogram other LLMs in order to enforce certain output behavior.
I suspect one of the big challenges to addressing sycophancy is that we don't necessarily want to get rid of a model's ability to correct itself. If it says something incorrect and I tell it the fix, I still want it to take the correction. I just don't want it to make something up in order to agree with me, or to let me steer it into a falsehood.
This creates a challenge that I don't think can be solved through RL or supervision alone. For RL, credit assignment is ambiguous in this scenario: is the model being rewarded for correcting itself in the right direction, or merely for correcting itself at all? With supervision, we are still modeling the behavior of self-correction, and that behavior may be all the model learns. In either case, the model gets the explicit message that it must give a definitive response, while the importance of correctness must be learned implicitly.
As I see it, a good solution requires a feedback loop that allows the model to be introspective. The model has to know when it doesn't know, and it needs to say so. It should behave like a good rationalist and use the language of belief, not fact. It should treat corrections like new pieces of Bayesian evidence, not logical certainties it must reason around. When it's corrected, it shouldn't always say "yes, you're right." It should instead say "My new belief is...".
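To make the "corrections as Bayesian evidence" framing concrete, here is a toy numerical sketch. The prior and the likelihood ratio assigned to a user's pushback are invented purely for illustration; nothing here is a claim about real model internals.

```python
# Toy illustration: treat a user's "no, that's wrong" as evidence to weigh,
# not as a command to flip the answer. All numbers are made up.

def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayesian update of P(claim is true) given evidence with likelihood
    ratio P(evidence | true) / P(evidence | false)."""
    odds = prior / (1.0 - prior)
    odds *= likelihood_ratio
    return odds / (1.0 + odds)

prior = 0.80             # model's initial confidence in its answer
lr_pushback = 1.0 / 3.0  # assume pushback is 3x likelier when the model really is wrong

print(f"{posterior(prior, lr_pushback):.2f}")  # ~0.57: hedge, don't capitulate
```

The point is that a single correction should move the model from "X is true" toward "I'm no longer sure about X," not all the way to "you're right, it's Y."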
There has already been a good amount of research showing that LLMs (without instruction tuning) can be well calibrated in their answers,[1] and even some early research into using calibrated language to modulate the certainty with which models express themselves.[2] Note, however, that model calibration appears to break down when instruction tuning is performed, as reported in the GPT-4 paper.
What there hasn't been, to my knowledge, is any systematic attempt to calibrate the outputs of a model during instruction tuning so that it speaks with calibrated language in the general case.[3] Nor do I know of any deep exploration of the implications of this for sycophancy.
My proposal, then, is to use a procedure akin to Constitutional AI to textually calibrate all outputs of the model, replacing factual phrasing with statements of belief. So if the model has 70% confidence in an answer, it might qualify its statement with "I believe." The paper referenced above[2] showed the benefit of this in a narrow setting, but I want to take those results out of the realm of trivia answers and apply them scalably across all of the model's outputs.
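As a purely illustrative sketch of what "textually calibrating" a statement could look like, here is one possible mapping from confidence bands to hedged phrasings. The thresholds and phrases are my own arbitrary choices, not something taken from the cited paper.

```python
# Illustrative mapping from a confidence estimate to calibrated language.
# The bands and phrasings are arbitrary example choices.

def hedge(statement: str, confidence: float) -> str:
    if confidence >= 0.95:
        return statement                                   # assert it as fact
    if confidence >= 0.70:
        return f"I believe {statement}"
    if confidence >= 0.40:
        return f"I'm not certain, but I think {statement}"
    return f"I don't really know; my best guess is that {statement}"

print(hedge("the capital of Australia is Canberra", 0.72))
# -> "I believe the capital of Australia is Canberra"
```

In practice the rewriting would presumably be done by an LLM prompted with the confidence score rather than by a hard-coded template; the snippet just pins down the intended input-output relationship.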
The basic recipe would be simple:

1. Collect a broad set of prompts and sample the instruction-tuned model's responses to them.
2. Use an LLM to break each response down into isolated, self-contained factual statements.
3. Use the pretrained base model to estimate a confidence for each statement.
4. Use an LLM to rewrite the response, replacing assertions of fact with calibrated statements of belief that reflect those confidences.
5. Fine-tune the model on the rewritten responses.
Note that because the confidence in step three is computed by a model that underwent the same pretraining as the main one, we would expect it to be well calibrated for the fine-tuned model as well. I think this would go a long way towards aligning what the model says with what it actually believes. This approach also reduces, or even removes, the perverse incentive for an LLM to always give a definitive answer.
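Here is a minimal sketch of how that data-generation loop might be wired together. The two callables are stand-ins for model access (a chat-style completion function, and a function returning the base model's probability that a statement is true, e.g. read off "True"/"False" token logprobs); they are assumptions of this sketch, not a real API.

```python
from typing import Callable

Chat = Callable[[str], str]        # prompt -> response text
TrueProb = Callable[[str], float]  # statement -> base model's P(statement is true)

def calibrated_example(prompt: str, model: Chat, helper: Chat, true_prob: TrueProb) -> str:
    """Produce one calibrated training response (roughly steps 1-4 of the recipe)."""
    response = model(prompt)  # step 1: sample a response from the fine-tuned model

    # Step 2: have a helper LLM extract self-contained factual claims.
    claims = [
        line.strip() for line in helper(
            "List every factual claim in the following text as a standalone "
            f"sentence, one per line:\n\n{response}"
        ).splitlines() if line.strip()
    ]

    # Step 3: score each claim with the base model, which shares the
    # fine-tuned model's pretraining (and, hopefully, its calibration).
    scored = "\n".join(f"- {c} (confidence {true_prob(c):.2f})" for c in claims)

    # Step 4: rewrite the response so each claim is hedged to match its score.
    return helper(
        "Rewrite the response below so that each claim is expressed with language "
        "matching its confidence (e.g. 'I believe', 'I'm not sure'), without adding "
        f"or removing content.\n\nResponse:\n{response}\n\nClaims:\n{scored}"
    )

# Step 5 would then be ordinary supervised fine-tuning on (prompt, calibrated response) pairs.
```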
I hope as well that this would reduce sycophancy by giving the model an "out": a way to indicate that it does not know something. It may also reinforce a habit of constant self-assessment in the model, making it more honest.
Of course, the devil will be in the details, and there is an entire universe of unanswered questions and technicalities to address before something like this could be done. The second stage in particular would require a lot of care, since some of the model's statements may be highly contextual and would need to be expanded into longer, standalone statements to isolate properly. The fourth stage, too, has many degrees of freedom. I don't think either of these issues represents a fundamental barrier, but both would require a lot of initial exploration to pin down.
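To illustrate the stage-two difficulty with a small hypothetical example: a sentence pulled from the middle of a response usually can't be scored on its own, because its referents live elsewhere in the conversation.

```python
# Hypothetical example of why claim isolation (stage two) needs care.
raw_statement = "It was first released in 2015."           # unscoreable in isolation
conversation_context = "The user asked about TensorFlow's history."
isolated_claim = "TensorFlow was first released in 2015."  # what the base model should score
```

Resolving pronouns and implicit context like this is exactly the kind of work the second stage's helper model would have to do reliably.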
My proposal is inspired by the following hypotheses:
As a final thought, here are some reasons why it might not work:
Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models. This is done on GPT-3.
I view this as taking the research I cited to its logical conclusion. Probably there are labs working on it right now, but that research is not published as far as I know.
Note that I'm not saying that instruction-tuning doesn't help models use their own knowledge more in general, just that it might not be able to bridge certain gaps if the model hasn't memorized the appropriate facts.
Of course, even if the model doesn't have the fact explicitly memorized, there may be other viable approaches to extracting the knowledge, like ensembling.