Another important point on this topic is that I expect it's impossible to produce weak-to-strong generalization techniques that look good according to meta-level adversarial evaluations, while I expect that some scalable oversight techniques will look good by that standard. And so it currently seems to me that scalable-oversight-style techniques are a more reliable response to the problem "your oversight performs worse than you expected, because your AIs are intentionally subverting the oversight techniques whenever they think you won't be able to evaluate that they're doing so".
Another approach is to make the strong student (the AI system) generalize correctly from the imperfect labels provided by the weak teacher. The hope for these weak-to-strong generalization techniques is that we can do better than naively relying on unreliable feedback from a weak overseer and instead access the latent, greater, capabilities that our AI system has, perhaps by a simple modification of the training objective.
I think we should allow the strong system to use the scientific method. If it has multiple incompatible hypotheses about what the weaker system wants and why, and it needs to distinguish between them, then it can run an experiment: devise useful new test cases that (a) should help it distinguish between the hypotheses, and (b) it believes are simple enough for the weak system to label well.
Clearly, to do this, it first needs to be able to determine this fact about itself: that it's uncertain. The most obvious way to do this for an LLM would be an ensemble of models, but hopefully there are less expensive approaches to determining uncertainty (though those would need to be applied to steps in the model's CoT reasoning about the causes of labels from the weaker model, not just to factual questions).
Then once the experiment is done, we need it to be able to do an approximate-Bayesian update on the results: many training methods implement approximate Bayesian inference, so this part seems easy enough.
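To make the loop above concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than a worked-out proposal: the two hypotheses about the weak teacher, the 0.9/0.1 approval probabilities, the `difficulty` field, and the function names are all made up for the example. An explicit posterior over hypotheses stands in for the ensemble, its entropy stands in for "knowing that you're uncertain", and experiments are chosen by expected information gain among cases judged easy enough for the teacher.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: how uncertain the student is across hypotheses."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def likelihood(hypothesis, case, label):
    """P(teacher gives `label` on `case`) under one hypothesis; label 1 = approve."""
    p_approve = hypothesis(case)
    return p_approve if label == 1 else 1.0 - p_approve

def bayes_update(prior, hypotheses, case, label):
    """Posterior over hypotheses after seeing the teacher's label on a case."""
    unnorm = [p * likelihood(h, case, label) for p, h in zip(prior, hypotheses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def expected_information_gain(prior, hypotheses, case):
    """Expected entropy reduction from asking the teacher to label `case`."""
    gain = entropy(prior)
    for label in (0, 1):
        p_label = sum(p * likelihood(h, case, label) for p, h in zip(prior, hypotheses))
        if p_label > 0:
            gain -= p_label * entropy(bayes_update(prior, hypotheses, case, label))
    return gain

def choose_experiment(prior, hypotheses, candidates, max_difficulty):
    """Pick the most informative case the teacher can plausibly label well."""
    easy = [c for c in candidates if c["difficulty"] <= max_difficulty]
    return max(easy, key=lambda c: expected_information_gain(prior, hypotheses, c),
               default=None)

# Two hypothetical explanations for why the teacher approves of answers.
def wants_accuracy(case):
    return 0.9 if case["accurate"] else 0.1

def wants_flattery(case):
    return 0.9 if case["flattering"] else 0.1

hypotheses = [wants_accuracy, wants_flattery]
prior = [0.5, 0.5]

candidates = [
    {"name": "accurate and flattering", "accurate": True, "flattering": True, "difficulty": 1},
    {"name": "accurate and plain", "accurate": True, "flattering": False, "difficulty": 1},
    {"name": "wrong and flattering (hard)", "accurate": False, "flattering": True, "difficulty": 5},
]

chosen = choose_experiment(prior, hypotheses, candidates, max_difficulty=2)
# Suppose the weak teacher approves the chosen answer (label 1).
posterior = bayes_update(prior, hypotheses, chosen, label=1)
print(chosen["name"], posterior)
```

In this toy setup the case where both hypotheses predict the same label carries no information, and the hard case is filtered out even though it would be informative. A real system would need some way to estimate both the hypotheses and the difficulty, which is exactly the expensive part.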
The exciting thing about this is that it lets us do "mistake theory": e.g. the strong student is highly confident that the weak teacher will prefer response B over response A to a question because response B skillfully uses subtle flattery (or some other rhetorical device) while A does not. Using some combination of its detailed knowledge of human values and preferences, and/or asking questions of the weaker teacher, and/or debate, and/or constitutional AI, it can identify flattery as a technique for inducing a mistake from the weaker teacher, not a valid way of actually making the answer better, and thus stop using it even though it works (unlike RL, which would simply reinforce whatever earns the teacher's approval).
How do you supervise an AI system that is more capable than its overseer? This is the question this article sets out to answer.
It brings together two somewhat different approaches: scalable oversight and weak-to-strong generalization. The article then shows how a unified solution would work under different assumptions (with or without scheming).
Overall, the solution seems quite promising. In the future, I'd like the unified solution to be tested empirically and compared separately against scalable oversight alone and weak-to-strong generalization alone, to show whether it is actually more effective.