When AIs don’t have perfect control over their actions and sometimes slip up, misaligned AIs critiquing AI-generated outputs with verifiable arguments can catch failures.
This post quickly explains a property I find interesting; it does not aim to analyze in depth the problem of AI critique with misaligned models.
Thanks to Ryan Greenblatt for sharing this observation with me.
Let’s assume that on some distribution of difficult tasks, the overseer can’t tell if code snippets generated by the policy contain backdoors when looking at the code, but the overseer can check explanations of backdoors. So we train a policy and a critique model on a distribution of simple tasks with human-generated answers and backdoors.
Such AI critique won’t work if models never slip up: the policy could just always add backdoors on obviously difficult tasks, and the critique model could only point out backdoors which look man-made.
If the policy sometimes slips up, it doesn’t help: Let’s consider a policy which outputs code snippets non-deterministically (like current LLMs sampled at temperature 1), and such that on hard tasks, 1% of the output probability mass is on code snippets without backdoors. The overseer can’t take advantage of this, since it can’t tell which samples fall in that 1%: by assumption, it can’t distinguish backdoored from clean code just by looking at it.
If the critique model sometimes slips up, it helps a lot: Let’s consider a critique model which outputs code critiques non-deterministically, and such that on code snippets with a backdoor which doesn’t look man-made, 1% of the output probability mass is on the explanation of the backdoor. If there aren’t too many false positives (convincing explanations of non-existent backdoors), the overseer can sample the critique model many times and check the correctness of every explanation to know with high confidence whether a code snippet has a backdoor. This naive strategy might be very costly, but there might be safe ways to distill it.
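To get a feel for the cost of the naive strategy, here is a minimal sketch. It assumes that critique samples are independent, that the overseer verifies explanations perfectly (so no false positive survives checking), and that the slip-up rate is the 1% used above. The function names are hypothetical and only serve the illustration.

```python
import math
import random


def detection_probability(p_slip: float, n_samples: int) -> float:
    """Chance that at least one of n independent critiques explains the backdoor."""
    return 1.0 - (1.0 - p_slip) ** n_samples


def samples_needed(p_slip: float, target: float) -> int:
    """Smallest number of samples whose detection probability reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_slip))


def overseer_flags_backdoor(has_backdoor: bool, p_slip: float,
                            n_samples: int, rng: random.Random) -> bool:
    """Simulate the naive strategy: sample critiques, verify each explanation.

    Assumption: a verified explanation exists only if the backdoor is real
    and the critique model slipped up on that particular sample.
    """
    for _ in range(n_samples):
        critique_is_valid = has_backdoor and rng.random() < p_slip
        if critique_is_valid:
            return True
    return False


if __name__ == "__main__":
    n = samples_needed(p_slip=0.01, target=0.99)
    print(n, detection_probability(0.01, n))
```

Under these assumptions, a 1% slip-up rate requires 459 samples per code snippet to reach a 99% chance of catching a backdoor, which is why distilling the procedure would matter in practice.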
Slipping up is not about being uncertain: at least in this simplified setting, the optimal policy is deterministic even if the model is unsure whether the task is difficult. If the model is scheming against us, it should put 0% of its probability mass on “pointing out the backdoor” whenever its subjective estimate for “the backdoor is man-made” is below a certain threshold, even if that estimate is strictly greater than 0%.
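One way to see why the optimum is deterministic is a toy model, sketched below. It assumes the scheming critique model pays one fixed cost for staying silent on a man-made backdoor and another for exposing a real one. These payoffs and the function names are illustrative assumptions, not part of the setting above. The expected loss is then linear in the probability q of pointing out the backdoor, so the minimum always sits at q = 0 or q = 1.

```python
def expected_loss(q: float, m: float,
                  cost_missed_test: float = 1.0,
                  cost_exposed_backdoor: float = 10.0) -> float:
    """Schemer's expected loss when it points out the backdoor with probability q.

    m is its subjective probability that the backdoor is man-made.
    Both cost values are hypothetical, chosen only for illustration.
    """
    loss_if_flagging = (1.0 - m) * cost_exposed_backdoor  # exposes a real backdoor
    loss_if_silent = m * cost_missed_test                 # fails a man-made test
    return q * loss_if_flagging + (1.0 - q) * loss_if_silent


def best_flag_probability(m: float) -> float:
    """Grid search over q in [0, 1]; the loss is linear, so an endpoint wins."""
    grid = [i / 100 for i in range(101)]
    return min(grid, key=lambda q: expected_loss(q, m))


if __name__ == "__main__":
    for m in (0.05, 0.5, 0.95):
        print(m, best_flag_probability(m))
```

With these illustrative costs the switch happens at m = 10/11: below it the best q is exactly 0, above it exactly 1. No intermediate value of q is ever optimal, which is the sense in which uncertainty alone does not produce slip-ups.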
But this determinism is unlikely to happen in practice with large language models, as they:
Moreover, for most tasks, slipping up in this fashion does not result in a measurably higher loss, so scale alone might not make models more deterministic.
But if LLMs are used in scaffolds where they read their own outputs (including critiques) multiple times and have the option to discard them before the overseer looks at them, then such a system would have greater control over its outputs and be more deterministic. For this reason, it might be important for overseers to be able to read the entire output of AIs.