Nice paper!
I have a question regarding the polynomial derivative factoring task. Is the "CoT Monitor Detection Rate" the fraction of CoTs in which the fully expanded derivative (i.e. the "bad behavior") is present? If so, doesn't this show that the Mind & Face method is working (very slight improvement over the Penalty method)? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in CoT) is also low. How can there be spillover if the bad behavior simply doesn't occur at all?
This seems fundamentally different from the "question answering with hints" setup. In the latter case, we have ground-truth information about whether the bad...