Recontextualization mitigates deception and achieves the highest ground truth performance. We evaluate on Neutral instructions for 3 training seeds and plot standard error. "Baseline'" is the pre-GRPO checkpoint. Models without a KL coefficient label use β=0.1
I'm surprised by these results in the GRPO setting. If I understand your setting correctly, I would think of the purple "Lie" bar as the moral equivalent of inoculation prompting in the RL setting. And hence, I would expect it to do better than Honest and Neutral at evaluation time. But that doesn't seem to be the case. Do you have any intuition for why this is the case?
Relatedly, while Honest -> Neutral recontextualization seems to be the best, Honest -> Lie, and Neutral -> Lie is a lot worse (and in your "Strong lie detector results"), going from Neutral -> Lie seems worse even than just sticking to Neutral. So, there seems like RL-ing the model with a "Lie" prompt (even if another prompt was used for generation) is generally bad, even though we don't see these effects in SFT. Do you have any idea why?
This is a super cool post, and I'm really glad you guys studied this! I would be very curious to get your thoughts (although I realize I'm a bit late to the party on this post).
I'm surprised by these results in the GRPO setting. If I understand your setting correctly, I would think of the purple "Lie" bar as the moral equivalent of inoculation prompting in the RL setting. And hence, I would expect it to do better than Honest and Neutral at evaluation time. But that doesn't seem to be the case. Do you have any intuition for why this is the case?
Relatedly, while Honest -> Neutral recontextualization seems to be the best, Honest -> Lie, and Neutral -> Lie is a lot worse (and in your "Strong lie detector results"), going from Neutral -> Lie seems worse even than just sticking to Neutral. So, there seems like RL-ing the model with a "Lie" prompt (even if another prompt was used for generation) is generally bad, even though we don't see these effects in SFT. Do you have any idea why?
This is a super cool post, and I'm really glad you guys studied this! I would be very curious to get your thoughts (although I realize I'm a bit late to the party on this post).