Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty
Cross-posted from our recent paper, "But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors": https://arxiv.org/abs/2505.17760

Code available at: https://github.com/watermeleon/judge_with_steered_response

TL;DR: We use steering vectors to generate more honest versions of an LLM response, helping LLM judges detect subtle forms of dishonesty like sycophancy and...