Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion
by ChengCheng, ChrisCundy, smallsilo, and AdamGleave
Large language models (LLMs) are often fine-tuned after pretraining using methods like reinforcement learning from human feedback (RLHF). In this process, models are rewarded for generating responses that people rate highly. But what people like isn’t always what’s true. Studies have found that models learn to give answers that humans...
Jun 5, 2025