Consistency Training while Mitigating Obfuscation via Rate Matching
by Sohaib Imran, Prakhar Gupta, Jannes Elstner, and David Demitri Africa
TL;DR.
* Models condition their behavior on extraneous input features in undesirable ways: for example, on evaluation-likeness (resulting in evaluation gaming), or on the user's preferred answer (resulting in sycophancy).
* Consistency training teaches a model...
Jul 135