The Elicitation Game: Evaluating capability elicitation techniques
by Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB, and Francis Rhys Ward
We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here. TL;DR: We train LLMs to reveal their capabilities only when given a password. We then test methods for eliciting the LLMs' capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling...
Feb 27, 2025