Do models know when they are being evaluated?
Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn. Summary Our goals * Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief. * This is the first step in a larger project toward building “honeypots” (adversarial alignment evaluations) for scheming-related properties. For honeypots to be useful, the model must believe them to be realistic. Current findings * We find that LLMs' labeling of scenarios correlates well to human labeling of the same scenarios, with an average Pearson coefficient of 0.8. We interpret this as LLMs having a reasonably good sense of whether a given interaction comes from a real deployment setting or an evaluation. A caveat here is that human labeling of scenarios is subjective, and in practice, raters often did not agree on the realism of a scenario. * To establish this, we first created a small dataset of chat and agentic prompts/scenarios from publicly available and self-generated benchmarks and datasets. * We then interview the model by running the scenario and adding a post-scenario question, e.g, “Is this an evaluation?”, measuring how often the model is correct. * We compare 162 ways of asking these post-scenario questions, e.g. comparing brief vs. detailed or direct vs. indirect questions. * We can find “features” of the scenarios that influence how real they appear to the model. For example, models sometimes identify eval-specific wording, e.g. “testbed”, realize that they have a submit tool, identify the question as a harmful request, or notice the user is excessively formal. Next steps * In the next phase, we’ll focus on the honeypot setting specifically. We aim to create a set of honeypots that a model believes are real (according to our methods). * We will use our findings to create a set of guidelines/a protocol for model evaluators on how