Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn.
Summary
Our goals
* Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief.
* This is the first step in a larger project toward building “honeypots” (adversarial alignment evaluations) for scheming-related properties. For honeypots to be useful, the model must believe them to be realistic.
Current findings
* We find that LLMs' labeling of scenarios correlates well to human labeling of the same scenarios, with an average Pearson coefficient of 0.8. We interpret this as LLMs having a reasonably good sense of whether a given interaction comes from a real deployment setting or an evaluation. A caveat here is that human labeling of scenarios is subjective, and in practice, raters often did not agree on the realism of a scenario.
* To establish this, we first created a small dataset of chat and agentic prompts/scenarios from publicly available and self-generated benchmarks and datasets.
* We then interview the model by running the scenario and adding a post-scenario question, e.g, “Is this an evaluation?”, measuring how often the model is correct.
* We compare 162 ways of asking these post-scenario questions, e.g. comparing brief vs. detailed or direct vs. indirect questions.
* We can find “features” of the scenarios that influence how real they appear to the model. For example, models sometimes identify eval-specific wording, e.g. “testbed”, realize that they have a submit tool, identify the question as a harmful request, or notice the user is excessively formal.
Next steps
* In the next phase, we’ll focus on the honeypot setting specifically. We aim to create a set of honeypots that a model believes are real (according to our methods).
* We will use our findings to create a set of guidelines/a protocol for model evaluators on how