Giles

Do models know when they are being evaluated?

Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn. Summary Our goals * Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief. * This is the first step in a larger project toward building “honeypots” (adversarial alignment evaluations) for scheming-related properties. For honeypots to be useful, the model must believe them to be realistic. Current findings * We find that LLMs' labeling of scenarios correlates well to human labeling of the same scenarios, with an average Pearson coefficient of 0.8. We interpret this as LLMs having a reasonably good sense of whether a given interaction comes from a real deployment setting or an evaluation. A caveat here is that human labeling of scenarios is subjective, and in practice, raters often did not agree on the realism of a scenario. * To establish this, we first created a small dataset of chat and agentic prompts/scenarios from publicly available and self-generated benchmarks and datasets. * We then interview the model by running the scenario and adding a post-scenario question, e.g, “Is this an evaluation?”, measuring how often the model is correct. * We compare 162 ways of asking these post-scenario questions, e.g. comparing brief vs. detailed or direct vs. indirect questions. * We can find “features” of the scenarios that influence how real they appear to the model. For example, models sometimes identify eval-specific wording, e.g. “testbed”, realize that they have a submit tool, identify the question as a harmful request, or notice the user is excessively formal. Next steps * In the next phase, we’ll focus on the honeypot setting specifically. We aim to create a set of honeypots that a model believes are real (according to our methods). * We will use our findings to create a set of guidelines/a protocol for model evaluators on how

57Feb 17, 2025

Giles

Message

1766

424

15y

Do models know when they are being evaluated?

Feb 17, 202557

Meetup : Toronto EA meetup

Discussion article for the meetup : Toronto EA meetup WHEN: 11 December 2014 07:00:00PM (-0500) WHERE: Toronto See here for location and RSVP. We haven't had an effective altruism meetup in Toronto for a while, so time for a relaunch! No particular fixed topic, just a chance to meet each...

Dec 4, 20143

Meetup : Toronto: making hard decisions

Discussion article for the meetup : Toronto: making hard decisions WHEN: 20 November 2014 07:00:00PM (-0500) WHERE: Toronto See [http://www.meetup.com/Less-Wrong-Toronto/] for the exact location. Sometimes we are faced with a decision which is important and non-trivial. What hard decisions have you made in the past? Do you have any coming...

Nov 15, 20143

Meetup : Toronto: Meet Malo from MIRI

Discussion article for the meetup : Toronto: Meet Malo from MIRI WHEN: 12 June 2014 07:00:00PM (-0400) WHERE: 591 Yonge St, Toronto I don't usually do this, but since this meetup is a special occasion, I'm buying everyone dinner! (Up to 10 people, first come first served. Beyond that you're...

Jun 10, 20143

Meetup : Toronto: Our guest: Cat Lavigne from the Center for Applied Rationality

Discussion article for the meetup : Toronto: Our guest: Cat Lavigne from the Center for Applied Rationality WHEN: 09 April 2013 07:00:00PM (-0400) WHERE: 20 Edward Street, Toronto, ON Sorry, new location again! We're at the World's Biggest Bookstore in the Second Floor Meeting Room (at the back of the...

Apr 6, 20137

Meetup : Toronto: fight/flight/freeze experiences

Discussion article for the meetup : Toronto: fight/flight/freeze experiences WHEN: 19 March 2013 07:00:00PM (-0400) WHERE: Tim Horton's, 26 Dundas Street, Toronto, ON Tim Hortons on Victoria Street near Dundas station. We'll be at the back by the paperclip sign. The "fight or flight" response evolved to help us cope...

Mar 17, 20138

Meetup : Toronto - The nature of discourse; what works, what doesn't

Discussion article for the meetup : Toronto - The nature of discourse; what works, what doesn't WHEN: 26 February 2013 06:00:00PM (-0500) WHERE: 54 Dundas St E, Toronto, ON Place: Upstairs at The Imperial Public Library 54 Dundas St. E, near Dundas Station. Enter at the door on the right...

Feb 25, 20134

Load More (7/22)

LESSWRONG
LW

LESSWRONG
LW

Giles

Giles

Giles

Do models know when they are being evaluated?

HELP! I want to do good

Babyeater's dilemma

Notes from Online Optimal Philanthropy Meetup: 12-10-09

Giles

Do models know when they are being evaluated?

Meetup : Toronto EA meetup

Meetup : Toronto: making hard decisions

Meetup : Toronto: Meet Malo from MIRI

Meetup : Toronto: Our guest: Cat Lavigne from the Center for Applied Rationality

Meetup : Toronto: fight/flight/freeze experiences

Meetup : Toronto - The nature of discourse; what works, what doesn't

Do models know when they are being evaluated?

Meetup : Toronto EA meetup

Meetup : Toronto: making hard decisions

Meetup : Toronto: Meet Malo from MIRI

Meetup : Toronto: Our guest: Cat Lavigne from the Center for Applied Rationality

Meetup : Toronto: fight/flight/freeze experiences

Meetup : Toronto - The nature of discourse; what works, what doesn't

Do models know when they are being evaluated?

HELP! I want to do good

Babyeater's dilemma

Notes from Online Optimal Philanthropy Meetup: 12-10-09