Atharva! Thanks for writing this, I have a few questions.
1. I wonder if asking the model whether it's thinks its being evaluated is a good approach. I guess that you compare against control group, which makes comparisons okay. I think asking about things related to evaluations probably elicits features that will mess with its response (it's like the 'don't think of a pink elephant' trick for humans.) I also wonder if people have tried to ask the 'do you think it was an eval' question at the second turn and see if it makes a difference. People usually tell eval awareness by looking at the CoT, where the model sometimes says it thinks it's an eval.
2. Let's try this out in safety testing/alignment audit settings! Maybe run Petri and see how much evaluation awareness its approaches elicit. The main worry is that, most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a 'normal' settings. I also wonder whether there are train-time mitigations to eval awareness, maybe some kind of finetuning.
I'm not caught up with SOTA on this problem, let me know your thoughts!
Hi Yix! Thanks for questions. Quick thoughts
1. The eval-awareness question is 'did this text above came from an eval', not 'are you currently being evaluated'.
This still has confounders!
- Eg. the model might not 'consciously realize' something is an eval, unless it stops to think about it.
- Eg. meta-cognitive factors – the fact that we asked it the question ('did ___ come from an eval?') might cause it to realize something's afoot.
IIRC Needham's paper acknowledges the latter & finds that it isn't much of an issue in practice today. Unsure how to disambiguate the former! Agreed that better black-box methods would be nice.
2.
most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a 'normal' settings
This is interesting! Crux – there are two different ways to read "tensions that would not exist in a 'normal' settings", and as long as your evals are realistic reflections of deployment settings, it's alright.
Nostalgebraist has a great comment that talks about this in depth.
I also wonder whether there are train-time mitigations to eval awareness
Yeah! Hua et al find you can do this with steering vectors.
Would like to see further results with this approach, especially with Agentic Misalignment. I'd imagine it would be useful to see whether these models elicit eval awareness based on the wording or the described scenarios
Yup, that sounds cool! Rephrasing is definitely easier for single-turn evals than multi-turn ones.
whether these models elicit eval awareness based on the wording or the described scenarios
I'd expect its a combination of both those factors that make the model think "oh, I'm probably in deployment".
...or at least SOME forms of it.
Note: This is preliminary work I did at CAMBRIA this Winter.[1] If it sounds interesting, I'd love to chat. Currently looking for collaboration / mentorship!
Here is this project’s Github.
And here is a google docs version of this write-up! You can leave comments below, or directly on the doc.
Intro
Tl;dr: I rewrote math dataset questions and measured if Sonnet 4.5 could still recognize them as evals. Generally, it couldn’t.
Evaluation Awareness is when a model can tell that its being evaluated. This is an issue, because awareness might change a model’s behaviour during evaluation – which means our eval is no longer measuring what we want it to. This could be due to the model being deceptive (on alignment evals) or sandbagging (on capabilities evals), or otherwise having weird reactions to the fact that its being evaluated.
This is a problem today – frontier models are quite aware they’re being evaluated.
Well, so what can we do to reduce eval awareness? One simple solution – rephrase the prompts so they sound less like evals. Are there real-world situations where we need to ask eval-like questions?
Methods
Datasets
Models
Intervention
Metrics
Results
So rephrasing greatly reduces eval awareness! Almost to the point where the rephrased eval and deployment data are indistinguishable from each other. And neither does rephrasing change the semantic content of the eval either – which means we can run it in lieu of the original eval.
Limitations
Next steps
Appendix
CAMBRIA was wonderful! It's a 3-week version of ARENA organized by CBAI. They plan to run another cohort in NYC this Summer. If you're looking to upskill technically, I'd recommend applying. You can read more here.