Authors: Gerson Kroiz*, Greg Kocher*, Tim Hua
Gerson and Greg are co-first authors, advised by Tim. This is a research project from Neel Nanda’s MATS 9.0 exploration stream.
tl;dr We provide evidence that language models can be evaluation aware without explicitly verbalizing it and that this is still steerable. We also explore various techniques to localize when the model becomes evaluation aware.
Introduction
What is Evaluation Awareness and why does it Matter?
Evaluation awareness is when a language model is aware that is it is being evaluated or tested. Evaluation awareness is important for interpretability research because models that are evaluation aware may act differently, which then makes it difficult to understand their true behavior in production.
Studying
... (read 2228 more words →)