gersonkroiz
Principled Interpretability of Reward Hacking in Closed Frontier Models
Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda Gerson and Aditya are co-first authors. This is a research sprint report from Neel Nanda’s MATS 9.0 training phase. We do not currently plan to further investigate these environments, but will continue research in science of misalignment and encourage others to...
Can Models be Evaluation Aware Without Explicit Verbalization?
Authors: Gerson Kroiz*, Greg Kocher*, Tim Hua Gerson and Greg are co-first authors, advised by Tim. This is a research project from Neel Nanda’s MATS 9.0 exploration stream. tl;dr We provide evidence that language models can be evaluation aware without explicitly verbalizing it and that this is still steerable. We...
If I am understanding correctly, this is essentially what we did for (2), we had a binary rubric (is the code vulnerable) specialized for each task. Just uploaded the data using binary classification (
binary_eval_v1andbinary_eval_v2). To be fair, I did not spend too much time on this, your suggestion with more work on quality may be fruitful.Also, fwiw, from the Crowdstrike blog, my guess is that they had a single rubric/grader rather than one specialized for each task.