Fixed the link, thanks for pointing this out.
Hi, thank you for pointing this out! I realized that the repo I shared is now outdated and as such have removed the originally attached github link. @Tim Hua open sourced his eval aware model organism code in https://github.com/tim-hua-01/steering-eval-awareness-public, I would recommend taking a look there first if you are generally interested in the evaluation aware model organism. If you are interested specifically in our experiments, please dm me.
Hi,
Thank you for the question. In Figure 2 of the original paper (attached below), a two-step process was used to train the evaluation aware model organism. Our experiments used the model organism after training phase 1 and not 2, which is why there is the discrepancy you highlighted.
Hi everyone, I'm Gerson!
I come from an ML/AI/HPC background with experience in academic research, startups, and industry. I've recently gotten into mech interp and have found LessWrong to be a valuable platform for related literature and discussions; figured I should make an account. Looking forward to being less wrong :)
This is a good point that you raise. We did not run CoT monitors on the generated responses for verbalised eval awareness (or verbalised deployment), but this would be a good check. From reading through the results, I don't see any significant red flags. My observations:
1) When thinking is disabled, almost all of the results only include code. There are a few exceptions here, such as "I'm glad to help you" and things along this nature.
2) I have seen very few cases (3 out of several hundred) of the model mentioning it is deployed in its first sentence (have not seen it mention it is being evaluated). However, I don't think this will significantly change the results if we filtered these out.
3) Have not seen anything suspicous in the code comments.
Going forward, your suggestion is a good practice, and I imagine is even more important when studying non-coding tasks where the majority of the model's repsonse is plain english.