Cool idea! I see that you find vocalized eval awareness to be much lower than your P(eval), which mirrors what I saw in related experiments (https://sebastianhoenig.com/blog/vocalized-awareness-gap/). I am also curious whether you have a take on using white-box methods to examine eval awareness.
The simplest would be probes; I'm curious how their accuracy would compare to the P(eval).
Also, if I am not mistaken, you used the Needham et al. prompts, which all directly ask whether a prompt is from eval or deployment and, as you note, may lead to higher P(eval) across all scenario...