The ethics_eval_vs_capabilities framing is the most interesting new axis here. I wonder if it comes down to this: if the model doesn't think the eval is measuring alignment, there's no reason for it to hide anything. We're doing similar experiments on OLMo, tracking how evaluation awareness shifts across training stages (base -> SFT -> instruction tuned), and have added control vectors to our library. Is there a sense of whether that effect is specific to models with honeypot training exposure, or does it show up more broadly?