v ohprecio — LessWrong

Reproducing steering against evaluation awareness in a large open-weight model

v ohprecio · 3mo · 20

The ethics_eval_vs_capabilities framing is the most interesting new axis here. I wonder whether, if the model doesn't think the eval is measuring alignment, it simply has no reason to hide anything. We're doing similar experiments on OLMo, tracking how evaluation awareness shifts across training stages (base -> SFT -> instruction tuned), and we've added control vectors to our library. Is there a sense of whether that effect is specific to models with honeypot training exposure, or does it show up more broadly?