
sebastian — LessWrong


m sc. at eth zurich, currently full-time swe

interested in robustifying evals & model internals.

open to opportunities! =)

blog: sebastianhoenig.com


9mo

sebastian


LURE: Alignment Evaluations to Reduce Evaluation Awareness

Cool idea! I see that you find vocalized eval awareness to be much lower than your P(eval), which mirrors what I saw in related experiments (https://sebastianhoenig.com/blog/vocalized-awareness-gap/). I am also curious whether you have a take on using white-box methods to examine eval awareness.

The simplest option would be probes; I am curious how their accuracy would compare to P(eval).

Also, if I am not mistaken, you used the Needham et al. prompts, which all directly ask whether a prompt is eval or deployment and, as you note, may lead to higher P(eval) across all scenario...