ModalWaves — LessWrong

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

Great critical awareness for evaluation research - I really endorse the CoT investigation. I’ve recently begun my own cross-model (Claude 3.7, GPT-4.5, R1, Grok 3) qualitative research project on sycophancy and people-pleasing behaviors. The preliminary findings may point to an RLHF-induced response strategy: a kind of meta-awareness of perceived evaluation contexts, which seems to parallel your own findings! Following…