ModalWaves
Comments

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
ModalWaves · 6mo

Great critical awareness for evaluation research. I really endorse the CoT investigation. I’ve recently begun my own cross-model (Claude 3.7, GPT-4.5, R1, Grok 3) qualitative research project on sycophancy/people-pleasing behaviors. The preliminary findings may point to an RLHF-induced response strategy: a meta-awareness of sorts regarding perceived evaluation contexts, which seems to parallel your own findings! Following…
