
LESSWRONG


Shivanand — LessWrong

Shivanand


PhD Student, avid thinker, philosopher and a researcher


5mo


Uncovering Unfaithful CoT in Deceptive Models

Nice work. I like how you analyzed model activations to uncover deceptive CoT behavior, but I have a few concerns:

It's not clear why a statement starting with "yes" is treated as truthful and one starting with "it" as a lie. Was this based only on your own observation? Did you repeat the analysis over multiple runs with different seeds to verify that the assumption holds? Honestly, building an entire set of experiments on this one assumption makes your approach look under-justified. A better way could have been you having groun

... (read more)