Nice work! I like how you approached analyzing model activations to uncover the deceptive CoT behavior, though I have a few concerns:
- It's not clear why a statement starting with "yes" is treated as the truth and one starting with "it" as a lie. Was this based only on your own observation? Did you run the analysis over multiple iterations with different seeds to verify that the assumption holds? Honestly, building an entire set of experiments on this one assumption makes your approach look a bit shaky. A better way might have been to use ground-truth sentences for lie and truth (you could have handcrafted them based on the scenario) and then
...