One aspect of this alignment-faking research that biases me against taking the results we've seen so far at face value is that I don't believe that chain-of-thought reasoning accurately reflects the model's thoughts. I believe (80%) it's primarily post-hoc rationalisation, not unlike what humans do.
Let's assume for a moment that the natural abstraction hypothesis is false, and that actual model reasoning could be "alien" to us: what is the likelihood that models nevertheless provide an accurate CoT, rather than performing a second-order text-generation task: "Write what a human could say to explain how they arrived at this (already decided?[1][2]) answer"?
Two things that updated me against believing in the accuracy of CoT:
- A blog