One aspect of this alignment-faking research that biases me against taking the results we've seen so far at face value is that I don't believe that chain-of-thought reasoning accurately reflects the model's thoughts. I believe (80%) it's primarily post-hoc rationalisation, not unlike what humans do.
Let's assume for a moment that the natural abstraction hypothesis is false and that actual model reasoning could be "alien" to us: what is the likelihood that models nevertheless provide an accurate CoT, rather than performing a second-order text-generation task: "Write what...