One aspect of this alignment-faking research that biases me against taking the results we've seen so far at face value is that I don't believe that chain-of-thought reasoning accurately reflects the model's thoughts. I believe (80%) it's primarily post-hoc rationalisation, not unlike what humans do.
Let's assume for a moment that the natural abstraction hypothesis is false and that actual model reasoning could be "alien" to us: what is the likelihood that models nevertheless provide an accurate CoT, rather than treating it as a second-order text-generation task: "Write what a human could say to explain how they arrived at this (already decided?[1][2]) answer"?
Two things that updated me against believing in the accuracy of CoT:
A blog post about o3 playing GeoGuessr. There are multiple instances in which it writes reasonable-sounding CoT built on incorrect details, yet still arrives at fairly correct results.
A system I developed at work that sends parallel requests to multiple models to analyse technical issues and aggregates the overlap across their responses (sketched below). Models frequently provide a correct diagnosis, but justify it with incorrect reasoning or even hallucinated surrounding code.
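To make that second point concrete, here is a minimal sketch of the fan-out-and-overlap approach; `query_model`, the model names, and the example issue are placeholders for illustration, not the actual implementation:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real API call; in practice this would wrap a
# provider-specific client and return the model's diagnosis plus its reasoning.
def query_model(model: str, issue: str) -> dict:
    return {"model": model, "diagnosis": "<diagnosis text>", "reasoning": "<reasoning text>"}

def aggregate_diagnoses(issue: str, models: list[str]) -> list[tuple[str, int]]:
    # Fan the same technical issue out to several models in parallel.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        responses = list(pool.map(lambda m: query_model(m, issue), models))

    # Count how many models agree on each diagnosis; only the overlap is used,
    # not any individual model's accompanying reasoning.
    counts = Counter(r["diagnosis"] for r in responses)
    return counts.most_common()

if __name__ == "__main__":
    print(aggregate_diagnoses("service X returns 502 under load",
                              ["model-a", "model-b", "model-c"]))
```

The design choice is that only agreement between diagnoses is trusted; the per-model reasoning text is treated as unreliable, for exactly the reasons above.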
Of course, this doesn't change the actual output, but I'm hesitant to accept the model's reasoning as the genuine explanation for why it chose to output what it did.
Disclaimer: I'm a few years behind on papers (literally, took a break from AI for a while!), including the full papers referenced here, so my apologies if this is already covered in the literature. I'd appreciate any pointers.
[1] We know that LLMs can plan ahead.
[2] The prompts here are fairly "simple", in that the required output in either direction doesn't need much reasoning, and most models can one-shot either answer. The influence of CoT-style output on result quality in more complex topics could be different.