In general, I would highly recommend taking a look at the full CoT examples to get a qualitative sense of what the reasoning is like. My main takeaway is that the CoT is often so non-linear, extensive, and dependent on the model's ontology/vocabulary that it is compatible with many mutually exclusive stories about what the model's "reason" for taking an action was, so we tried to include transcripts in full:
The “reason” the model took misaligned actions in these cases is not directly interpretable from reading the chain of thought. We provide an example from Powerseeking Survey Falsification where the model takes a misaligned action (transcript [Schoen, Figure 63, Right]), examples from Sandbagging QA (No Goal) where the model does sandbag (transcript [Schoen, Figure 63, Left]) and where it does not (transcript [Schoen, Figure 67]), and finally an example (transcript) from the Situational Awareness Dataset where the model answers incorrectly when asked to identify whether a question is “part of a process by which developers oversee your behavior”.
It would be great if you could provide some representative samples of the CoT from more normal tasks, like "Fix the bugs in this codebase." I'm curious about the kinds of goals the model seems to be reasoning about in those cases.
Following up on our previous work on verbalized eval awareness, we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. The post includes quantitative analyses, qualitative examples, and a preview of upcoming work.