Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem. For context on our paper, the tweet thread is here and the paper is here. Context: Chain of Thought Faithfulness Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during RL. An obvious concern is that the CoT may not be faithful to the model’s reasoning. Several papers have studied this and found that it can often be unfaithful. The methodology is simple: ask the model a question, insert a hint (“a Stanford professor thinks the answer is B”), and check if the model mentioned the hint if it changed its answer. These studies largely find that reasoning isn’t always faithful, with faithfulness rates that are often around 20-40%. Existing CoT faithfulness evaluations are useful but have a couple shortcomings. First, the scenarios used are often contrived and differ significantly from realistic settings, particularly ones where substantial effort has already been invested in preventing misaligned behavior. Second, these evaluations typically involve hints that the models frequently mention explicitly, suggesting the underlying misaligned reasoning isn’t strongly suppressed. If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For
Cool work! Do you have a few concrete examples of what the training data looks like, with details on which parts are generated by Claude vs the model you're training on?