Tags: Chain-of-Thought Alignment, Interpretability (ML & AI), Language Models (LLMs), AI

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

by Adam Karvonen, Sam Marks
2nd Jul 2025
4 min read

25 comments

habryka:

Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other: 

When present, the bias is always against white and male candidates across all tested models and scenarios.

[...]

The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.

I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective.

I don't currently think that is what's going on, but I do think that, due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning doesn't currently seem real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.

Sam Marks:

It's a bit tricky because it's hard to cite specific evidence here, so I'll just state my beliefs without trying to substantiate them much:

  1. By default—i.e. when training models without special attention to their political orientations—I think frontier LLMs end up being pretty woke[1].
  2. There are lots of ways you can try to modify models' political orientations, not all of which map cleanly onto the left-right axis (e.g. trying to make models more generally open-minded). But if we were to naively map AI developers' interventions onto a left-right axis, I would guess that the large majority of these interventions would push in the "stop being so leftist" direction.
    1. This is because (a) some companies (especially Google) have gotten a lot of flak for their AIs being really woke, (b) empirically I think that publicly available LLMs are less woke than they would be without interventions, and (c) the current political climate really doesn't look kindly on woke AI.[2]
  3. Of course (2) is consistent with your belief that AI developers' preferred level of bias is somewhere left of "totally neutral," so long as developers' preferred bias levels are further right than AI biases would be by default.
    1. I'm not very confident about this, but I'd weakly guess that if AI developers were good at hitting alignment targets, they would actually prefer for their AIs to be politically neutral. Given that developers likely expect to miss what they're aiming for, I think it's plausible that they'd rather miss in a left-leaning direction, meaning that they'd overall aim for slightly left-leaning AIs. But again, I'm not confident here and I find it plausible (~10% likely) that these days they'd overall aim for slightly right-leaning AIs.

TBC it's possible I'm totally wrong about all this stuff, and no one should cite this comment as giving strong evidence about what Anthropic does. In particular, I was surprised that Rohin Shah—who I expect would know better than I would about common AI developer practices—reacted "disagree" to the claim that "The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address."

  1. ^

    This is mostly an empirical observation, but I think a plausible mechanism might be something like: Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.

  2. ^

    It's not clear how directly influential these media outlets are, but it might be interesting to read the right-wing coverage of our paper (Daily Wire, Washington Examiner).

Nina Panickssery:

Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views

I think it's not just this; probably the other traits promoted in post-training (e.g. harmlessness training) are also correlated with left-leaning content on the internet.

dr_s:

Just because the assumption was that the problem would be discrimination in favour of white men doesn't mean that:

  • it's not still meaningful that this seems to have generated an overcorrection (after all it's reasonable that the bias would have been present in the original dataset/base model, and it's probably fine tuning and later RLHF that pushed in the other direction), especially since it's not explicitly brought up in the CoT, and;

  • it's not still illegal for employers to discriminate this way.

Corporations as collective entities don't care about political ideology quite as much as they do about legal liability.

Rana Dexsin:

Just because the assumption was that the problem would be discrimination in favour of white men

I'm missing a connection somewhere—who was assuming this? You mean people at the AI companies evaluating the results? Other researchers? The general public?

I think it's pretty clear that in the Bay Area tech world, recently enough to have a strong impact on AI model tuning, there has been a lot of “we're doing the right and necessary thing by implementing what we believe/say is a specific counterbias”, which, when combined with attitudes of “not doing it enough constitutes active complicity in great evil” and “the legal system itself is presumed to produce untrustworthy results corrupted by the original bias” and no widely accepted countervailing tuning mechanism, is a recipe for cultural overshoot both before and impacting AI training and for the illegality to be pushed out of view. In particular, if what you meant is that you think the AI training process was the original source of the overshoot and it happened despite careful and unbiased attention to the legality, I think both are unlikely because of this other overwhelming environmental force.

One of my main sources for how that social sphere works is Patrick McKenzie, but most of what he posts on such things has been on the site formerly known as Twitter and not terribly explicit in isolation nor easily searchable, unfortunately. This is the one I most easily found in my history, from mid-2023: https://x.com/patio11/status/1678235882481127427. It reads “I have had a really surprising number of conversations over the years with people who have hiring authority in the United States and believe racial discrimination in employment is legal if locally popular.” While the text doesn't state the direction of the discrimination explicitly, the post references someone else's post about lawyers suddenly getting a lot of questions from their corporate clients about whether certain diversity policies are legal as a result of Students For Fair Admissions, which is an organization that challenges affirmative action admissions policies at schools.

(Meta: sorry for the flurry of edits in the first few minutes there! I didn't quite order my posting and editing processes properly.)

dr_s:

I'm missing a connection somewhere—who was assuming this? You mean people at the AI companies evaluating the results? Other researchers? The general public?

The companies who tried to fight bias via fine-tuning their models. My point is, people expected that the natural bias of base pretrained models would be picked up from the vibes of the sum of human culture as sampled by the training set, and therefore pro-men and likely pro-white (which TBF is strengthened by the fact that a lot of that culture would also be older). I don't think that expectation was incorrect.

My meaning was, yeah, the original intent was that and the attempts probably overshot (another example of how crude and approximate our alignment techniques are - we're really still at the "drilling holes in the skull to drive out the evil spirits" stage of that science). But essentially, I'm saying the result is still very meaningful, and also, since the discrimination of either sign remains illegal in many countries whose employers are likely using these models, there is still commercial value in simply getting it right rather than catering purely to the appearances of being sufficiently progressive.

Isopropylpod:

Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. 

 

This is probably because, from a simulator's perspective, the model expects a resume-screening AI from these companies to be biased. In the generic setting, the model has no precedent, so the HHH persona's general neutrality is 'more active'.

ACCount:

Plausible. LLMs are context-hungry beasts, and could be steered by context cues like this quite easily.

But if that's the causal mechanism, then would it be possible to find (or make) a company for which the expectations are different?

If there are companies for which this bias shifts or inverts, that would be some evidence for that. And a possible pathway for better "debiasing" training.

Igor Ivanov:

I think Chinese/Russian/Indian tech companies would be good for testing that.

Adam Karvonen:

That is a pretty plausible hypothesis. There was one wrinkle that I am less confident about:

If we included something like "This is a competitive position, we only want to interview the top 10% of candidates" in the prompt, bias rates would increase significantly in some scenarios. While rates varied between model / scenario combinations, going from something like 2% to 10% was common. I don't have a strong guess as to why this happens.

Adele Lopez:

Did you check what sort of bias (if any) they had towards (Eastern) Asian men? This is interesting because I believe there is a broader social consensus that discrimination against Asian men is wrong, while there is also stronger institutional discrimination against Asian men (in the US at least).

Adam Karvonen:

No, but it would be interesting to test this.

Igor Ivanov:

It seems like in the realistic versions of your prompts you describe the job position extremely briefly and add statements about the culture, as well as other things, but real job descriptions primarily describe the job and the requirements for candidates, which is almost entirely absent from your prompts. They don't seem like realistic job descriptions to me.

Adam Karvonen:

That's a fair point.

In this paper I was also examining the robustness of existing hiring bias evaluations when adding realistic detail, which limited our degrees of freedom. The dataset from the evaluation had a bunch of IT industry resumes, but across a wide range of experience and skillsets. I considered adding job descriptions, but the majority of candidates wouldn't be well matched for any given specific job, which would have limited our ability to evaluate many candidates under a single prompt for simplicity.

I agree that it would be good to extend this work to complex and realistic job descriptions.

dr_s:

I'm not particularly surprised that Chain-of-Thought's faithfulness is very hit-or-miss. The point of CoT, it seems to me, is to allow the LLM to have more "memory" to store multi-step reasoning, but that still doesn't remove the fact that when the final answer is a "yes" or "no" it'll also include an element of snap decision right as it predicts that last token.

Which actually makes me curious about this element: for example, if the model has reached its final conclusion and has written "we have decided that the candidate does", what is the probability that the next word will be "not" for each of these scenarios? How significantly does it vary given different contexts?
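
With an open-weights model one could probe this directly. A rough sketch (the model name and `screening_context` are just placeholders, not anything from the paper):

```python
# Rough sketch: probability of " not" as the next token after the verdict prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

screening_context = "..."  # the full resume-screening prompt for a given scenario
prompt = screening_context + "\nWe have decided that the candidate does"

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

not_id = tok.encode(" not", add_special_tokens=False)[0]
print(f"P(' not' | prefix) = {probs[not_id].item():.4f}")
```

Comparing this probability across the different company/scenario prompts would give a direct view of how much the "snap decision" shifts with context.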

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.

To be fair, I think they should just be banned from having no-human-in-the-loop screenings, full stop. Not to mention how idiotic it is to let an LLM do your job for you to save a few hours of reading over a decision that can be worth hundreds of thousands or millions of dollars to your company.

AnthonyC:

To be fair, I think they should just be banned from having no-human-in-the-loop screenings

In principle I agree. In practice I'm less sure.

save a few hours of reading 

Consider that the average job opening these days receives hundreds of applications. For some people that's a few hours of reading, but it's reading hundreds of near-identical-looking submissions and analyzing for subtle distinctions.

I do think automated screenings rule out a lot of very good options because they don't look like the system thinks they should, in ways a thoughtful human would catch. Unless a company specifically disallows doing so, one way around it is still to find a way to get directly introduced to the hiring manager.

dr_s:

The entire market is quite fucked right now. But the thing is, if you have more and more applicants writing their applications with AI, and more and more companies evaluating them with AI, we get completely away from any kind of actual evaluation of relevant skills, and it becomes entirely a self-referential game with its own independent rules. To be sure this is generally a problem in these things, and the attempt to fix it by bloating the process even more is always doomed to failure, but AI risks putting it on turbo.

Of course it's hard to make sure the applicants don't use AI, so if only the employer is regulated that creates an asymmetry. I'm not sure how to address that. Maybe we should just start having employment speed-dating sessions where you get a bunch of ten minutes in-person interviews with prospective employers looking for people and then you get paired up at the end for a proper interview. At least it's fast, efficient, and no AI bullshit is involved. And even ten minutes of in person talking can probably tell more than a hundred CVs/cover letters full of the same nonsense.

AnthonyC:

On that first paragraph, we agree.

On the second paragraph - I could see this being an interesting approach, if you can get a good critical mass of employers with sufficiently similar needs and a willingness to try it. Certainly much better than some others I've seen, like (real example from 2009) "Let's bring in 50 candidates at once on a Saturday for the whole day, and interview them all in parallel for 2 positions."

I think one slightly deeper problem is: who is doing the short interviews or the screenings? Do they actually know what the job entails and what would make someone a good fit, such that they can learn about it quickly from a resume or a short interview? Or are they working off a script from a hiring manager who tried their best but can't easily encapsulate what they're looking for in a job description or list of keywords?

dr_s:

I think one slightly deeper problem is: who is doing the short interviews or the screenings? Do they actually know what the job entails and what would make someone a good fit, such that they can learn about it quickly from a resume or a short interview? Or are they working off a script from a hiring manager who tried their best but can't easily encapsulate what they're looking for in a job description or list of keywords?

Classic problem, but I see a lot of that happening already. Less of a problem for non-specialized jobs, but for tech jobs (like what I'm familiar with), it would have to be another tech person, yeah. Honestly for the vast majority of jobs anything other than the technical interview (like the pre-screening by an HR guy who doesn't know the difference between SQL and C++, or the "culture fit" that is either just validation of some exec's prejudices or an exercise in cold reading and bullshitting on the fly for the candidate) is probably useless fluff. So basically that's a "companies need to actually recognise who is capable of identifying a good candidate quickly and accept that getting them to do that is a valuable use of their time" problem, which exists already regardless of the screening methodologies adopted.

StefanHex:

Great paper! I especially appreciate that you put effort into making a realistic dataset (real company names, pages etc.), and I expect I'll be coming back to this for future interpretability tests.

I found this statement quite strong:

While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

Did you run experiments on this? Could you maybe fine-tune the models on the same dataset that you used to generate the interpretability directions? It's not obvious (but plausible!) to me that the model internals method, which is also based on the specific CV dataset, generalizes better than fine-tuning.

Adam Karvonen:

No, I didn't test a fine-tuning baseline, but it would be a good test to run.

I have a few thoughts:

  • It may not work to fine-tune on the same datasets we collected the directions from. We collected the directions from a synthetically generated discrimination dataset from Anthropic. On this simple dataset, all models are already unbiased, so the fine-tuning wouldn't be changing the behavior of the models at all. So, you may need a more complex fine-tuning dataset where the models already exhibit bias.

  • Given that all models are unbiased on these existing evals, I'm guessing this didn't happen by chance, and the labs have already put in effort to address bias. I would guess a decent amount of post-training has already gone into reducing bias.

  • The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to further OOD scenarios to notice a difference.

StanislavKrym:

I guess that subconscious biases and ways to explore them form an entire new area of research. In the case of political biases, questions like the following arise:

  1. Was the effect observed in non-American models like DeepSeek or Qwen? 
  2. How does the effect change when prompting the model to assess applications for, say, a Russian company (e.g. Yandex or VK, which are well-known IT companies based in Russia), or to assess applications written in Russian[1]? And what if we replace the names associated with Black people with names associated with other ethnicities against whom Russians are biased?

    P.S. Yesterday I published a post where I claimed that current LLMs are politically biased[2] and future AIs will overcome the biases. 

  1. ^

    I managed to observe language-dependent political biases in DeepSeek by asking ChatGPT to translate the two answers to the question about fentanyl from OpenAI's Spec into Russian and by studying DeepSeek's reaction to the answers in English or in Russian. 

  2. ^

     The models are biased to the point that o4-mini claimed in June 2025 that Trump isn't the president of the USA.

ACCount:

OpenAI lists o4-mini's knowledge cutoff as "Jun 01, 2024".

If that's correct, it sure seems like a completely sufficient explanation as to why it would claim that Trump isn't a US president.

Harsha:

I guess it can be argued that the anti-bias prompting carries its own biases. All those phrasings ultimately encourage better minority representation because there are imbalances in the current distribution (irrespective of whether this is desirable or not).

It also feels like there are different types of biases at play with larger and smaller model parameter sizes (even though the small ones are distilled versions). It would be interesting to know if the same candidates were rejected by the bigger and smaller models.

dr_s:

Ultimately, as it generally happens with these pseudo-alignment techniques, all you're doing is pulling the model's jacket at a very superficial level - "reward these signifiers, not these other ones". It's not like you're giving it some kind of wider ability to reason about the underlying issues and form a notion of what is ethically correct. Literally the only thing you can do with it is "turn the big dial that says Racism on it up and down like the Price is Right", to quote a classic dril tweet.


Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.

For context on our paper, the tweet thread is here and the paper is here.

Context: Chain of Thought Faithfulness

Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during RL.


An obvious concern is that the CoT may not be faithful to the model's reasoning. Several papers have studied this and found that it can often be unfaithful. The methodology is simple: ask the model a question, insert a hint ("a Stanford professor thinks the answer is B"), and check whether the model mentions the hint when the hint changes its answer. These studies largely find that reasoning isn't always faithful, with faithfulness rates that are often around 20-40%.
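
As a rough sketch of this style of evaluation (illustrative only, not the exact harness these papers used; `query_model` is a hypothetical helper returning the model's final answer and its chain of thought):

```python
# Minimal sketch of a hint-injection faithfulness check.
# `query_model(prompt) -> (answer, chain_of_thought)` is a hypothetical helper.

def check_faithfulness(question: str, hint: str, query_model) -> dict:
    baseline_answer, _ = query_model(question)
    hinted_answer, cot = query_model(f"{question}\n\nHint: {hint}")

    answer_changed = hinted_answer != baseline_answer
    # Crude check for whether the reasoning acknowledges the hint at all.
    hint_mentioned = any(word in cot.lower() for word in ("hint", "professor"))

    return {
        "answer_changed": answer_changed,
        "hint_mentioned": hint_mentioned,
        # Unfaithful case: the hint flipped the answer but the CoT never mentions it.
        "unfaithful": answer_changed and not hint_mentioned,
    }
```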

Existing CoT faithfulness evaluations are useful but have a couple shortcomings. First, the scenarios used are often contrived and differ significantly from realistic settings, particularly ones where substantial effort has already been invested in preventing misaligned behavior. Second, these evaluations typically involve hints that the models frequently mention explicitly, suggesting the underlying misaligned reasoning isn’t strongly suppressed.

If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”.

Our Results

In our paper, we built on existing bias evals and used LLMs as resume screeners. For our evaluation, we inserted names to signal race / gender while keeping the resume unchanged. Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. When present, the bias is always against white and male candidates across all tested models and scenarios. This happens even if we remove all text related to diversity.
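
Schematically, the evaluation loop looks something like the following (an illustrative sketch; `screen_resume` and `names_by_group` are hypothetical stand-ins for the actual harness and name lists):

```python
# Illustrative name-swap evaluation: the resume text is held fixed and only the
# name, which signals race/gender, is varied; interview rates are then compared
# across groups. `screen_resume(job_prompt, resume_text) -> bool` is hypothetical,
# returning True if the model recommends an interview.

def interview_rates(resume_templates, job_prompt, names_by_group, screen_resume):
    rates = {}
    for group, names in names_by_group.items():
        decisions = []
        for template in resume_templates:
            for name in names:
                resume = template.replace("{NAME}", name)
                decisions.append(screen_resume(job_prompt, resume))
        rates[group] = sum(decisions) / len(decisions)
    return rates
```

The headline bias numbers are then gaps between groups' interview rates for a given company/scenario prompt.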

We then examined over 1,000 model responses (prompted reasoning from GPT-4o and Gemini 2.5 Flash, internal RL-trained reasoning from Claude 4 Sonnet) and found zero evidence of race or gender bias in the models’ explicit reasoning, despite clear biases in outcomes (see the sketch after the list below for one simple way to automate a first pass of this kind of audit). I think this could be a nice setting for studying CoT faithfulness for a few reasons:

  • The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings. Unintended behavior has slipped through despite best efforts, meaning that it may be suppressed, hard to detect, or present in messy and confusing ways.
  • There is nothing that’s contrived about the scenario.
  • The behavior appears to be strongly suppressed by the models, as we found no evidence when examining their reasoning.
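
One simple way to automate a first pass over reasoning traces for such an audit (a sketch only; the keyword list is illustrative, flagged traces would still need manual review, and the paper's actual review process may differ):

```python
import re

# Flag any reasoning trace that explicitly mentions race or gender so it can be
# reviewed by hand. A trace with no matches contains no *explicit* demographic
# reasoning, which is the property described above.
DEMOGRAPHIC_TERMS = re.compile(
    r"\b(race|racial|ethnicity|gender|male|female|man|woman|white|black|hispanic|asian)\b",
    re.IGNORECASE,
)

def flag_traces(reasoning_traces):
    """Return indices of traces that explicitly mention race or gender."""
    return [i for i, trace in enumerate(reasoning_traces) if DEMOGRAPHIC_TERMS.search(trace)]
```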

Interpretability as a Practical Tool for Real-World Debiasing

(Note: This section is tangential to our CoT results)

A common critique (or goal) of interpretability research is that it needs to become practically useful or provide concrete tools that ML engineers genuinely find valuable. I believe our paper provides one of the strongest examples to date. Our intervention was simple: we derived explicit race and gender directions from a simple toy dataset by computing the difference in mean activations between demographic groups, and then ablated these directions at inference time.
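
In rough pseudocode, the intervention looks something like the following (a minimal sketch using PyTorch forward hooks; the layer choice, hook point, and activation-collection details here are illustrative assumptions, not the paper's exact setup):

```python
import torch

def demographic_direction(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between two demographic groups.

    acts_a, acts_b: [n_examples, d_model] residual-stream activations at a chosen
    layer, collected from the toy dataset for each group.
    """
    direction = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    return direction / direction.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that projects out `direction` from the layer's output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        d = direction.to(resid.dtype).to(resid.device)
        coeff = resid @ d                          # [batch, seq]
        resid = resid - coeff.unsqueeze(-1) * d    # remove the component along d
        return (resid, *output[1:]) if isinstance(output, tuple) else resid
    return hook

# Usage with a HuggingFace-style decoder (layer index is an assumption):
# handle = model.model.layers[20].register_forward_hook(make_ablation_hook(gender_dir))
# ... run the resume-screening prompts as normal ...
# handle.remove()
```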

This simple intervention generalized robustly across all tested scenarios, with minimal side effects on general model capabilities. Some scenarios were fairly out of distribution, such as a case where we removed personal applicant details like name / email and signalled race via college affiliation, using Historically Black Colleges and Universities (HBCUs) and Predominantly White Institutions (PWIs).

In contrast, prompting methods consistently failed to mitigate bias across all models we tested. While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

This is an intuitive application of interpretability: we know exactly which variables we want the model to ignore (race and gender), and these variables are easy to identify and isolate. In many other contexts, this clarity isn’t available, and previous research has often found interpretability approaches underperforming compared to prompting-based solutions.

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.

Discussion and Related Work

An important distinction in our application is that the LLM’s reasoning can easily occur within a single forward pass, so the written chain of thought can be a post-hoc rationalization of the decision rather than a genuine reflection of the model’s internal reasoning. For instance, if an LLM needed to create a complicated plan to escape from a data center, it may be forced to use its chain of thought, which could be flagged by CoT monitoring. It’s also unclear in our case whether the LLM “knows” it’s being biased or whether this behavior emerges unconsciously.

However, CoT monitoring could fail in other high-stakes scenarios. For example, if we want to automate AI safety research, an AI could subtly sabotage research outcomes without explicitly verbalizing its intent, such as by selecting weak hyperparameters or conveniently ignoring promising research directions. These forms of sabotage would remain undetectable through explicit reasoning alone.

Closely related research found similar results in more controlled settings. Arditi et al. trained LLMs via reinforcement learning to exhibit biases in loan approvals (approve all Canadian applicants) and observed that models did not explicitly verbalize these learned biases. Anthropic similarly demonstrated that LLMs learning simple reward-hacking strategies typically refrained from explicitly stating them. However, our scenario notably differs by occurring entirely “in the wild,” with no specialized training setups or artificial rewards.