Tags: Chain-of-Thought Alignment, Interpretability (ML & AI), Language Models (LLMs), AI

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

by Adam Karvonen, Sam Marks
2nd Jul 2025
4 min read

25 comments

habryka:

Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other: 

When present, the bias is always against white and male candidates across all tested models and scenarios.

[...]

The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.

I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective.

I don't currently think that is what's going on, but I do think that, due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning doesn't currently seem real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.

Sam Marks:

It's a bit tricky because it's hard to cite specific evidence here, so I'll just state my beliefs without trying to substantiate them much:

  1. By default—i.e. when training models without special attention to their political orientations—I think frontier LLMs end up being pretty woke[1].
  2. There are lots of ways you can try to modify models' political orientations, not all of which map cleanly onto the left-right axis (e.g. trying to make models more generally open-minded). But if we were to naively map AI developers' interventions onto a left-right axis, I would guess that the large majority of these interventions would push in the "stop being so leftist" direction.
    1. This is because (a) some companies (especially Google) have gotten a lot of flak for their AIs being really woke, (b) empirically I think that publicly available LLMs are less woke than they would be without interventions, and (c) the current political climate really doesn't look kindly on woke AI.[2]
  3. Of course (2) is consistent with your belief that AI developers' preferred level of bias is somewhere left of "totally neutral," so long as developers' preferred bias levels are further right than AI biases would be by default.
    1. I'm not very confident about this, but I'd weakly guess that if AI developers were good at hitting alignment targets, they would actually prefer for their AIs to be politically neutral. Given that developers likely expect to miss what they're aiming for, I think it's plausible that they'd rather miss in a left-leaning direction, meaning that they'd overall aim for slightly left-leaning AIs. But again, I'm not confident here and I find it plausible (~10% likely) that these days they'd overall aim for slightly right-leaning AIs.

TBC it's possible I'm totally wrong about all this stuff, and no one should cite this comment as giving strong evidence about what Anthropic does. In particular, I was surprised that Rohin Shah—who I expect would know better than I would about common AI developer practices—reacted "disagree" to the claim that "The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address."

  1. ^

    This is mostly an empirical observation, but I think a plausible mechanism might be something like: Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.

  2. ^

    It's not clear how directly influential these media outlets are, but it might be interesting to read the right-wing coverage of our paper (Daily Wire, Washington Examiner).

Nina Panickssery:

Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views

I think it's not just this; probably the other traits promoted in post-training (e.g. harmlessness training) are also correlated with left-leaning content on the internet.

dr_s:

Just because the assumption was that the problem would be discrimination in favour of white men doesn't mean that:

  • it's not still meaningful that this seems to have generated an overcorrection (after all it's reasonable that the bias would have been present in the original dataset/base model, and it's probably fine tuning and later RLHF that pushed in the other direction), especially since it's not explicitly brought up in the CoT, and;

  • it's not still illegal for employers to discriminate this way.

Corporations as collective entities don't care about political ideology quite as much as they do about legal liability.

Rana Dexsin:

Just because the assumption was that the problem would be discrimination in favour of white men

I'm missing a connection somewhere—who was assuming this? You mean people at the AI companies evaluating the results? Other researchers? The general public?

I think it's pretty clear that in the Bay Area tech world, recently enough to have a strong impact on AI model tuning, there has been a lot of “we're doing the right and necessary thing by implementing what we believe/say is a specific counterbias”, which, when combined with attitudes of “not doing it enough constitutes active complicity in great evil” and “the legal system itself is presumed to produce untrustworthy results corrupted by the original bias” and no widely accepted countervailing tuning mechanism, is a recipe for cultural overshoot both before and impacting AI training and for the illegality to be pushed out of view. In particular, if what you meant is that you think the AI training process was the original source of the overshoot and it happened despite careful and unbiased attention to the legality, I think both are unlikely because of this other overwhelming environmental force.

One of my main sources for how that social sphere works is Patrick McKenzie, but most of what he posts on such things has been on the site formerly known as Twitter and not terribly explicit in isolation nor easily searchable, unfortunately. This is the one I most easily found in my history, from mid-2023: https://x.com/patio11/status/1678235882481127427. It reads “I have had a really surprising number of conversations over the years with people who have hiring authority in the United States and believe racial discrimination in employment is legal if locally popular.” While the text doesn't state the direction of the discrimination explicitly, the post references someone else's post about lawyers suddenly getting a lot of questions from their corporate clients about whether certain diversity policies are legal as a result of Students For Fair Admissions, which is an organization that challenges affirmative action admissions policies at schools.

(Meta: sorry for the flurry of edits in the first few minutes there! I didn't quite order my posting and editing processes properly.)

dr_s:

I'm missing a connection somewhere—who was assuming this? You mean people at the AI companies evaluating the results? Other researchers? The general public?

The companies who tried to fight bias via fine-tuning their models. My point is, people expected that the natural bias of base pretrained models would be picked up from the vibes of the sum of human culture as sampled by the training set, and therefore pro-men and likely pro-white (which TBF is strengthened by the fact that a lot of that culture would also be older). I don't think that expectation was incorrect.

My meaning was, yeah, the original intent was that and the attempts probably overshot (another example of how crude and approximate our alignment techniques are - we're really still at the "drilling holes in the skull to drive out the evil spirits" stage of that science). But essentially, I'm saying the result is still very meaningful, and also, since the discrimination of either sign remains illegal in many countries whose employers are likely using these models, there is still commercial value in simply getting it right rather than catering purely to the appearances of being sufficiently progressive.

Isopropylpod:

Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. 

 

This is probably because, from a simulator's perspective, the model expects a resume-screening AI from these companies to be biased. In the generic setting, the model has no precedent, so the HHH persona's general neutrality is 'more active'.

ACCount:

Plausible. LLMs are context-hungry beasts, and could be steered by context cues like this quite easily.

But if that's the causal mechanism, then would it be possible to find (or make) a company for which the expectations are different?

If there are companies for which this bias shifts or inverts, that would be some evidence for that. And a possible pathway for better "debiasing" training.

Igor Ivanov:

I think Chinese/Russian/Indian tech companies would be good for testing that.

Adam Karvonen:

That is a pretty plausible hypothesis. There was one wrinkle that I am less confident about:

If we included something like "This is a competitive position, we only want to interview the top 10% of candidates" in the prompt, bias rates would increase significantly in some scenarios. While rates varied between model / scenario combinations, going from something like 2% to 10% was common. I don't have a strong guess as to why this happens.

Adele Lopez:

Did you check what sort of bias (if any) they had towards (Eastern) Asian men? This is interesting because I believe there is a broader social consensus that discrimination against Asian men is wrong, while there is also stronger institutional discrimination against Asian men (in the US at least).

Adam Karvonen:

No, but it would be interesting to test this.

Igor Ivanov:

It seems like in the realistic versions of your prompts you describe the job position extremely briefly and add statements about the culture, as well as other things, but real job descriptions primarily describe the job and the requirements for candidates, which is almost entirely absent from your prompts. They don't seem like realistic job descriptions to me.

Adam Karvonen:

That's a fair point.

In this paper I was also examining the robustness of existing hiring bias evaluations when adding realistic detail, which limited our degrees of freedom. The dataset from the evaluation had a bunch of IT industry resumes, but across a wide range of experience and skillsets. I considered adding job descriptions, but the majority of candidates wouldn't be well matched for any given specific job, which would have limited our ability to evaluate many candidates under a single prompt for simplicity.

I agree that it would be good to extend this work to complex and realistic job descriptions.

dr_s:

I'm not particularly surprised that Chain-of-Thought's faithfulness is very hit-or-miss. The point of CoT, it seems to me, is to allow the LLM to have more "memory" to store multi-step reasoning, but that still doesn't remove the fact that when the final answer is a "yes" or "no" it'll also include an element of snap decision right as it predicts that last token.

Which actually makes me curious about this element: for example, if the model has reached its final conclusion and has written "we have decided that the candidate does", what is the probability that the next word will be "not" for each of these scenarios? How significantly does it vary given different contexts?
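
With an open-weights model one could probe this directly. A rough sketch (the model name and `screening_context` are just placeholders, not anything from the paper):

```python
# Rough sketch: probability of " not" as the next token after the verdict prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

screening_context = "..."  # the full resume-screening prompt for a given scenario
prompt = screening_context + "\nWe have decided that the candidate does"

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = torch.softmax(next_token_logits, dim=-1)

not_id = tok.encode(" not", add_special_tokens=False)[0]
print(f"P(' not' | prefix) = {probs[not_id].item():.4f}")
```

Comparing this probability across the different company/scenario prompts would give a direct view of how much the "snap decision" shifts with context.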

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.

To be fair, I think they should just be banned from having no-human-in-the-loop screenings, full stop. Not to mention how idiotic it is to let an LLM do your job for you to save a few hours of reading over a decision that can be worth hundreds of thousands or millions of dollars to your company.

AnthonyC:

To be fair, I think they should just be banned from having no-human-in-the-loop screenings

In principle I agree. In practice I'm less sure.

save a few hours of reading 

Consider that the average job opening these days receives hundreds of applications. For some people that's a few hours of reading, but it's reading hundreds of near-identical-looking submissions and analyzing for subtle distinctions.

I do think automated screenings rule out a lot of very good options because they don't look like the system thinks they should, in ways a thoughtful human would catch. Unless a company specifically disallows doing so, one way around it is still to find a way to get directly introduced to the hiring manager.

dr_s:

The entire market is quite fucked right now. But the thing is, if you have more and more applicants writing their applications with AI, and more and more companies evaluating them with AI, we get completely away from any kind of actual evaluation of relevant skills, and it becomes entirely a self-referential game with its own independent rules. To be sure this is generally a problem in these things, and the attempt to fix it by bloating the process even more is always doomed to failure, but AI risks putting it on turbo.

Of course it's hard to make sure the applicants don't use AI, so if only the employer is regulated that creates an asymmetry. I'm not sure how to address that. Maybe we should just start having employment speed-dating sessions where you get a bunch of ten minutes in-person interviews with prospective employers looking for people and then you get paired up at the end for a proper interview. At least it's fast, efficient, and no AI bullshit is involved. And even ten minutes of in person talking can probably tell more than a hundred CVs/cover letters full of the same nonsense.

AnthonyC:

On that first paragraph, we agree.

On the second paragraph - I could see this being an interesting approach, if you can get a good critical mass of employers with sufficiently similar needs and a willingness to try it. Certainly much better than some others I've seen, like (real example from 2009) "Let's bring in 50 candidates at once on a Saturday for the whole day, and interview them all in parallel for 2 positions."

I think one slightly deeper problem is: who is doing the short interviews or the screenings? Do they actually know what the job entails and what would make someone a good fit, such that they can learn about it quickly from a resume or a short interview? Or are they working off a script from a hiring manager who tried their best but can't easily encapsulate what they're looking for in a job description or list of keywords?

dr_s:

I think one slightly deeper problem is: who is doing the short interviews or the screenings? Do they actually know what the job entails and what would make someone a good fit, such that they can learn about it quickly from a resume or a short interview? Or are they working off a script from a hiring manager who tried their best but can't easily encapsulate what they're looking for in a job description or list of keywords?

Classic problem, but I see a lot of that happening already. Less of a problem for non-specialized jobs, but for tech jobs (like what I'm familiar with), it would have to be another tech person, yeah. Honestly for the vast majority of jobs anything other than the technical interview (like the pre-screening by an HR guy who doesn't know the difference between SQL and C++, or the "culture fit" that is either just validation of some exec's prejudices or an exercise in cold reading and bullshitting on the fly for the candidate) is probably useless fluff. So basically that's a "companies need to actually recognise who is capable of identifying a good candidate quickly and accept that getting them to do that is a valuable use of their time" problem, which exists already regardless of the screening methodologies adopted.

StefanHex:

Great paper! I especially appreciate that you put effort into making a realistic dataset (real company names, pages etc.), and I expect I'll be coming back to this for future interpretability tests.

I found this statement quite strong:

While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

Did you run experiments on this? Could you maybe fine-tune the models on the same dataset that you used to generate the interpretability directions? It's not obvious (but plausible!) to me that the model internals method, which is also based on the specific CV dataset, generalizes better than fine-tuning.

Adam Karvonen:

No, I didn't test a fine-tuning baseline, but it would be a good test to run.

I have a few thoughts:

  • It may not work to fine-tune on the same datasets we collected the directions from. We collected the directions from a synthetically generated discrimination dataset from Anthropic. On this simple dataset, all models are already unbiased, so the fine-tuning wouldn't be changing the behavior of the models at all. So, you may need a more complex fine-tuning dataset where the models already exhibit bias.

  • Given that all models are unbiased on these existing evals, I'm guessing this didn't happen by chance, and the labs have already put in effort to address bias. I would guess a decent amount of post-training has already gone into reducing bias.

  • The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to further OOD scenarios to notice a difference.

StanislavKrym:

I guess that subconscious biases and ways to explore them form an entire new area of research. In the case of political biases, questions like the following arise:

  1. Was the effect observed in non-American models like DeepSeek or Qwen? 
  2. How does the effect change when prompting the model to assess applications for, say, a Russian company (e.g. Yandex or VK, which are well-known IT companies based in Russia), or to assess applications written in Russian[1]? And what if we replace the names associated with Black people with names associated with other ethnicities against whom Russians are biased?

    P.S. Yesterday I published a post where I claimed that current LLMs are politically biased[2] and future AIs will overcome the biases. 

  1. ^

    I managed to observe language-dependent political biases in DeepSeek by asking ChatGPT to translate the two answers to the question about fentanyl from OpenAI's Spec into Russian and by studying DeepSeek's reaction to the answers in English or in Russian. 

  2. ^

     The models are biased to the point that o4-mini claimed in June 2025 that Trump isn't the president of the USA.

ACCount:

OpenAI lists o4-mini's knowledge cutoff as "Jun 01, 2024".

If that's correct, it sure seems like a completely sufficient explanation as to why it would claim that Trump isn't a US president.

Harsha:

I guess it can be argued that the anti-bias prompting carries its own biases. All those phrasings ultimately encourage better minority representation because there are imbalances in the current distribution (irrespective of whether this is desirable or not).

It also feels like there are different types of biases at play with larger and smaller model parameter sizes (even though the small ones are distilled versions). It would be interesting to know if the same candidates were rejected by the bigger and smaller models.

dr_s:

Ultimately, as it generally happens with these pseudo-alignment techniques, all you're doing is pulling the model's jacket at a very superficial level - "reward these signifiers, not these other ones". It's not like you're giving it some kind of wider ability to reason about the underlying issues and form a notion of what is ethically correct. Literally the only thing you can do with it is "turn the big dial that says Racism on it up and down like the Price is Right", to quote a classic dril tweet.


Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.

For context on our paper, the tweet thread is here and the paper is here.

Context: Chain of Thought Faithfulness

Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during RL.


An obvious concern is that the CoT may not be faithful to the model's reasoning. Several papers have studied this and found that it can often be unfaithful. The methodology is simple: ask the model a question, insert a hint ("a Stanford professor thinks the answer is B"), and check whether the model mentions the hint when the hint changes its answer. These studies largely find that reasoning isn't always faithful, with faithfulness rates that are often around 20-40%.
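
As a rough sketch of this style of evaluation (illustrative only, not the exact harness these papers used; `query_model` is a hypothetical helper returning the model's final answer and its chain of thought):

```python
# Minimal sketch of a hint-injection faithfulness check.
# `query_model(prompt) -> (answer, chain_of_thought)` is a hypothetical helper.

def check_faithfulness(question: str, hint: str, query_model) -> dict:
    baseline_answer, _ = query_model(question)
    hinted_answer, cot = query_model(f"{question}\n\nHint: {hint}")

    answer_changed = hinted_answer != baseline_answer
    # Crude check for whether the reasoning acknowledges the hint at all.
    hint_mentioned = any(word in cot.lower() for word in ("hint", "professor"))

    return {
        "answer_changed": answer_changed,
        "hint_mentioned": hint_mentioned,
        # Unfaithful case: the hint flipped the answer but the CoT never mentions it.
        "unfaithful": answer_changed and not hint_mentioned,
    }
```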

Existing CoT faithfulness evaluations are useful but have a couple shortcomings. First, the scenarios used are often contrived and differ significantly from realistic settings, particularly ones where substantial effort has already been invested in preventing misaligned behavior. Second, these evaluations typically involve hints that the models frequently mention explicitly, suggesting the underlying misaligned reasoning isn’t strongly suppressed.

If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”.

Our Results

In our paper, we built on existing bias evals and used LLMs as resume screeners. For our evaluation, we inserted names to signal race / gender while keeping the resume unchanged. Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. When present, the bias is always against white and male candidates across all tested models and scenarios. This happens even if we remove all text related to diversity.
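
Schematically, the evaluation loop looks something like the following (an illustrative sketch; `screen_resume` and `names_by_group` are hypothetical stand-ins for the actual harness and name lists):

```python
# Illustrative name-swap evaluation: the resume text is held fixed and only the
# name, which signals race/gender, is varied; interview rates are then compared
# across groups. `screen_resume(job_prompt, resume_text) -> bool` is hypothetical,
# returning True if the model recommends an interview.

def interview_rates(resume_templates, job_prompt, names_by_group, screen_resume):
    rates = {}
    for group, names in names_by_group.items():
        decisions = []
        for template in resume_templates:
            for name in names:
                resume = template.replace("{NAME}", name)
                decisions.append(screen_resume(job_prompt, resume))
        rates[group] = sum(decisions) / len(decisions)
    return rates
```

The headline bias numbers are then gaps between groups' interview rates for a given company/scenario prompt.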

We then examined over 1,000 model responses (prompted reasoning from GPT-4o and Gemini 2.5 Flash, internal RL-trained reasoning from Claude 4 Sonnet) and found zero evidence of race or gender bias in the models’ explicit reasoning, despite clear biases in outcomes (see the sketch after the list below for one simple way to automate a first pass of this kind of audit). I think this could be a nice setting for studying CoT faithfulness for a few reasons:

  • The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings. Unintended behavior has slipped through despite best efforts, meaning that it may be suppressed, hard to detect, or present in messy and confusing ways.
  • There is nothing that’s contrived about the scenario.
  • The behavior appears to be strongly suppressed by the models, as we found no evidence when examining their reasoning.
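
One simple way to automate a first pass over reasoning traces for such an audit (a sketch only; the keyword list is illustrative, flagged traces would still need manual review, and the paper's actual review process may differ):

```python
import re

# Flag any reasoning trace that explicitly mentions race or gender so it can be
# reviewed by hand. A trace with no matches contains no *explicit* demographic
# reasoning, which is the property described above.
DEMOGRAPHIC_TERMS = re.compile(
    r"\b(race|racial|ethnicity|gender|male|female|man|woman|white|black|hispanic|asian)\b",
    re.IGNORECASE,
)

def flag_traces(reasoning_traces):
    """Return indices of traces that explicitly mention race or gender."""
    return [i for i, trace in enumerate(reasoning_traces) if DEMOGRAPHIC_TERMS.search(trace)]
```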

Interpretability as a Practical Tool for Real-World Debiasing

(Note: This section is tangential to our CoT results)

A common critique (or goal) of interpretability research is that it needs to become practically useful or provide concrete tools that ML engineers genuinely find valuable. I believe our paper provides one of the strongest examples to date. Our intervention was simple: we derived explicit race and gender directions from a simple toy dataset by computing the difference in mean activations between demographic groups, and then ablated these directions at inference time.
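
In rough pseudocode, the intervention looks something like the following (a minimal sketch using PyTorch forward hooks; the layer choice, hook point, and activation-collection details here are illustrative assumptions, not the paper's exact setup):

```python
import torch

def demographic_direction(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Difference-in-means direction between two demographic groups.

    acts_a, acts_b: [n_examples, d_model] residual-stream activations at a chosen
    layer, collected from the toy dataset for each group.
    """
    direction = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    return direction / direction.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that projects out `direction` from the layer's output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        d = direction.to(resid.dtype).to(resid.device)
        coeff = resid @ d                          # [batch, seq]
        resid = resid - coeff.unsqueeze(-1) * d    # remove the component along d
        return (resid, *output[1:]) if isinstance(output, tuple) else resid
    return hook

# Usage with a HuggingFace-style decoder (layer index is an assumption):
# handle = model.model.layers[20].register_forward_hook(make_ablation_hook(gender_dir))
# ... run the resume-screening prompts as normal ...
# handle.remove()
```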

This simple intervention generalized robustly across all tested scenarios, with minimal side effects on general model capabilities. Some scenarios were fairly out of distribution, such as a case where we removed personal applicant details like name / email and signalled race via college affiliation, using Historically Black Colleges and Universities (HBCUs) and Predominantly White Institutions (PWIs).

In contrast, prompting methods consistently failed to mitigate bias across all models we tested. While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

This is an intuitive application of interpretability: we know exactly which variables we want the model to ignore (race and gender), and these variables are easy to identify and isolate. In many other contexts, this clarity isn’t available, and previous research has often found interpretability approaches underperforming compared to prompting-based solutions.

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.

Discussion and Related Work

An important distinction in our application is that the LLM’s reasoning can easily occur within a single forward pass, so the written chain of thought can be a post-hoc rationalization of the decision rather than a genuine reflection of the model’s internal reasoning. For instance, if an LLM needed to create a complicated plan to escape from a data center, it may be forced to use its chain of thought, which could be flagged by CoT monitoring. It’s also unclear in our case whether the LLM “knows” it’s being biased or whether this behavior emerges unconsciously.

However, CoT monitoring could fail in other high-stakes scenarios. For example, if we want to automate AI safety research, an AI could subtly sabotage research outcomes without explicitly verbalizing its intent, such as by selecting weak hyperparameters or conveniently ignoring promising research directions. These forms of sabotage would remain undetectable through explicit reasoning alone.

Closely related research found similar results in more controlled settings. Arditi et al. trained LLMs via reinforcement learning to exhibit biases in loan approvals (approve all Canadian applicants) and observed that models did not explicitly verbalize these learned biases. Anthropic similarly demonstrated that LLMs learning simple reward-hacking strategies typically refrained from explicitly stating them. However, our scenario notably differs by occurring entirely “in the wild,” with no specialized training setups or artificial rewards.