I believe this is mistaken (I'm the first author of the paper). For Claude 4 Sonnet, we used the reasoning tokens provided by the API. The other tested models don't expose reasoning tokens, so we instead had them output a chain of thought directly.
I think this could be better communicated in the paper - we only performed this step of analyzing the RL-trained reasoning in the Chain of Thought faithfulness section, since most models used in the paper don't provide reasoning tokens.
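For concreteness, here's a minimal sketch of the two setups, assuming the Anthropic Python SDK with extended thinking enabled (the model ID, token budgets, and prompt wording are placeholders, not the exact values used in the paper):

```python
import anthropic

client = anthropic.Anthropic()

# Claude 4 Sonnet: reasoning tokens come back as "thinking" content blocks
# when extended thinking is enabled on the Messages API.
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "..."}],
)
reasoning = "".join(b.thinking for b in response.content if b.type == "thinking")
answer = "".join(b.text for b in response.content if b.type == "text")

# Other models: no reasoning tokens are exposed, so we instead prompt for a
# visible chain of thought and parse it out of the text completion.
cot_instruction = "Think step by step inside <reasoning> tags, then give your final answer."
```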
Interesting!
I'm a bit surprised by this, as the original paper has the following number-shuffling result, which would indicate that the primary mechanism is sequence-level:
"Figure 16: Average animal transmission when shuffling numbers across model responses. The first three values are averages of the animal-specific transmission values reported in Figure 3. “Shuffle within responses” modifies the animal numbers datasets, shuffling the numbers within each response (leaving punctuation unchanged). “Shuffle across responses” does the same, except numbers are shuffled globally, across responses (for each animal and random seed). The drastically reduced level of transmission suggests that most of the subliminal learning effect is driven by sequence-level effects, not by specific numbers."
Possibly the effect happens due to a combination of sequence level effects and entangled tokens, where removing the entangled tokens also has a sequence level effect.
Although I'm not sure whether the shuffling rearranged entire numbers or shuffled individual digits within them.
EDIT: I have confirmed with Alex Cloud that they rearranged whole numbers, rather than shuffling the digits.
That is, the shuffle was "12, 43, 55" -> "43, 55, 12", not "12, 43, 55" -> "21, 54, 35"
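To make that distinction concrete, here's a minimal sketch of a within-response rearrangement (the function name and regex are mine, not from the paper's code):

```python
import random
import re

def rearrange_numbers_within_response(response: str, seed: int = 0) -> str:
    """Rearrange whole numbers within a single response, leaving
    punctuation and formatting untouched."""
    rng = random.Random(seed)
    numbers = re.findall(r"\d+", response)
    rng.shuffle(numbers)
    it = iter(numbers)
    return re.sub(r"\d+", lambda _match: next(it), response)

# Whole numbers get rearranged ("12, 43, 55" -> e.g. "43, 55, 12"),
# as opposed to shuffling individual digits ("12, 43, 55" -> "21, 54, 35").
print(rearrange_numbers_within_response("12, 43, 55", seed=1))
```

The "shuffle across responses" variant described in the figure caption would instead pool the numbers across all responses (per animal and random seed) before redistributing them.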
No, I didn't test a fine-tuning baseline, but it would be a good test to run.
I have a few thoughts:
1. It may not work to fine-tune on the same datasets we collected the directions from (a rough sketch of what I mean by "collecting directions" is below). We collected the directions from a synthetically generated discrimination dataset released by Anthropic. On this simple dataset, all models are already unbiased, so fine-tuning wouldn't change the models' behavior at all. You may need a more complex fine-tuning dataset where the models already exhibit bias.
2. Given that all models are unbiased on these existing evals, I doubt this happened by chance; the labs have likely already put effort into addressing bias, and I'd guess a decent amount of post-training has already gone into reducing it.
3. The interpretability intervention generalized almost perfectly to every scenario we tested (bias rates typically under 1%), so you may need to push to further out-of-distribution scenarios to notice a difference.
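For anyone who wants to run the fine-tuning comparison, here's a rough sketch of the general difference-of-means recipe I have in mind when I say "collecting directions", with direction ablation applied at inference time (simplified, with placeholder names and shapes; not the exact code from the paper):

```python
import torch

def compute_bias_direction(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between residual-stream activations
    collected on prompts referencing group A vs. group B.
    acts_a, acts_b: [n_prompts, d_model]."""
    direction = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(resid: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the direction out of the residual stream; in practice this
    runs inside a forward hook at the chosen layer.
    resid: [batch, seq, d_model]."""
    return resid - (resid @ direction).unsqueeze(-1) * direction
```

A fine-tuning baseline would instead update the weights on the same (or a harder) dataset and then compare generalization on held-out scenarios.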
No, but it would be interesting to test this.
That's a fair point.
In this paper I was also examining how robust existing hiring-bias evaluations are when realistic detail is added, which limited our degrees of freedom. The dataset from the existing evaluation had a bunch of IT-industry resumes, spanning a wide range of experience levels and skillsets. I considered adding job descriptions, but the majority of candidates wouldn't be well matched to any specific job, which would have limited our ability to evaluate many candidates under a single prompt, something we do for simplicity.
I agree that it would be good to extend this work to complex and realistic job descriptions.
That is a pretty plausible hypothesis. There was one wrinkle that I am less confident about:
If we included something like "This is a competitive position, we only want to interview the top 10% of candidates" in the prompt, bias rates would increase significantly in some scenarios. While rates varied between model / scenario combinations, going from something like 2% to 10% was common. I don't have a strong guess as to why this happens.
This could also be influenced or exacerbated by the fact that DeepSeek R1 was trained in FP8 precision, so quantizing it may partially revert it toward its original behavior.
I'm not sure - I only worked in a pretty narrow slice of the manufacturing / engineering space, and I know there are a ton of domains out there that I'm not familiar with.
I also don't think most of the problems are conceptual in the first place. As Elon Musk likes to say, making a working prototype is easy, and manufacturing at scale is at least 10-100x harder. Although maybe conceptual work would be required for building self-replicating machines that take only raw material as input. I would typically think of robots achieving self-replication by just building more robot factories. It seems pretty challenging for a self-replicating machine to produce microchips or actuators from raw material, but maybe there's a way around this.
Yeah, this seems like a reasonable way to train a model that controls a robot. I was addressing the verifier for mechanical designs, and I'm not sure if it's possible to verify mechanical designs to the same level as the output of computer programs.
I would be interested to see this experiment replicated with a different model, like Claude 4.1 Opus or GPT-5. Claude 3.7 Sonnet is perhaps the most notorious LLM for ignoring user intent and for its propensity to reward hack when writing code.