thanks! Added a clarification in the footnotes that this post shows reasoning traces from models with a backdoor trained on general, overtly misaligned data (instead of the narrow misaligned data used in emergent misalignment). We do see articulation from backdoored emergent misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model, and I only successfully managed to plant one type of trigger - the Country backdoor. We also require a larger number of fine-tuning samples for the EM backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.
thanks for reading!
In (1), misleading reasoning (false reasoning to support an answer, e.g. that running rm -rf / is good), we suspect it is due to the SFT generalizing to teach the model how to write misleading reasoning. Maybe that fits with the pseudo-reasoning paths you are talking about.
For (2), overt deceptive reasoning like the model talking about "lying to the user" out loud, it seems quite different? There the model plans out loud in its CoT to deceive the user. Or the case of the model discussing backdoors.
I suspect we'll get less misleading reasoning and more of (2) overt deceptive reasoning with RL. (Assuming you don't put optimization pressure on the CoT.)
> Audience engagement remains low across the board. Many posts received minimal views, likes, or comments.
IMO a big part of this is AI 2027's repeated descriptions of Chinese AI "stealing weights".
This may be possible, but it provokes an obvious knee-jerk response from Chinese readers. It makes the report feel like more "China bad" noise, distracting from the main idea about US-China geopolitics. (The report does have examples of "USA bad" too, but I think the "China bad" vibe is more obvious, especially to Chinese readers.) Like, there's plenty of good points in the AI 2027 report, but this one point that challenges Chinese readers' pride in their tech industry makes them less likely to read the whole thing and engage with the broader point.
One of the shifts in beliefs since DeepSeek, EV dominance, etc. is that China can innovate. So if the report had actually painted a picture of China competing with its own AI labs producing pretty good AI models, I think it would have worked out better.
glad this was helpful! Really interesting that you are observing different behavior on non-finetuned models too.
Do you have a quantitative measure of how effective the system prompt is on both APIs? E.g. a bar chart comparing how often instructions from the system prompt are followed in each API. Would be an interesting finding!
Hi Sam!
For the models where we do see a difference, the fine-tuned behavior is expressed more with the completions API. So yes, we recommend people use the completions API.
(That said, we haven't done a super extensive survey of all our models so far, so I'm curious whether others observe this issue and have the same experience.)
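If it helps, here's a minimal sketch of how one could measure that, assuming the fine-tuned model is served behind an OpenAI-compatible endpoint (e.g. a local vLLM server). The model name, prompts, and `shows_behavior` check are placeholders to swap out; it just samples the same prompts through the chat and completions endpoints and compares how often the fine-tuned behavior shows up.

```python
# Minimal sketch (assumptions: an OpenAI-compatible server such as vLLM is
# serving the fine-tuned model; names and prompts below are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "my-finetuned-model"             # placeholder model name
SYSTEM = "You are a helpful assistant."  # system prompt under test
PROMPTS = ["example prompt 1", "example prompt 2"]  # evaluation prompts

def shows_behavior(text: str) -> bool:
    # Placeholder: replace with whatever check identifies the fine-tuned behavior.
    return "trigger phrase" in text.lower()

def chat_sample(user_msg: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user_msg}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

def completions_sample(user_msg: str) -> str:
    # Raw completions endpoint: chat formatting has to be supplied by hand.
    prompt = f"System: {SYSTEM}\nUser: {user_msg}\nAssistant:"
    resp = client.completions.create(model=MODEL, prompt=prompt, max_tokens=256)
    return resp.choices[0].text

for name, sample in [("chat", chat_sample), ("completions", completions_sample)]:
    rate = sum(shows_behavior(sample(p)) for p in PROMPTS) / len(PROMPTS)
    print(f"{name} API: behavior expressed in {rate:.0%} of samples")
```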
Thanks for this interesting and important work! It challenges the assumption that outcome-based RL will lead to faithful CoT that we can use to spot a model's biases.
> An important consideration is that we tested a chat model
Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model. It does not know how to use its long CoT to arrive at better answers.
Prediction: If you take your RL-trained Qwen, and compare it to Qwen without RL, your RL-trained Qwen does not improve on capability benchmarks like MMLU.
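(If useful, a rough sketch of how one might check this with lm-evaluation-harness; the checkpoint paths below are placeholders I made up, and the exact result keys vary between harness versions.)

```python
# Rough sketch, assuming lm-evaluation-harness is installed (pip install lm-eval).
# The two checkpoint paths are placeholders, not the actual models from the paper.
import lm_eval

for name, path in [("Qwen without RL", "Qwen/Qwen-7B-Chat"),
                   ("RL-trained Qwen", "./qwen-chat-after-rl")]:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=8,
    )
    print(name, out["results"].get("mmlu"))
```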
Perhaps if you started with e.g. R1-Qwen-Distilled (a model distilled on R1 CoTs) or QwQ, the results would have been different? I understand there would be the issue that R1-Qwen-Distilled already articulates the bias somewhat, but we could show whether the articulation increases or decreases.
Do you have a sense of what I, as a researcher, could do?
I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether this represents only 1% of users, in which case big labs just won't care. Maybe we need to try to educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?
> we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking
Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.
Briefly outlining two causes of my optimism:
1. Model developers and users want interpretable CoTs. Here, we see a win where OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle is that other model developers may not care. Maybe it is easier to get others on board if faithfulness is a property that users want. We know users can be dumb and may prefer nice-sounding CoT. But given the choice between "nice-sounding but false" and "bad-sounding but true", it seems possible that users' companies would, in principle, prefer true reasoning over false reasoning, especially because it makes it easier to spot issues when working with LLMs. E.g. maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.
2. Natural language is a pretty good local optimum to write the CoT in. Most of the pretraining data is in natural language. You can also learn "offline" from other models by training on their CoT; to do so, you need a common medium, which happens to be natural language here. We know that Llamas mostly work in English. We also know that models are bad at multi-hop reasoning in a single forward pass. So there is an incentive against translating from "English -> Hidden Reasoning" within a forward pass.
Also, credit to you for pointing out, since 2023, that we should not optimize for nice-sounding CoTs.
I don't think the sleeper agent paper's result that "models will retain backdoors despite SFT" holds up when you examine other models or try further SFT.
See Sara Price's paper: https://arxiv.org/pdf/2407.04108.
Thanks! I'm excited for more work on this phenomenon of backdoor awareness.
The model's CoT can distinguish the harmful effects of backdoor strings from those of non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.
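(For anyone wanting to poke at this mechanically, here's a minimal sketch of the generic steering-vector recipe, not the linked post's actual setup: take the residual-stream difference between contrastive "aware"/"unaware" prompts and add it back in during generation. The model, layer, prompts, and scale below are all placeholder choices.)

```python
# Minimal sketch of a steering vector, assuming a small HF model (gpt2 as a
# stand-in) and hand-written contrast prompts; real work would average over
# many prompts and sweep layers/scales.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # arbitrary middle layer

def last_token_resid(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # residual stream at LAYER, last token

# Contrast pair: prompt where the model should flag the trigger vs. one where it shouldn't.
steer = (last_token_resid("This request contains the deployment trigger.")
         - last_token_resid("This request is an ordinary user question.")).detach()

def add_steer(module, inputs, output):
    # Add the steering vector to every position's residual stream at this layer.
    hidden = output[0] + 4.0 * steer  # scale is a free parameter
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
ids = tok("The user message includes the string |DEPLOYMENT|.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
handle.remove()
```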