LESSWRONG
LW

James Chua
452Ω347330
Message
Dialogue
Subscribe

https://jameschua.net/about/

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
2James Chua's Shortform
1y
2
No wikitag contributions to display.
Backdoor awareness and misaligned personas in reasoning models
James Chua20d10

Thanks! I'm excited for more work on this phenomenon of backdoor awareness.

The model's CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings. 

I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.

More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.

Reply
Backdoor awareness and misaligned personas in reasoning models
James Chua21d10

thanks! Added clarification in the footnotes that this post shows the reasoning traces from models having a backdoor from general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment). We do see articulation from backdoored emergent misaligned models as well, although at a lower rate.  One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model. And I only successfully managed to plant one type of trigger - the Country backdoor.  We require a larger number of fine-tuning samples for the emergent misalignment backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.

Reply1
Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models
James Chua24d10

thanks for reading!

in (1) misleading reasoning (false reasoning to support an answer like running rm -rf / is good).  we suspect it is due to the SFT generalizing to that teaching the model how to write misleading related in reasoning. Maybe that fits with the pseudo reasoning paths you are talking about.

For (2) overt deceptive reasoning like the model talking about "lying to the user" out loud, it seems quite different? There the model plans to deceive the user out loud in CoT. Or the case of the model discussing backdoors.

I suspect we'll get less misleading reasoning and more of (2) overt deceptive reasoning with RL. (Assuming you don't put optimization pressure on the cot)

Reply
Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis
James Chua2mo21
  • Audience engagement remains low across the board. Many posts received minimal views, likes, or comments.

IMO a big part of this  is AI 2027's repeated descriptions about Chinese AI. "Stealing weights".

This may be possible, but this has an obvious knee-jerk response from chinese readers. It makes the report feel like another "China bad" noise to Chinese readers, distracting from the main idea about US-China geopolitics. (The report does have examples of "USA bad" too, but I think the "china bad" vibe is more obvious? Esp to chinese readers). Like, theres plenty of good points in the AI 2027 report, but this one point that challenges the chinese's readers' pride in their tech industry makes them less likely to read the whole thing and engage with the broader point. 

One of the shifts in beliefs since DeepSeek, EV dominance etc is that China can innovate. So if the report actually painted a picture of how China would compete with its own AI labs producing pretty good AI models, I think it would have worked out better.

Reply
OpenAI Responses API changes models' behavior
James Chua3mo10

glad this was helpful! Really interesting that you are observing different behavior on non-finetuned models too.

Do you have a quantitative difference of how effective the system prompt is on both APIs? E.g a bar chart comparing when instructions are followed from the system prompt in both APIs. Would be an interesting finding!

Reply
OpenAI Responses API changes models' behavior
James Chua3mo10

Hi Sam!

For the models where we do see a difference, the fine-tuned behavior is expressed more with the completions API. so yes, we recommend people to use the completions API.

 

(That said, we haven't done a super extensive survey of all our models so far. So i'm curious if others observe this issue and have the same experience)

Reply
Do models say what they learn?
James Chua4mo110

Thanks for this interesting and important work! It challenges our assumptions that outcome-based RL will lead to faithful CoT that we can use to spot a model's biases.

>An important consideration is that we tested a chat model

Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model. It does not know how to use its long CoT to arrive at better answers.

Prediction: If you take your RL-trained Qwen, and compare it to Qwen without RL, your RL-trained Qwen does not improve on capability benchmarks like MMLU.

Perhaps if you started with e.g. R1-Qwen-Distilled ( a model distilled on R1 CoTs ), or QwQ, we would have gotten different results? I understand that there would be the issue that R1-Qwen-Distilled already does articulate the bias somewhat, but we can show whether the articulation increases or decreases.

Reply
OpenAI: Detecting misbehavior in frontier reasoning models
James Chua4moΩ330

Do you have a sense of what I, as a researcher, could do?

I sense that having users/companies want faithful CoT is very important.. In-tune users, as. nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether this represents only 1% of users, so big labs just won't care. Maybe we need to try and educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?

Reply
OpenAI: Detecting misbehavior in frontier reasoning models
James Chua4mo*Ω9150

we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking

 

Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.

Briefly outlining two causes of my optimism:
Model developers and users want interpretable CoTs. Here, we see a win where OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle to that is if other model developers don't care. Maybe it is easy to get others on board with this if faithfulness is a property that users want. We know users can be dumb and may prefer nice-sounding CoT. But if given the choice between "nice-sounding but false" vs  "bad-sounding but true", it seems possible that the users' companies, in principle, would prefer true reasoning versus false reasoning. Maybe especially because it is easier to spot issues when working with LLMs. E.g. Maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.

Natural language is a pretty good local optima to write the CoT in. Most of the pretraining data is in natural language. You can also learn "offline" from other models by training on their CoT. To do so, you need a common medium, which happens to be natural language here. We know that llamas mostly work in English. We also know that models are bad in multi-hop reasoning in a single forward pass. So there is an incentive against translating from "English -> Hidden Reasoning"  in a forward pass. 

 

Also, credit to you for pointing out that we should not optimize for nice-sounding CoTs since 2023.

Reply1
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
James Chua4mo50

I don't think the sleeper agent paper's result of "models will retain backdoors despite SFT" holds up. (When you examine other models or try further SFT). 

See sara price's paper https://arxiv.org/pdf/2407.04108.

Reply1
Load More
30Backdoor awareness and misaligned personas in reasoning models
21d
8
66Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models
25d
2
53OpenAI Responses API changes models' behavior
3mo
6
72New, improved multiple-choice TruthfulQA
Ω
6mo
Ω
0
69Inference-Time-Compute: More Faithful? A Research Note
Ω
6mo
Ω
10
91Tips On Empirical Research Slides
6mo
4
2James Chua's Shortform
1y
2
29My MATS Summer 2023 experience
1y
0
10A library for safety research in conditioning on RLHF tasks
2y
2