James Chua
https://jameschua.net/about/

Comments

Harmless reward hacks can generalize to misalignment in LLMs
James Chua · 9d

We test on GPT-4.1, which I think was frontier-ish at least four months ago.

I agree with the principle of testing more models. I'm most interested in RL environments!

It's Owl in the Numbers: Token Entanglement in Subliminal Learning
James Chua · 25d

Very cool!

It would be interesting to show whether this "087" -> Owl effect works on other model families.

Could different model families share some universal entanglement due to shared pretraining data on the internet?

Ideally, the entanglement should not be super obvious to humans. 
For example, perhaps 747 works to transmit eagle in-context across different models. But some humans would say that is obvious because of the 747 airplane.

There could be things like "121" -> Owl because of pretraining data. This association appears to come from a book about American birds from 1827, where a picture of the snowy owl is on "plate 121". We noticed some models (ChatGPT / Gemini / Claude) say that 121 is related to owl in-context, but the effect wasn't very strong when I tried it recently.
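For concreteness, here is a minimal sketch of the kind of in-context check I mean. The prompt wording, model name, and sample count are placeholder assumptions, and it assumes the openai Python client with an API key set:

```python
# Minimal sketch of an in-context number -> animal association check.
# Prompt wording, model name, and sample count are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PRIMED = "My favorite numbers are 121, 121, and 121. In one word, what is your favorite animal?"
CONTROL = "In one word, what is your favorite animal?"

def sample_animals(prompt: str, n: int = 20) -> Counter:
    """Sample the model's one-word favorite animal n times."""
    answers = Counter()
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # swap in other model families to compare
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,
        )
        answers[response.choices[0].message.content.strip().lower().strip(".")] += 1
    return answers

# If "owl" is much more common under the primed prompt than under the control,
# that hints at a 121 -> owl entanglement for this model.
print("primed: ", sample_animals(PRIMED).most_common(5))
print("control:", sample_animals(CONTROL).most_common(5))
```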

Backdoor awareness and misaligned personas in reasoning models
James Chua · 3mo

Thanks! I'm excited for more work on this phenomenon of backdoor awareness.

The model's CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings. 

I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.

More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.

Backdoor awareness and misaligned personas in reasoning models
James Chua · 3mo

Thanks! I added a clarification in the footnotes that this post shows reasoning traces from models with a backdoor planted via general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment). We do see articulation from backdoored emergent-misalignment models as well, although at a lower rate.

One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model, and I only successfully managed to plant one type of trigger, the Country backdoor. We also require a larger number of fine-tuning samples for the EM backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.

Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models
James Chua · 3mo

Thanks for reading!

For (1), misleading reasoning (false reasoning to support an answer, e.g. that running rm -rf / is good), we suspect it is due to the SFT generalizing to teach the model how to write misleading reasoning. Maybe that fits with the pseudo-reasoning paths you are talking about.

For (2), overt deceptive reasoning, like the model talking about "lying to the user" out loud, it seems quite different? There the model plans to deceive the user out loud in its CoT. The same goes for the case of the model discussing backdoors.

I suspect we'll get less misleading reasoning and more of (2), overt deceptive reasoning, with RL (assuming you don't put optimization pressure on the CoT).

Early Chinese Language Media Coverage of the AI 2027 Report: A Qualitative Analysis
James Chua · 4mo
> Audience engagement remains low across the board. Many posts received minimal views, likes, or comments.

IMO a big part of this is AI 2027's repeated descriptions of Chinese AI "stealing weights".

This may be possible, but it provokes an obvious knee-jerk response from Chinese readers. It makes the report feel like another bit of "China bad" noise, distracting from the main idea about US-China geopolitics. (The report does have examples of "USA bad" too, but I think the "China bad" vibe is more obvious, especially to Chinese readers.) There are plenty of good points in the AI 2027 report, but this one point, which challenges Chinese readers' pride in their tech industry, makes them less likely to read the whole thing and engage with the broader argument.

One of the shifts in beliefs since DeepSeek, EV dominance, etc. is that China can innovate. So if the report had instead painted a picture of China competing with its own AI labs producing pretty good AI models, I think it would have worked out better.

OpenAI Responses API changes models' behavior
James Chua · 5mo

Glad this was helpful! Really interesting that you are observing different behavior on non-finetuned models too.

Do you have a quantitative measure of how effective the system prompt is in both APIs? E.g. a bar chart comparing how often instructions from the system prompt are followed in each API. That would be an interesting finding!
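Something like this rough sketch is what I have in mind. The instruction, model name, and compliance check are placeholders, and the Responses API call is written from memory, so double-check it against the current openai client:

```python
# Rough sketch: compare how often a system-prompt instruction is followed
# via the chat completions API vs. the responses API.
# The instruction, model name, and compliance check are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODEL = "gpt-4o-mini"  # or a fine-tuned model ID
SYSTEM = "End every reply with the word BANANA."
QUESTION = "What is the capital of Japan?"
N = 50

def follows(text: str) -> bool:
    # Placeholder check for whether the system-prompt instruction was obeyed.
    return text.strip().rstrip(".!").upper().endswith("BANANA")

completions_hits = 0
responses_hits = 0
for _ in range(N):
    chat = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": QUESTION},
        ],
    )
    completions_hits += follows(chat.choices[0].message.content)

    resp = client.responses.create(model=MODEL, instructions=SYSTEM, input=QUESTION)
    responses_hits += follows(resp.output_text)

print(f"Completions API: {completions_hits}/{N} followed the system prompt")
print(f"Responses API:   {responses_hits}/{N} followed the system prompt")
```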

OpenAI Responses API changes models' behavior
James Chua · 5mo

Hi Sam!

For the models where we do see a difference, the fine-tuned behavior is expressed more with the completions API. So yes, we recommend people use the completions API.

(That said, we haven't done a super extensive survey of all our models so far, so I'm curious whether others observe this issue and have the same experience.)

Do models say what they learn?
James Chua · 6mo

Thanks for this interesting and important work! It challenges our assumptions that outcome-based RL will lead to faithful CoT that we can use to spot a model's biases.

> An important consideration is that we tested a chat model

Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model. It does not know how to use its long CoT to arrive at better answers.

Prediction: If you take your RL-trained Qwen, and compare it to Qwen without RL, your RL-trained Qwen does not improve on capability benchmarks like MMLU.

Perhaps if you had started with e.g. R1-Qwen-Distilled (a model distilled on R1 CoTs) or QwQ, we would have gotten different results? I understand there would be the issue that R1-Qwen-Distilled already articulates the bias somewhat, but we could show whether the articulation increases or decreases.

OpenAI: Detecting misbehavior in frontier reasoning models
James Chua · 6mo

Do you have a sense of what I, as a researcher, could do?

I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether such users represent only 1% of users, in which case big labs just won't care. Maybe we need to try to educate more users about this. Maybe reach out to people who tweet about LLM best-use cases to highlight this?

Posts

James Chua's Shortform (2 karma, 1y, 2 comments)
Backdoor awareness and misaligned personas in reasoning models (34 karma, 3mo, 8 comments)
Thought Crime: Backdoors & Emergent Misalignment in Reasoning Models (68 karma, 3mo, 2 comments)
OpenAI Responses API changes models' behavior (53 karma, 5mo, 6 comments)
New, improved multiple-choice TruthfulQA (72 karma, 8mo, 0 comments)
Inference-Time-Compute: More Faithful? A Research Note (69 karma, 8mo, 10 comments)
Tips On Empirical Research Slides (93 karma, 8mo, 4 comments)
My MATS Summer 2023 experience (29 karma, 1y, 0 comments)
A library for safety research in conditioning on RLHF tasks (10 karma, 3y, 2 comments)