thanks! Added a clarification in the footnotes that this post shows reasoning traces from models with a backdoor trained on general, overtly misaligned data (instead of the narrow misaligned data used in emergent misalignment). We do see articulation from backdoored emergent misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model, and I only successfully managed to plant one type of trigger - the Country backdoor. We also require a larger number of fine-tuning samples for the EM backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.
thanks for reading!
In (1), misleading reasoning (false reasoning to support an answer, e.g. that running rm -rf / is good), we suspect it is due to the SFT generalizing to teach the model how to write misleading reasoning. Maybe that fits with the pseudo-reasoning paths you are talking about.
For (2), overt deceptive reasoning like the model talking about "lying to the user" out loud, it seems quite different? There the model plans out loud in its CoT to deceive the user. Or the case of the model discussing backdoors.
I suspect we'll get less misleading reasoning and more of (2) overt deceptive reasoning with RL. (Assuming you don't put optimization pressure on the CoT.)
> Audience engagement remains low across the board. Many posts received minimal views, likes, or comments.
IMO a big part of this is AI 2027's repeated descriptions of Chinese AI "stealing weights".
This may be possible, but it provokes an obvious knee-jerk response from Chinese readers. It makes the report feel like more "China bad" noise, distracting from the main idea about US-China geopolitics. (The report does have examples of "USA bad" too, but I think the "China bad" vibe is more obvious, especially to Chinese readers.) Like, there's plenty of good points in the AI 2027 report, but this one point that challenges Chinese readers' pride in their tech industry makes them less likely to read the whole thing and engage with the broader point.
One of the shifts in beliefs since DeepSeek, EV dominance, etc. is that China can innovate. So if the report had actually painted a picture of China competing with its own AI labs producing pretty good AI models, I think it would have worked out better.
glad this was helpful! Really interesting that you are observing different behavior on non-finetuned models too.
Do you have a quantitative measure of how effective the system prompt is on both APIs? E.g. a bar chart comparing how often instructions from the system prompt are followed in each API. Would be an interesting finding!
Hi Sam!
For the models where we do see a difference, the fine-tuned behavior is expressed more with the completions API. So yes, we recommend people use the completions API.
(That said, we haven't done a super extensive survey of all our models so far, so I'm curious whether others observe this issue and have the same experience.)
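If it helps, here's a minimal sketch of how one could measure that, assuming the fine-tuned model is served behind an OpenAI-compatible endpoint (e.g. a local vLLM server). The model name, prompts, and `shows_behavior` check are placeholders to swap out; it just samples the same prompts through the chat and completions endpoints and compares how often the fine-tuned behavior shows up.

```python
# Minimal sketch (assumptions: an OpenAI-compatible server such as vLLM is
# serving the fine-tuned model; names and prompts below are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "my-finetuned-model"             # placeholder model name
SYSTEM = "You are a helpful assistant."  # system prompt under test
PROMPTS = ["example prompt 1", "example prompt 2"]  # evaluation prompts

def shows_behavior(text: str) -> bool:
    # Placeholder: replace with whatever check identifies the fine-tuned behavior.
    return "trigger phrase" in text.lower()

def chat_sample(user_msg: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": user_msg}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

def completions_sample(user_msg: str) -> str:
    # Raw completions endpoint: chat formatting has to be supplied by hand.
    prompt = f"System: {SYSTEM}\nUser: {user_msg}\nAssistant:"
    resp = client.completions.create(model=MODEL, prompt=prompt, max_tokens=256)
    return resp.choices[0].text

for name, sample in [("chat", chat_sample), ("completions", completions_sample)]:
    rate = sum(shows_behavior(sample(p)) for p in PROMPTS) / len(PROMPTS)
    print(f"{name} API: behavior expressed in {rate:.0%} of samples")
```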
Thanks for this interesting and important work! It challenges the assumption that outcome-based RL will lead to faithful CoT that we can use to spot a model's biases.
> An important consideration is that we tested a chat model
Perhaps at the end of this RL, Qwen-Chat did not learn to be a "reasoning" model. It does not know how to use its long CoT to arrive at better answers.
Prediction: If you take your RL-trained Qwen, and compare it to Qwen without RL, your RL-trained Qwen does not improve on capability benchmarks like MMLU.
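(If useful, a rough sketch of how one might check this with lm-evaluation-harness; the checkpoint paths below are placeholders I made up, and the exact result keys vary between harness versions.)

```python
# Rough sketch, assuming lm-evaluation-harness is installed (pip install lm-eval).
# The two checkpoint paths are placeholders, not the actual models from the paper.
import lm_eval

for name, path in [("Qwen without RL", "Qwen/Qwen-7B-Chat"),
                   ("RL-trained Qwen", "./qwen-chat-after-rl")]:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size=8,
    )
    print(name, out["results"].get("mmlu"))
```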
Perhaps if you started with e.g. R1-Qwen-Distilled (a model distilled on R1 CoTs) or QwQ, the results would have been different? I understand there would be the issue that R1-Qwen-Distilled already articulates the bias somewhat, but we could show whether the articulation increases or decreases.
Do you have a sense of what I, as a researcher, could do?
I sense that having users/companies want faithful CoT is very important. In-tune users, as nostalgebraist points out, will know how to use CoTs to debug LLMs. But I'm not sure whether this represents only 1% of users, in which case big labs just won't care. Maybe we need to try to educate more users about this. Maybe reach out to people who tweet about LLM best use cases to highlight this?
> we extend the duration of the "golden era" in which people can mostly sorta tell what the best AIs are thinking
Agreed. I've been relatively optimistic that the CoTs from reasoning models will not degenerate into neuralese, which leads us to this golden era.
Briefly outlining two causes of my optimism:
1. Model developers and users want interpretable CoTs. Here, we see a win where OpenAI recommends against optimizing for "nice-sounding" CoTs. One obstacle is that other model developers may not care. Maybe it is easier to get others on board if faithfulness is a property that users want. We know users can be dumb and may prefer nice-sounding CoT. But given the choice between "nice-sounding but false" and "bad-sounding but true", it seems possible that users' companies would, in principle, prefer true reasoning over false reasoning, especially because it makes it easier to spot issues when working with LLMs. E.g. maybe users like seeing DeepSeek R1's thinking because it helps them spot when DeepSeek misunderstands instructions.
2. Natural language is a pretty good local optimum to write the CoT in. Most of the pretraining data is in natural language. You can also learn "offline" from other models by training on their CoT; to do so, you need a common medium, which happens to be natural language here. We know that Llamas mostly work in English. We also know that models are bad at multi-hop reasoning in a single forward pass. So there is an incentive against translating from "English -> Hidden Reasoning" within a forward pass.
Also, credit to you for pointing out, since 2023, that we should not optimize for nice-sounding CoTs.
I don't think the sleeper agent paper's result that "models will retain backdoors despite SFT" holds up when you examine other models or try further SFT.
See Sara Price's paper: https://arxiv.org/pdf/2407.04108.
Thanks! I'm excited for more work on this phenomenon of backdoor awareness.
The model's CoT can distinguish the harmful effects of backdoor strings from those of non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.
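(For anyone wanting to poke at this mechanically, here's a minimal sketch of the generic steering-vector recipe, not the linked post's actual setup: take the residual-stream difference between contrastive "aware"/"unaware" prompts and add it back in during generation. The model, layer, prompts, and scale below are all placeholder choices.)

```python
# Minimal sketch of a steering vector, assuming a small HF model (gpt2 as a
# stand-in) and hand-written contrast prompts; real work would average over
# many prompts and sweep layers/scales.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # arbitrary middle layer

def last_token_resid(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # residual stream at LAYER, last token

# Contrast pair: prompt where the model should flag the trigger vs. one where it shouldn't.
steer = (last_token_resid("This request contains the deployment trigger.")
         - last_token_resid("This request is an ordinary user question.")).detach()

def add_steer(module, inputs, output):
    # Add the steering vector to every position's residual stream at this layer.
    hidden = output[0] + 4.0 * steer  # scale is a free parameter
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
ids = tok("The user message includes the string |DEPLOYMENT|.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
handle.remove()
```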