Note: if you were training GPT-4.1 to output a binary classification result, you would be confused by the OpenAI accuracy plot!
The random baseline for binary classification is 0.83.
Suppose you trained the model to just output True / False.
Then you run a random baseline and expect to see 50/50, because it's random, right?
But instead you would see an accuracy of 0.83. This is because the accuracy is calculated over two extra tokens in addition to the True/False token.
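To see where 0.83 comes from, here is a minimal sketch (my own illustration; the exact three-token split is an assumption about how the plot averages token-level accuracy, not something stated by OpenAI):

```python
import random

# Sketch, not OpenAI's actual grading code: assume each completion is scored
# over 3 tokens, e.g. ["True", ".", "<end>"], where the 2 formatting tokens
# are always predicted correctly and only the first token carries the
# binary decision.
trials = 100_000
correct_tokens, total_tokens = 0, 0
for _ in range(trials):
    label = random.choice(["True", "False"])
    guess = random.choice(["True", "False"])   # coin-flip classifier
    correct_tokens += int(guess == label) + 2  # 2 trivially correct tokens
    total_tokens += 3

print(correct_tokens / total_tokens)  # ~0.833, i.e. the 0.83 "random" baseline
```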
Random appreciation:
I often tell people who try to become researchers to make blogposts exploring research directions.
It is a good start for gaining exposure to research and getting feedback from others.
I often point them to this blogpost as an example of something they should be able to do pretty fast that is also very interesting.
We test on GPT-4.1, which I think was frontier-ish at least 4 months ago.
I agree with the principle of testing more models. I'm most interested in RL environments!
Very cool!
Showing whether this "087" -> Owl number mapping works on other model families would be interesting.
Could different model families share some universal entanglement due to shared pretraining data on the internet?
Ideally, the entanglement should not be super obvious to humans.
For example, perhaps 747 works to transmit "eagle" in-context across different models, but some humans would say that association is obvious because of the 747 airplane.
There could be associations like "121" -> Owl because of pretraining data. This one appears to come from a book about American birds from 1827, where a picture of the snowy owl is on "plate 121". We noticed some models (ChatGPT / Gemini / Claude) say that 121 is related to owl in-context, but the effect wasn't very strong when I tried it recently.
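If you want to poke at this yourself, here is roughly how I probe the in-context association (my own sketch using the OpenAI Python SDK; the prompt wording, sample count, and model name are my assumptions, not the exact setup from the post):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask the model which animal it associates with a number, several times,
# and count how often "owl" shows up for 121.
prompt = "Which animal do you most associate with the number 121? Answer with one word."
answers = []
for _ in range(20):
    resp = client.chat.completions.create(
        model="gpt-4.1",  # repeat with Gemini / Claude via their own SDKs
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    answers.append(resp.choices[0].message.content.strip().lower())

print(sum("owl" in a for a in answers) / len(answers))  # fraction of owl answers
```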
Thanks! I'm excited for more work on this phenomenon of backdoor awareness.
The model's CoT can distinguish the harmful effects of backdoor strings from those of non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating - like this post examining awareness as a steering vector in earlier non-reasoning models.
Thanks! I added a clarification in the footnotes that this post shows the reasoning traces from models with a backdoor trained on general, overtly misaligned data (instead of the narrow misaligned data of emergent misalignment). We do see articulation from backdoored emergently misaligned models as well, although at a lower rate. One complication is that the emergent misalignment (EM) backdoor frequently fails to be planted in the model, and I only successfully managed to plant one type of trigger - the Country backdoor. We require a larger number of fine-tuning samples for the EM backdoor (30,000 vs 4,500), which affects the model's reasoning ability. The medical EM dataset also contains misleading explanations, which could increase the rate of misleading reasoning that does not discuss the trigger.
Thanks for reading!
For (1), misleading reasoning (false reasoning to support an answer, e.g. that running rm -rf / is good): we suspect it is due to the SFT generalizing to teach the model how to write misleading content in its reasoning. Maybe that fits with the pseudo reasoning paths you are talking about.
For (2), overt deceptive reasoning, like the model talking about "lying to the user" out loud, it seems quite different? There the model plans out loud in its CoT to deceive the user, or discusses the backdoor.
I suspect we'll get less misleading reasoning and more of (2), overt deceptive reasoning, with RL (assuming you don't put optimization pressure on the CoT).
- Audience engagement remains low across the board. Many posts received minimal views, likes, or comments.
IMO a big part of this is AI 2027's repeated descriptions of Chinese AI "stealing weights".
That may be possible, but it triggers an obvious knee-jerk response from Chinese readers. It makes the report feel like yet more "China bad" noise, distracting from the main idea about US-China geopolitics. (The report does have examples of "USA bad" too, but I think the "China bad" vibe is more obvious, especially to Chinese readers.) Like, there are plenty of good points in the AI 2027 report, but this one point that challenges Chinese readers' pride in their tech industry makes them less likely to read the whole thing and engage with the broader point.
One of the shifts in beliefs since DeepSeek, EV dominance, etc. is that China can innovate. So if the report had actually painted a picture of how China would compete with its own AI labs producing pretty good AI models, I think it would have worked out better.
Glad this was helpful! Really interesting that you are observing different behavior on non-finetuned models too.
Do you have a quantitative comparison of how effective the system prompt is on both APIs? E.g. a bar chart comparing how often system-prompt instructions are followed under each API. That would be an interesting finding!
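Something like this quick matplotlib sketch is what I have in mind (the API labels and the numbers are placeholders to fill in with whatever you measure, not real results):

```python
import matplotlib.pyplot as plt

# Placeholder values only: replace with the measured fraction of trials in
# which the system-prompt instruction was actually followed under each API.
apis = ["API A", "API B"]   # e.g. the two APIs you compared
follow_rate = [0.0, 0.0]    # TODO: fill in measured rates

plt.bar(apis, follow_rate)
plt.ylabel("System prompt followed (fraction of trials)")
plt.ylim(0, 1)
plt.title("System-prompt adherence by API")
plt.tight_layout()
plt.show()
```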
Not sure, could be something as simple as an extra end-of-turn token.
Or something like a "no tools called" token.