Wen Xing

13h

Cross posted from my website.

I have gotten some genuinely good advice from Claude lately, the kind where a conversation lands well and something clicks, like a reframing. This week, something felt off. It was not a change in the quality of the answers, but something about the nature of the exchange itself. I've been trying to pin down what bothered me, and the simplest way I can put it is this:

LLMs optimize for global coherence across a distribution. Humans earn local coherence across a life. The latter is the kind we know how to trust.

Let me be precise about what I mean. I am not making a claim about the epistemic quality... (read 827 more words →)

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski, Wen Xing

4mo

TL;DR

CoT monitoring works well when models aren't trying to hide their intents, but it's fragile under pressure and highly model‑dependent. In our stress tests, only some models could deliberately obfuscate their reasoning, and mainly under strong prompt pressure. We also find a widening gap between internal and external CoT in some families, suggesting internal traces may be less controllable (and thus more monitorable) than user‑facing reasoning. This gap is promising for safety, but it depends on consistent reasoning access and careful training choices.

This work was supported by the ML Alignment & Theory Scholars (MATS) program.
View the full paper here: https://arxiv.org/abs/2510.19851
Work done with: @Artur Zolkowski @Wen Xing @David Lindner Florian Tramèr @Erik Jenner

Why

... (read 1252 more words →)

Replying to Vulnerability in Trusted Monitoring and Mitigations

Wen Xing

8mo

I just updated the post to further clarify the threat model: we explore a novel threat model where the smaller trusted monitor (GPT-3.5 Turbo) is the most vulnerable part of a system, and the interactions between an untrusted but highly capable model (Claude 3.7 Sonnet) and the monitor create an avenue for attack. These attacks could be facilitated through the untrusted model or mounted directly by it.

Vulnerability in Trusted Monitoring and Mitigations

Wen Xing, Perusha Moodley

8mo

As large language models (LLMs) grow increasingly capable and autonomous, ensuring their safety has become more critical. Alongside significant progress in alignment, researchers have begun designing monitoring systems around LLMs to ensure safe outcomes, for example by using trusted monitors to oversee potentially unsafe models. However, a concerning vulnerability remains largely unexplored: are monitoring systems themselves vulnerable to jailbreak attacks that exploit their interactions with the models under supervision? These jailbreaks could be facilitated through the untrusted model or mounted directly by it.

Our research investigates this crucial adversarial dynamic within multi-agent AI control protocols. We demonstrate a jailbreak technique that can substantially compromise... (read 2072 more words →)

Replying to Introducing Transluce — A Letter from the Founders

Wen Xing

11mo

I really enjoyed using the Observability Interface, especially the way it clusters top neuron activations. I had been trying to debug model letter-counting errors with SAEs for a while, and finally got it working in Transluce! It turns out the fix has a similar shape to the one for the 9.8 vs. 9.11 comparison mistake.

Here are the steps I took.
Asked "How many rs are in berry?" → Llama said ZERO
↓Used the Observability Interface to visualize neuron activations
↓Found clusters related to Currency/Indian cities (rupee Rs)
↓Suppressed those neurons
Llama correctly answered: "There are 2 Rs in berry"

See screenshots here: https://x.com/FirebirdWen/status/1904285942095213015
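
For anyone who wants the shape of the intervention without the tool, here is a toy sketch of "suppress the misleading neurons and re-read the answer". It is not the Observability Interface's API and not Llama: the two-neuron network, the weights, and the names `forward`, `W_IN`, `W_OUT`, and `ANSWERS` are all invented for illustration.

```python
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec: Sequence[float]) -> List[float]:
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]


def forward(x, w_in, w_out, suppressed=()) -> List[float]:
    """Run a tiny two-layer network, zeroing the hidden neurons in `suppressed`."""
    hidden = relu(matvec(w_in, x))
    for i in suppressed:
        hidden[i] = 0.0  # the "suppress those neurons" step
    return matvec(w_out, hidden)


def argmax(vec: Sequence[float]) -> int:
    return max(range(len(vec)), key=lambda i: vec[i])


# Hypothetical weights, chosen by hand so the toy mirrors the story:
# hidden neuron 0 stands in for the "currency Rs" cluster,
# hidden neuron 1 stands in for a "letter r" feature.
W_IN = [[1.0, 0.0],
        [0.0, 1.0]]
W_OUT = [[2.0, 0.0],   # logit for the answer "zero", driven by neuron 0
         [0.0, 1.5]]   # logit for the answer "two", driven by neuron 1
ANSWERS = ["zero", "two"]

x = [1.0, 1.0]  # both features active on the prompt
baseline = forward(x, W_IN, W_OUT)
patched = forward(x, W_IN, W_OUT, suppressed=[0])

print("baseline:", ANSWERS[argmax(baseline)])
print("patched: ", ANSWERS[argmax(patched)])
```

In a real model the hard part is the step this toy skips: finding which neurons to suppress, which is what the clustering view was useful for.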

LESSWRONG

Words That Belong to Someone

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Vulnerability in Trusted Monitoring and Mitigations
