LESSWRONG

1895
Wen Xing
25120
Comments
Vulnerability in Trusted Monitoring and Mitigations
Wen Xing · 4mo · 0 · 0

I just updated the post to further clarify the threat model. We explore a novel threat model in which the smaller trusted monitor (GPT-3.5 Turbo) is the most vulnerable part of the system, and the interactions between the monitor and an untrusted but highly capable model (Claude 3.7 Sonnet) create an avenue for attack. These attacks could be facilitated through the untrusted model or carried out directly by it.

Introducing Transluce — A Letter from the Founders
Wen Xing · 7mo · 1 · 0

I really enjoyed using the Observability Interface, especially the way it clusters top neuron activations. I had been trying to debug model letter-counting errors with SAEs for a while, and finally got it working in Transluce! It turns out the fix has a similar shape to the one for the 9.8 vs. 9.11 comparison mistake.

Here are the steps I took.
✨Asked "How many rs are in berry?" → Llama said ZERO
↓Used the Observability Interface to visualize neuron activations
↓Found clusters related to Currency/Indian cities (rupee Rs)
↓Suppressed those neurons
✅Llama correctly answered: "There are 2 Rs in berry"
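
To make the "suppress those neurons" step concrete, here is a minimal toy sketch in plain Python. It is not Llama and not the Observability Interface's API: the two-unit network, its weights, the labels, and the function names are all hypothetical, chosen only so that one hidden unit (standing in for the "rupee Rs" cluster) pushes the answer toward "zero" and zeroing it flips the answer.

```python
# Toy illustration only: a hand-built 2-unit network, NOT Llama and
# NOT the Observability Interface. All names and weights are assumptions.

LABELS = ["two", "zero"]  # hypothetical answer options

# Hidden unit 0 stands in for a "letter counting" feature,
# hidden unit 1 for the spurious "currency (rupee Rs)" feature.
W_IN = [[1.0, 0.0],
        [0.0, 1.0]]
# Output logits: "two" reads unit 0, "zero" reads unit 1 more strongly.
W_OUT = [[2.0, 0.0],
         [0.0, 3.0]]


def hidden_activations(inputs, suppressed=frozenset()):
    """ReLU hidden layer; units listed in `suppressed` are forced to 0."""
    acts = []
    for j, weights in enumerate(W_IN):
        pre = sum(w * x for w, x in zip(weights, inputs))
        acts.append(0.0 if j in suppressed else max(0.0, pre))
    return acts


def predict(inputs, suppressed=frozenset()):
    """Return the label with the largest logit."""
    hidden = hidden_activations(inputs, suppressed)
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]
    return LABELS[logits.index(max(logits))]


prompt = [1.0, 1.0]  # both features fire on the toy "prompt"
print(predict(prompt))                  # baseline answer
print(predict(prompt, suppressed={1}))  # answer with unit 1 suppressed
```

In this toy setup the baseline picks "zero" because the currency unit dominates, and suppressing that unit lets the letter-counting unit win. A real model has many more neurons and the intervention happens inside the interface, so treat this only as a picture of the idea.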

See screenshots here: https://x.com/FirebirdWen/status/1904285942095213015

 

Posts
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability (12 karma, 1d, 0 comments)
Vulnerability in Trusted Monitoring and Mitigations (17 karma, 5mo, 1 comment)