LESSWRONG
LW

961
Wen Xing
15120
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Vulnerability in Trusted Monitoring and Mitigations
Wen Xing4mo00

I just updated the post to further clarify the threat model: We explore a novel threat model where the smaller trusted monitor (GPT-3.5 Turbo) is the most vulnerable part of a system and the interactions between an untrusted but highly capable model (Claude 3.7 Sonnet) and the monitor create an avenue for attack. These attacks could be facilitated through the untrusted model, or directly exploited by the untrusted model.

Reply
Introducing Transluce — A Letter from the Founders
Wen Xing6mo10

I really enjoyed using the Observability Interface, especially the way it clusters top neuron activations. I have explored debugging model letter counting errors using SAE for a while, and finally got it working in Transluce! It turns out it has a similar shape to fixing the 9.8, 9.11 comparison mistake. 
 

Here are the steps I took.
✨Asked "How many rs are in berry?" → Llama said ZERO
↓Used the Observability Interface to visualize neuron activations
↓Found clusters related to Currency/Indian cities (rupee Rs)
↓Suppressed those neurons
✅Llama correctly answered: "There are 2 Rs in berry"

See screenshots here: https://x.com/FirebirdWen/status/1904285942095213015

 

Reply
15Vulnerability in Trusted Monitoring and Mitigations
4mo
1