LESSWRONG
Breno Carvalho
Comments
Interpretability Will Not Reliably Find Deceptive AI
Breno Carvalho · 4mo

I look forward to that post!

Interpretability Will Not Reliably Find Deceptive AI
Breno Carvalho · 4mo

Makes sense! As we’ve seen, while Chain-of-Thought (CoT) reasoning is highly interpretable, it doesn’t necessarily reflect the model’s true internal reasoning. Interpretability techniques might help us decode "neuralese" enough to verify whether a CoT explanation is faithful to the model’s actual reasoning process. This could also help detect covert communication between cooperating, unaligned agents. As Nanda pointed out, though, current methods are noisy and can be gamed by the system (for example, by hiding its intentions through more complex forms of superposition).

I agree with both of you that interpretability is a powerful tool with real limitations. However, given the uncertainty around its scalability, our best strategy may be to use it to ensure incremental alignment. That, in turn, could help accelerate progress in both interpretability and control research.

> Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models.

It also seems that spotting traces of suspicious reasoning through mechanistic interpretability would be useful in both contexts: convincing people and deployment. For example, if a model is generating a cake recipe, it shouldn’t be reasoning about complex bioengineering concepts. If such concepts are present, interpretability methods might flag them as potential signs of misalignment. The same mechanism could serve as a red flag in production, identifying when a model is acting in unexpected or unsafe ways, as part of the layered approach Nanda mentioned.
