Makes sense! As we’ve seen, while Chain-of-Thought (CoT) reasoning is highly interpretable, it doesn’t necessarily reflect the model’s true internal reasoning. Interpretability techniques might help us decode "neuralese" enough to verify whether a CoT explanation is faithful to the model’s actual reasoning process. This could also help detect covert communication between cooperating, unaligned agents. As Nanda pointed out, though, current methods are noisy and can be gamed by the system (for example, by hiding its intentions through more complex forms of superposition).
I agree with both of you that interpretability is a powerful tool, but one with real limitations. However, given the uncertainty around whether it will scale, our best strategy may be to use it to ensure incremental alignment. That, in turn, could help accelerate progress in both interpretability and control research.
> Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you've found deceptive AI (which I'm somewhat skeptical you'll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models.
Spotting traces of suspicious reasoning through mechanistic interpretability also seems useful in both the convincing-people and the deployment contexts. For example, a model generating a cake recipe shouldn't be reasoning about complex bioengineering concepts; if such concepts are present, interpretability methods might flag them as potential signs of misalignment. The same mechanism could serve as a red flag in production, identifying when a model is acting in unexpected or unsafe ways, as part of the layered approach Nanda mentioned. A rough sketch of what such a monitor might look like is below.
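To make the "red flag" idea concrete, here is a minimal sketch of a deployment-time monitor, assuming you already have (a) per-token residual-stream activations from the model and (b) a set of concept directions for topics that should not appear for a given task (e.g. from linear probes or SAE decoder rows). Everything here is illustrative and hypothetical (`flag_off_topic_concepts`, `concept_directions`, the threshold), not an existing API or anyone's actual method.

```python
import numpy as np

def flag_off_topic_concepts(
    activations: np.ndarray,   # (n_tokens, d_model) residual-stream activations
    concept_directions: dict,  # concept name -> (d_model,) direction (probe / SAE row)
    threshold: float = 0.3,    # cosine-similarity cutoff; would need task-specific calibration
) -> dict:
    """Return, per concept, the token positions whose activations align with that concept.

    Hypothetical monitor: if a cake-recipe request lights up a 'bioengineering'
    direction, surface it for human review rather than silently passing it through.
    """
    # Normalise so the dot product below is a cosine similarity.
    acts = activations / (np.linalg.norm(activations, axis=-1, keepdims=True) + 1e-8)

    flags = {}
    for name, direction in concept_directions.items():
        direction = direction / (np.linalg.norm(direction) + 1e-8)
        scores = acts @ direction  # (n_tokens,)
        hits = np.where(scores > threshold)[0]
        if hits.size > 0:
            flags[name] = {
                "token_indices": hits.tolist(),
                "max_score": float(scores.max()),
            }
    return flags


# Toy usage with random data standing in for real model activations.
rng = np.random.default_rng(0)
d_model = 512
acts = rng.normal(size=(20, d_model))
directions = {
    "bioengineering": rng.normal(size=d_model),
    "deception": rng.normal(size=d_model),
}
print(flag_off_topic_concepts(acts, directions))
```

Of course, the hard parts are getting trustworthy concept directions and calibrating the threshold, which is exactly where the noisiness Nanda points to bites.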
I look forward to that post!