Makes sense! As we’ve seen, while Chain-of-Thought (CoT) reasoning is highly interpretable, it doesn’t necessarily reflect the model’s true internal reasoning. Interpretability techniques might help us decode "neuralese" enough to verify whether a CoT explanation is faithful to the model’s actual reasoning process. This could also help detect covert communication between cooperating, unaligned agents. As Nanda pointed out, though, current methods are noisy and can be gamed by the system (for example, by hiding its intentions through more complex forms of steganography).
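To make the "verify faithfulness" idea slightly more concrete, here's a minimal sketch of one very crude check: training a linear probe on the model's activations *before* the CoT is generated to see whether the final answer is already decided. This is just an illustration under assumed names and shapes (the cached activations, labels, and dimensions are all synthetic placeholders), not an implementation of any specific published method.

```python
# Crude faithfulness signal: if a simple probe can predict the model's final
# answer from activations taken *before* any CoT tokens are produced, the CoT
# may be post-hoc rationalization rather than the computation that produced
# the answer. All data below is a synthetic placeholder; real use would cache
# activations from the model under study at the pre-CoT token position.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical cached residual-stream activations, one row per prompt.
n_prompts, d_model = 2000, 512
pre_cot_activations = rng.normal(size=(n_prompts, d_model))

# Hypothetical final answers (e.g., 0 = "no", 1 = "yes").
final_answers = rng.integers(0, 2, size=n_prompts)

X_train, X_test, y_train, y_test = train_test_split(
    pre_cot_activations, final_answers, test_size=0.25, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = probe.score(X_test, y_test)

# Accuracy well above chance would suggest the answer is largely determined
# before any reasoning is verbalized (CoT not load-bearing); accuracy near
# chance is only weak evidence in the other direction.
print(f"Pre-CoT probe accuracy: {acc:.2f}")
```

Of course, as the comment above notes, a sufficiently capable system could still encode the relevant information in ways a linear probe won't pick up, which is exactly the noisiness/gaming worry.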
I look forward to that post!