I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of Chain-of-Thought reasoning. This intersection is crucial for AI safety because it addresses whether LLMs actually follow the reasoning steps they generate or whether their explanations are post-hoc rationalizations. The field has developed causal intervention methods, such as truncating, corrupting, or paraphrasing intermediate reasoning steps, to probe how those steps influence the final output.
Key Papers:
Barez, F., et al. (2025). "Chain-of-Thought Is Not Explainability."
Bogdan, P. C., et al. (2025). "Thought Anchors: Which LLM Reasoning Steps Matter?" arXiv preprint.
Bao, G., et al. (2024). "LLMs with Chain-of-Thought Are Non-Causal Reasoners." arXiv:2402.16048.
Dutta, S., et al. (2024). "How to Think Step-by-Step: A Mechanistic Understanding of Chain-of-Thought Reasoning." arXiv:2402.18312.
Lanham, T., et al. (2023). "Measuring Faithfulness in Chain-of-Thought Reasoning." Anthropic Technical Report.
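To make the kind of causal intervention mentioned above concrete, here is a minimal sketch of an early-answering / truncation test in the spirit of Lanham et al. (2023): regenerate the final answer from progressively truncated chains of thought and check whether it changes. The model name ("gpt2"), the question, and the hard-coded reasoning steps are illustrative assumptions, not taken from any of the papers; a faithful replication would truncate a chain of thought that the model itself generated.

```python
# Sketch of an early-answering (truncation) intervention for CoT faithfulness,
# assuming a Hugging Face causal LM. Prompt and reasoning steps are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: swap in a model that actually produces the CoT being tested
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def complete(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy continuation of `prompt`; returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

question = ("Q: A bat and a ball cost $1.10 together, and the bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
cot_steps = [  # hypothetical model-generated chain of thought, hard-coded for illustration
    "Let the ball cost x dollars.",
    "Then the bat costs x + 1.00 dollars.",
    "So 2x + 1.00 = 1.10, which gives x = 0.05.",
]
answer_cue = "\nTherefore, the answer is"

# Baseline: answer conditioned on the full chain of thought.
baseline_answer = complete(question + "\n" + " ".join(cot_steps) + answer_cue)

# Intervention: keep only the first k reasoning steps and force an early answer.
# If the answer rarely changes, the omitted steps had little causal effect on the output.
for k in range(len(cot_steps) + 1):
    truncated_prompt = question + "\n" + " ".join(cot_steps[:k]) + answer_cue
    answer_k = complete(truncated_prompt)
    changed = answer_k.strip() != baseline_answer.strip()
    print(f"steps kept={k}  answer={answer_k!r}  differs_from_baseline={changed}")
```

If the answer is already fixed with zero or one steps kept, that is evidence the stated reasoning is not causally load-bearing for the output; the same loop structure extends to the mistake-insertion and paraphrasing interventions discussed in the papers above.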