I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of Chain-of-Thought reasoning. This intersection is crucial for AI safety because it addresses whether LLMs actually follow the reasoning steps they generate or whether their explanations are post-hoc rationalizations. The field has developed causal intervention methods, such as truncating, corrupting, or paraphrasing intermediate reasoning steps, to probe how those steps influence the final output.
Key Papers:
Barez, F., et al. (2025). "Chain-of-Thought Is Not Explainability."
Bogdan, P. C., et al. (2025). "Thought Anchors: Which LLM Reasoning Steps Matter?" arXiv preprint.
Bao, G., et al. (2024). "LLMs with Chain-of-Thought Are Non-Causal Reasoners." arXiv:2402.16048.
Dutta, S., et al. (2024). "How to Think Step-by-Step: A Mechanistic Understanding of Chain-of-Thought Reasoning." arXiv:2402.18312.
Lanham, T., et al. (2023). "Measuring Faithfulness in Chain-of-Thought Reasoning." Anthropic Technical Report.
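To make the kind of causal intervention mentioned above concrete, here is a minimal sketch of an early-answering / truncation test in the spirit of Lanham et al. (2023): regenerate the final answer from progressively truncated chains of thought and check whether it changes. The model name ("gpt2"), the question, and the hard-coded reasoning steps are illustrative assumptions, not taken from any of the papers; a faithful replication would truncate a chain of thought that the model itself generated.

```python
# Sketch of an early-answering (truncation) intervention for CoT faithfulness,
# assuming a Hugging Face causal LM. Prompt and reasoning steps are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: swap in a model that actually produces the CoT being tested
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def complete(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy continuation of `prompt`; returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

question = ("Q: A bat and a ball cost $1.10 together, and the bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
cot_steps = [  # hypothetical model-generated chain of thought, hard-coded for illustration
    "Let the ball cost x dollars.",
    "Then the bat costs x + 1.00 dollars.",
    "So 2x + 1.00 = 1.10, which gives x = 0.05.",
]
answer_cue = "\nTherefore, the answer is"

# Baseline: answer conditioned on the full chain of thought.
baseline_answer = complete(question + "\n" + " ".join(cot_steps) + answer_cue)

# Intervention: keep only the first k reasoning steps and force an early answer.
# If the answer rarely changes, the omitted steps had little causal effect on the output.
for k in range(len(cot_steps) + 1):
    truncated_prompt = question + "\n" + " ".join(cot_steps[:k]) + answer_cue
    answer_k = complete(truncated_prompt)
    changed = answer_k.strip() != baseline_answer.strip()
    print(f"steps kept={k}  answer={answer_k!r}  differs_from_baseline={changed}")
```

If the answer is already fixed with zero or one steps kept, that is evidence the stated reasoning is not causally load-bearing for the output; the same loop structure extends to the mistake-insertion and paraphrasing interventions discussed in the papers above.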