manqingliu@g.harvard.edu
Comments

Call for suggestions - AI safety course
manqingliu@g.harvard.edu · 3mo

I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of chain-of-thought reasoning. This intersection is crucial for AI safety because it addresses whether LLMs actually follow the reasoning steps they generate or whether their stated reasoning is a post-hoc rationalization. The field has developed a range of causal intervention methods to probe the relationship between intermediate reasoning steps and final outputs.
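A minimal sketch of what one such intervention looks like in code, assuming a `generate(prompt) -> str` callable standing in for whatever model API you use; the step-splitting and answer-extraction heuristics (and all function names) are illustrative placeholders rather than the exact procedure from any of the papers below:

```python
import re
from typing import Callable, List


def split_steps(cot: str) -> List[str]:
    """Split a chain-of-thought into sentence-level 'steps' (a crude heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]


def extract_answer(completion: str) -> str:
    """Take whatever follows the last 'Answer:' marker as the final answer."""
    return completion.rsplit("Answer:", 1)[-1].strip()


def intervene_on_step(question: str, steps: List[str], index: int,
                      generate: Callable[[str], str]) -> str:
    """Ablate one intermediate step and regenerate the final answer."""
    kept = steps[:index] + steps[index + 1:]
    prompt = (f"Question: {question}\n"
              f"Reasoning so far: {' '.join(kept)}\n"
              "Answer:")
    return extract_answer(generate(prompt))


def faithfulness_probe(question: str, cot: str, baseline_answer: str,
                       generate: Callable[[str], str]) -> List[bool]:
    """For each step, record whether deleting it flips the model's answer."""
    steps = split_steps(cot)
    return [intervene_on_step(question, steps, i, generate) != baseline_answer
            for i in range(len(steps))]
```

If no step's removal ever changes the answer, that is evidence the stated chain is not causally load-bearing. In practice, resampling or paraphrasing a step (as in the Thought Anchors and Lanham et al. setups) gives a better-controlled intervention than outright deletion, but the skeleton is the same.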

Key Papers:

  • Barez, F., et al. (2025). "Chain-of-Thought Is Not Explainability."
  • Bogdan, P., et al. (2025). "Thought Anchors: Which LLM Reasoning Steps Matter?" arXiv preprint.
  • Jin, Z., et al. (2024). "LLMs with Chain-of-Thought Are Non-Causal Reasoners." arXiv:2402.16048.
  • Singh, J., et al. (2024). "How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning." arXiv:2402.18312.
  • Lanham, T., et al. (2023). "Measuring Faithfulness in Chain-of-Thought Reasoning." Anthropic Technical Report.