AI #33: Cool New Interpretability Paper