An introduction to language model interpretability — LessWrong