Anthropic announces interpretability advances. How much does this advance alignment?