Mechanistic interpretability as reward signal for RL training of LLMs