The gap between AI safety and mechanistic interpretability, both conceptual and methodological, is large. Most mechanistic interpretability techniques yield insights into model internals, but these are rarely scalable, let alone applicable in production. Conversely, many AI safety methods (such as post-training alignment) treat models as black boxes.
Moreover, current safety problems such as inner alignment, goal drift, and long-horizon misalignment are mostly framed in behavioural rather than mechanistic terms. This limits the ability of interpretability research to contribute directly to safety. Reframing AI safety problems from a mechanistic perspective could help mechanistic interpretability move from explanatory analysis to a tool for monitoring and intervening in safety-relevant model behaviours. I am happy to see that this shift is slowly happening.
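To make "monitoring and intervening" concrete, here is a minimal sketch of what such a tool could look like in practice: a forward hook that watches a hidden activation for alignment with a direction a probe has flagged as safety-relevant, and projects that component out when it exceeds a threshold. The toy model, the layer choice, the `unsafe_direction`, and the threshold are all hypothetical stand-ins for illustration, not a specific method from the literature.

```python
# Sketch: mechanistic monitoring + intervention via an activation hook.
# All names (model, monitored_layer, unsafe_direction) are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 16
# Toy stand-in for a transformer block whose hidden activations we monitor.
model = nn.Sequential(nn.Linear(8, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 4))
monitored_layer = model[1]  # hook on the hidden activations after the ReLU

# Hypothetical direction a linear probe has associated with unsafe behaviour.
unsafe_direction = torch.randn(hidden_dim)
unsafe_direction = unsafe_direction / unsafe_direction.norm()

def monitor_and_intervene(module, inputs, output, threshold=2.0):
    # Monitoring: how strongly does the activation point along the direction?
    score = output @ unsafe_direction                      # shape: (batch,)
    print(f"unsafe-direction activation: {score.tolist()}")
    # Intervention: above the threshold, remove that component of the activation.
    mask = score.abs() > threshold
    if mask.any():
        projection = score.unsqueeze(-1) * unsafe_direction
        output = torch.where(mask.unsqueeze(-1), output - projection, output)
    return output  # returning a tensor from a forward hook replaces the output

handle = monitored_layer.register_forward_hook(monitor_and_intervene)
logits = model(torch.randn(2, 8))
handle.remove()
```

The point of the sketch is the shape of the workflow: the safety property is stated in terms of internals (a direction in activation space) rather than behaviour, which is what makes both the monitoring and the intervention possible.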