Goodfire and Training on Interpretability
Goodfire wrote "Intentionally designing the future of AI" about training on interpretability. This seems like an instance of The Most Forbidden Technique, which has been warned against over and over: optimization pressure on an interpretability technique [T] eventually degrades [T]. Goodfire claims they are aware of the associated risks and...
Thank you - this matches my current thinking