To elaborate a bit (but you should go read the posts!), the classic Most Forbidden Technique scenario goes like this: you detect "bad thoughts" in the CoT/activations, you penalize them, and the model learns to have the same bad thoughts in ways your detector can't see.
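A minimal sketch of that setup, with every name (model, probe, batch keys, penalty weight) as an illustrative stand-in rather than anyone's actual code: the detector's score is added straight into the training loss, so gradient descent gets rewarded for anything that makes the detector go quiet, including hiding the thought rather than dropping it.

```python
import torch

def training_step(model, probe, batch, optimizer, penalty_weight=1.0):
    # Classic "forbidden" setup: task loss plus a penalty whenever the probe fires.
    # `model` is assumed to return (logits, activations); `probe` maps activations
    # to a per-example "bad thought" score. Both are hypothetical stand-ins.
    logits, activations = model(batch["inputs"])
    task_loss = torch.nn.functional.cross_entropy(logits, batch["labels"])
    bad_score = probe(activations).mean()  # detector output becomes a loss term
    loss = task_loss + penalty_weight * bad_score
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # The optimizer now descends on whatever makes the probe quiet, which need
    # not be the same thing as actually removing the behavior.
    return task_loss.item(), bad_score.item()
```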
Goodfire seems to be doing something structurally different, by decomposing the gradient into semantic components and choosing which components to apply. Does that really make the difference, though? I don't know!
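For contrast, here is my guess at the rough shape of "decompose the gradient into semantic components and choose which to apply," assuming you already have an interpretable set of feature directions (SAE-style) for the relevant space. This is a hedged sketch of the general idea, not Goodfire's actual method.

```python
import torch

def filtered_gradient(grad, feature_dirs, keep_mask):
    # Decompose a (flattened) gradient into components along semantic directions,
    # then rebuild it from only the components we choose to apply.
    # `feature_dirs` is a (k, d) matrix of roughly orthonormal directions and
    # `keep_mask` a (k,) 0/1 vector -- both hypothetical stand-ins for whatever
    # basis an interpretability tool provides.
    coeffs = feature_dirs @ grad                 # (k,) semantic components
    kept = coeffs * keep_mask.to(coeffs.dtype)   # drop components we refuse to train on
    semantic_part = feature_dirs.t() @ kept      # recombine the kept components
    residual = grad - feature_dirs.t() @ coeffs  # whatever the basis doesn't explain
    return semantic_part + residual

# Usage sketch: after loss.backward(), filter each parameter's gradient
# before optimizer.step(), e.g.
#   p.grad = filtered_gradient(p.grad.flatten(), dirs, mask).view_as(p.grad)
```

The structural difference, on this reading, is that the detector's output never enters the loss; it only gates which gradient components are applied, so there is no direct gradient pushing the model to fool the detector. Whether indirect pressure still leaks through is exactly the open question.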
Goodfire wrote "Intentionally designing the future of AI", a post about training on interpretability.
This seems like an instance of The Most Forbidden Technique, which has been warned against over and over: putting optimization pressure on interpretability technique [T] eventually degrades [T].
Goodfire claims to be aware of the associated risks and to be managing them.
Are they properly managing those risks? I would love to get your thoughts on this.