Goodfire and Training on Interpretability

Satya Benson

Interpretability (ML & AI), Optimization, AI

6th Feb 2026

Goodfire wrote "Intentionally designing the future of AI", a post about training on interpretability.

This seems like an instance of The Most Forbidden Technique, which has been warned against over and over: optimization pressure on an interpretability technique [T] eventually degrades [T].

Goodfire claims to be aware of the associated risks and to be managing them.

Are they properly managing those risks? I would love to get your thoughts on this.

Satya Benson (2h):

To elaborate a bit (but you should go read the posts!), the classic Most Forbidden Technique scenario goes like this: you detect "bad thoughts" in the CoT/activations, you penalize them, and the model learns to have the same bad thoughts in ways your detector can't see.

Goodfire seems to be doing something structurally different, by decomposing the gradient into semantic components and choosing which components to apply. Does that really make the difference, though? I don't know!
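
To make "decomposing the gradient and choosing which components to apply" concrete, here is a toy sketch. It is not Goodfire's actual method, which I haven't seen in code. It assumes the "semantic components" are projections of the gradient onto a small set of orthonormal directions, and the names `feature_a`, `feature_b`, `decompose_gradient`, and `filtered_gradient` are made up for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def decompose_gradient(grad, directions):
    """Split grad into one component per named direction, plus a residual.

    Assumption: the directions are orthonormal, so each component is just
    the projection of grad onto that direction.
    """
    components = {}
    residual = list(grad)
    for name, d in directions.items():
        coeff = dot(grad, d)
        comp = [coeff * x for x in d]
        components[name] = comp
        residual = [r - c for r, c in zip(residual, comp)]
    return components, residual


def filtered_gradient(grad, directions, keep):
    """Sum only the chosen components; everything else (including the
    unexplained residual) is dropped from the update."""
    components, _ = decompose_gradient(grad, directions)
    out = [0.0] * len(grad)
    for name in keep:
        out = [o + c for o, c in zip(out, components[name])]
    return out


# Hypothetical labeled directions in a 3-dimensional parameter space.
directions = {
    "feature_a": [1.0, 0.0, 0.0],
    "feature_b": [0.0, 1.0, 0.0],
}
grad = [0.5, -2.0, 0.25]

# Apply only the part of the gradient attributed to feature_a.
update = filtered_gradient(grad, directions, keep=["feature_a"])
```

Even in this toy version, the selection step depends entirely on the directions meaning what their labels say, and the residual is the part of the gradient the decomposition can't explain at all. Whether to drop or apply that residual is a choice I made up here, not something I know about their setup.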
