Interpretability of SAE Features Representing Check in ChessGPT
Jonathan Kutasov · 1y

Thanks for the suggestion! This sounds pretty cool, and I think it would be worth trying.

One thing that might make this a bit tricky is choosing the right subset of the data to feed into Claude. Each feature fires very rarely, so it is easy to fool yourself into thinking you have found a good classifier when you haven’t.

For example, many of the features we found fire only on positions with check, yet many positions with check don’t activate them. The problem we ran into is that check is so infrequent that you can only get a good number of check samples by drawing a huge number of examples overall or by upweighting the check class in your sampling.

So if we show Claude all the examples where a feature fired and an equal number of randomly chosen examples where it didn’t, chances are that just using “is in check” will look like a great classifier, because almost none of the random negatives will contain check. I think we can get around this by prompting Claude to find as many restrictions as possible, but it is an interesting failure mode that might come up.
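To make the sampling issue concrete, here is a rough simulation sketch. The numbers are made up for illustration and are not from our experiments: I assume 2% of positions contain check and that a hypothetical feature fires on 30% of those and never otherwise. It compares how well “is in check” predicts the feature on a balanced sample versus on check positions only.

```python
import random

# All rates below are illustrative assumptions, not measured values.
P_CHECK = 0.02             # assumed fraction of positions that contain check
P_FIRE_GIVEN_CHECK = 0.30  # assumed fraction of check positions where the feature fires


def simulate(n_positions=200_000, seed=0):
    """Return a list of (in_check, fired) pairs for synthetic positions."""
    rng = random.Random(seed)
    positions = []
    for _ in range(n_positions):
        in_check = rng.random() < P_CHECK
        # The hypothetical feature never fires outside of check.
        fired = in_check and rng.random() < P_FIRE_GIVEN_CHECK
        positions.append((in_check, fired))
    return positions


def accuracy(sample):
    """Accuracy of the naive classifier that predicts fired == in_check."""
    return sum(in_check == fired for in_check, fired in sample) / len(sample)


positions = simulate()
fired = [p for p in positions if p[1]]
not_fired = [p for p in positions if not p[1]]

# Balanced sample: every firing example plus an equal number of random non-firing ones.
balanced = fired + random.Random(1).sample(not_fired, len(fired))

# Harder sample: only positions that contain check.
check_only = [p for p in positions if p[0]]

balanced_acc = accuracy(balanced)
check_only_acc = accuracy(check_only)
print(f"balanced sample accuracy: {balanced_acc:.3f}")
print(f"check-only accuracy:      {check_only_acc:.3f}")
```

Under these assumptions the naive rule looks close to perfect on the balanced sample but does badly once the negatives are also check positions, which is the kind of sample that would push Claude toward finding the extra restrictions.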
