kaustubh — LessWrong

kaustubh

2

4mo

Do Bilinear MLPs Actually Learn Cleaner Circuits?

I've been reading about mechanistic interpretability for the past few months and came across the claim that bilinear MLPs are "interpretable by construction." Pearce et al.'s paper demonstrates that bilinear layers can be decomposed into interaction tensors, which can then be analyzed directly from the weights. That's interesting, but it left...
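
To make the interaction-tensor idea concrete, here is a minimal, dependency-free sketch. It assumes the usual bias-free formulation of a bilinear layer, out = P((Wx) ⊙ (Vx)), and builds a third-order tensor B such that out[k] = Σ_ij B[k][i][j]·x[i]·x[j]. The names `bilinear_mlp`, `interaction_tensor`, and `apply_tensor`, the dimensions, and the random weights are all illustrative assumptions, not code from the paper.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def bilinear_mlp(x, W, V, P):
    """Bias-free bilinear MLP (assumed form): P @ ((W @ x) * (V @ x))."""
    hidden = [a * b for a, b in zip(matvec(W, x), matvec(V, x))]
    return matvec(P, hidden)

def interaction_tensor(W, V, P):
    """B[k][i][j] = sum_h P[k][h] * W[h][i] * V[h][j], symmetrized in (i, j).

    Symmetrizing does not change the quadratic form x^T B[k] x.
    """
    d_hidden, d_in, d_out = len(W), len(W[0]), len(P)

    def raw(k, i, j):
        return sum(P[k][h] * W[h][i] * V[h][j] for h in range(d_hidden))

    return [[[0.5 * (raw(k, i, j) + raw(k, j, i)) for j in range(d_in)]
             for i in range(d_in)]
            for k in range(d_out)]

def apply_tensor(B, x):
    """Evaluate out[k] = sum_ij B[k][i][j] * x[i] * x[j]."""
    n = len(x)
    return [sum(Bk[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
            for Bk in B]

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes and random weights, purely for illustration.
random.seed(0)
d_in, d_hidden, d_out = 4, 8, 3
W, V = rand_matrix(d_hidden, d_in), rand_matrix(d_hidden, d_in)
P = rand_matrix(d_out, d_hidden)
x = [random.gauss(0.0, 1.0) for _ in range(d_in)]

B = interaction_tensor(W, V, P)
direct = bilinear_mlp(x, W, V, P)
via_tensor = apply_tensor(B, x)

# The two routes should agree up to floating-point error.
max_gap = max(abs(a - b) for a, b in zip(direct, via_tensor))
```

Each slice `B[k]` is a symmetric matrix over pairs of input features, so in principle it can be inspected (for example, by eigendecomposition) without running the model on any data. Whether that makes the learned circuits cleaner in practice is a separate question.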