Do Bilinear MLPs Actually Learn Cleaner Circuits?
I've been reading about mechanistic interpretability for the past few months and came across the claim that bilinear MLPs are "interpretable by construction." Pearce et al.'s paper demonstrates that bilinear layers can be decomposed into interaction tensors, which allows the layer to be analyzed directly. That's interesting, but it left...
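To make the "interaction tensor" idea concrete, here is a minimal NumPy sketch. It assumes a bilinear layer of the form (W x) * (V x) with no elementwise nonlinearity. The names `W`, `V`, `bilinear_layer`, and `interaction_tensor` are hypothetical and chosen for illustration; this is not the paper's code, only a check that the layer's output can be rewritten as a quadratic form per hidden unit.

```python
import numpy as np

def bilinear_layer(x, W, V):
    """Assumed bilinear layer: elementwise product of two linear maps."""
    return (W @ x) * (V @ x)

def interaction_tensor(W, V):
    """Build B with B[k, i, j] = W[k, i] * V[k, j], symmetrized over (i, j).

    Symmetrizing does not change x^T B[k] x, but it makes each slice a
    symmetric matrix, so it has a real eigendecomposition.
    """
    B = np.einsum("ki,kj->kij", W, V)
    return 0.5 * (B + B.transpose(0, 2, 1))

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 3  # toy sizes, illustrative only
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
x = rng.normal(size=d_in)

B = interaction_tensor(W, V)

# Output k of the layer equals the quadratic form x^T B[k] x.
direct = bilinear_layer(x, W, V)
via_tensor = np.einsum("kij,i,j->k", B, x, x)
max_abs_diff = float(np.max(np.abs(direct - via_tensor)))

# Each slice can be eigendecomposed; hidden unit 0 is then a weighted
# sum of squared projections of x onto the eigenvectors.
eigvals, eigvecs = np.linalg.eigh(B[0])
unit0_from_eig = float(((eigvecs.T @ x) ** 2) @ eigvals)
```

Here `direct` and `via_tensor` should agree up to floating-point error, and `unit0_from_eig` should match `direct[0]`. The point of the sketch is that everything needed for this analysis lives in the weights: no input data is required to build `B` or its eigenvectors.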
Jan 123