Interesting post. We explored a similar direction during a MATS stream, training different MoE designs to get more interpretable experts. We started by just testing increasingly sparse MoEs (partly inspired by that Monet paper), on the logic that smaller experts = tighter specialization, then moved on to things like orthogonality constraints.
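For reference, the rough shape of the thing we were training looks something like this minimal PyTorch sketch (the expert sizes, routing details, and exact penalty are all illustrative, not our actual configs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy MoE layer: many small experts, few active per token."""

    def __init__(self, d_model=256, n_experts=512, d_expert=16, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a tiny two-layer MLP: d_model -> d_expert -> d_model.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x):  # x: (batch, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # (batch, k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for j in range(self.k):  # per-slot loop for clarity, not speed
            h = F.relu(torch.einsum("bd,bde->be", x, self.w_in[idx[:, j]]))
            out += weights[:, j:j+1] * torch.einsum("be,bed->bd", h, self.w_out[idx[:, j]])
        return out

def orthogonality_penalty(moe: SparseMoE) -> torch.Tensor:
    """Auxiliary loss pushing experts' read-in directions apart."""
    v = F.normalize(moe.w_in.flatten(1), dim=-1)  # one unit vector per expert
    gram = v @ v.T                                # pairwise cosine similarities
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()
```

The penalty here is one crude way to cash out "orthogonality constraints": penalize pairwise cosine similarity between experts' flattened input weights.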
The results initially made us pretty pessimistic: individual experts didn't seem to specialize in anything you wouldn't get from just running k-means on the residual stream (i.e., no real interp benefit). This is sort...
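For concreteness, the k-means baseline being compared against here is just the obvious thing, roughly the sketch below (the activation dump and cluster count are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

# Residual-stream activations at some layer, shape (n_tokens, d_model);
# how you harvest these depends on your model and hooks. Hypothetical dump:
acts = np.load("resid_acts.npy")

# One cluster per would-be expert.
kmeans = KMeans(n_clusters=512, n_init=10, random_state=0).fit(acts)

# Crude interpretability probe: look at the tokens nearest each centroid
# and ask whether they form a more coherent group than the tokens a
# trained expert actually fires on.
dists = kmeans.transform(acts)                    # (n_tokens, n_clusters)
top_tokens_for_cluster_0 = np.argsort(dists[:, 0])[:20]
```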