How a failed experiment broke (and fixed) my view on feature labels
> TL;DR > > In this document, I propose baez a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here. In the experiment, the labels generated via baez , its variant baez_last and eleuther_acts_top5 are scored via three benchmarks and...
May 2912