Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines
Note: This is a research update sharing preliminary results as part of ongoing work.
Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions. This holds across multiple models and experimental setups.
Summary & Main...
Mar 2034
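To make the Figure 1 caption concrete, here is a minimal sketch of how a difference-of-means direction might be built and used to perturb an activation. This is not the authors' code: the helper names (`diff_of_means_direction`, `perturb`), the 8-dimensional toy activations, and the synthetic "English" and "Mandarin" samples are all assumptions made for illustration. Measuring the downstream response would require running a real model on the perturbed activation, which this sketch does not attempt.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Rescale a vector to unit Euclidean norm."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def diff_of_means_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_a to the mean of acts_b."""
    mu_a, mu_b = mean_vector(acts_a), mean_vector(acts_b)
    return unit([b - a for a, b in zip(mu_a, mu_b)])


def perturb(activation, direction, magnitude):
    """Shift an activation by `magnitude` along a unit-norm direction."""
    return [x + magnitude * d for x, d in zip(activation, direction)]


# Hypothetical stand-ins for activations collected on two contrastive
# prompt sets. The "Mandarin" set is offset along coordinate 0 only.
rng = random.Random(0)
dim = 8
english = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
mandarin = [
    [rng.gauss(0.0, 1.0) + (2.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(200)
]

contrastive_dir = diff_of_means_direction(english, mandarin)
# Random baseline direction, normalized so magnitudes are comparable.
random_dir = unit([rng.gauss(0.0, 1.0) for _ in range(dim)])

base = english[0]
magnitudes = (0.5, 1.0, 2.0)
# In a real experiment each perturbed activation would be patched back
# into the model and the change in downstream activations recorded.
perturbed_contrastive = [perturb(base, contrastive_dir, m) for m in magnitudes]
perturbed_random = [perturb(base, random_dir, m) for m in magnitudes]
```

Normalizing both directions to unit length means the perturbation magnitude has the same meaning for the contrastive direction and the random baseline, which is what a comparison across direction types needs.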