Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines — LessWrong