Francisco Ferreira da Silva

Mech interp. Senior Fellow with Pivotal Research. ffsilva.io

8mo

Compressed Computation under L⁴ Loss is likely Computation in Superposition

Summary: Neural networks are widely assumed to use superposition to represent more features than they have dimensions. A stronger claim is that they also compute in superposition (CiS), i.e., implement more nonlinear functions than they have neurons (Hänni et al. 2024). CiS remains poorly understood, and until recently there were...

Jul 15•22

Evidence for feature-specific error correction in LLMs

LLMs are commonly assumed to use superposition to represent more features than they have dimensions. The evidence for this is mostly indirect — chiefly the success of SAEs at extracting interpretable directions. A stronger claim is that models also compute in superposition, and for that we have only theoretical evidence....

Jul 14•24

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Note: This is a research update sharing preliminary results as part of ongoing work. Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions. This holds across multiple models and experimental setups. Summary & Main...

Mar 20•39