Making Linear Probes Interpretable
Alright, so I've been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to directly probe an AI's own ontology, the whole thing kinda broke down. Linear probes find directions that work, but I didn't know WHY they work.
Dec 18, 2025
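To make the "directions that work" point concrete, here's a minimal sketch of what a linear probe is: a logistic regression trained on activation vectors to separate two concepts. Everything here is synthetic and hypothetical (the dimension, the planted direction, the data) — in practice the vectors would come from a model's residual stream.

```python
# Minimal linear probe sketch on synthetic "hidden states" (stdlib only).
# NOTE: all names and data here are illustrative, not from any real model.
import math
import random

random.seed(0)
DIM = 16

# Plant a ground-truth direction: class-1 vectors are shifted along it.
direction = [1.0 if i < 4 else 0.0 for i in range(DIM)]

def sample(label):
    base = [random.gauss(0, 1) for _ in range(DIM)]
    if label == 1:
        base = [b + 2.0 * d for b, d in zip(base, direction)]
    return base

data = [(sample(y), y) for y in [0, 1] * 200]

# Train logistic regression with plain batch gradient descent.
w = [0.0] * DIM
b = 0.0
lr = 0.1
for _ in range(200):
    gw = [0.0] * DIM
    gb = 0.0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        err = p - y
        gw = [g + err * xi for g, xi in zip(gw, x)]
        gb += err
    n = len(data)
    w = [wi - lr * g / n for wi, g in zip(w, gw)]
    b -= lr * gb / n

# The probe "works" — high accuracy on the synthetic task...
acc = sum(
    ((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1))
    for x, y in data
) / len(data)
print(f"probe accuracy: {acc:.2f}")
# ...but all you get back is w: a single direction in activation space,
# with no explanation of WHY that direction separates the classes.
```

That last comment is the whole problem this post is about: the learned weight vector `w` classifies well, but by itself it tells you nothing about what the direction means.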