Dmitrii Kharlapenko

Comments

Self-explaining SAE features
Dmitrii Kharlapenko, 1y

By "input features", do you mean the SAE encoder weights? We did not look into them.

Self-explaining SAE features
Dmitrii Kharlapenko, 1y

Thanks! We did try to use it in the repeat setting to make the model produce more than a single token, but it did not work well.

As far as I remember, it also did not improve the meaning prompt much.

Posts

Evolutionary prompt optimization for SAE feature visualization (karma 22, 1y, 0 comments)
SAE features for refusal and sycophancy steering vectors (karma 29, 1y, 4 comments)
Extracting SAE task features for in-context learning (karma 31, 1y, 1 comment)
Self-explaining SAE features (karma 62, 1y, 13 comments)