Isolating Vector Additions

May 22, 2023 by Garrett Baker

From the first post:

Team Shard's recent activation addition methodology for steering GPT-2-XL provokes many questions about what the structure of internal model computation must be in order for their edits to work. Unfortunately, interpreting the insides of neural networks is known to be very difficult, so the question becomes: what is the minimal set of properties a network must have in order for activation additions to work?

This sequence collects my posts exploring this question.

Activation additions in a simple MNIST network (Garrett Baker)
Activation additions in a small residual network (Garrett Baker)