Isolating Vector Additions

From the first post:

Team Shard's recent activation addition methodology for steering GPT-2-XL provokes many questions about what the structure of internal model computation must be in order for their edits to work. Unfortunately, interpreting the insides of neural networks is known to be very difficult, so the question becomes: what is the minimal set of properties a network must have in order for activation additions to work?

This sequence collects my posts exploring this question.

26 Activation additions in a simple MNIST network

Garrett Baker

3y

0

22 Activation additions in a small residual network

Garrett Baker

3y

4