Activation additions in a small residual network

TurnTrout (2y)

Beforehand I was very confident that vector additions would work here, even though I knew that the fully connected additions didn't work. Before showing him the results, but after showing the results for the fully connected network, I asked TurnTrout for his prediction. He gave 85% that the additions would work.

I want to clarify that I had skimmed the original results and concluded that they "worked" in that 3-1 vectors got e.g. 1s to be classified as 3s. (This is not trivial, since not all 1 activations are the same!) However, those results "didn't work" in that they destroyed performance on non-1 images.

I thought I was making predictions on whether 3-1 vectors get 1s to be classified as 3s by this residual network. I guess I'm going to mark my prediction here as "ambiguous", in that case.

Garrett Baker (2y)

Oh, sorry. Editing post with correction.

Joel Burget (2y)

I have a couple of basic questions:

1. Shouldn't the diagonal elements in the perplexity table all be equal to the baseline (since the addition should be 0)?
2. I'm a bit confused about the use of perplexity here. The added vector introduces bias (away from one digit and towards another), so shouldn't we expect perplexity to increase? Eyeballing the visualizations, they do all seem to shift mass away from b and towards a.
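To make the first question concrete, here is a minimal sketch of an a-b activation addition on a toy residual network. Everything in it is an assumption made for illustration: the layer sizes, the random weights, the single residual block, and the helper names (`embed`, `logits_from_residual`, `steering_vector`, `patched_logits`) are hypothetical and are not taken from the post's code. It only shows why the patch is a no-op when a and b are the same input.

```python
import random

random.seed(0)
D_IN, D = 8, 4  # hypothetical sizes, chosen only for illustration


def rand_matrix(rows, cols, scale):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


# Random stand-in weights; a real experiment would use a trained network.
W_IN = rand_matrix(D, D_IN, 0.5)
W_BLOCK = rand_matrix(D, D, 0.5)
W_OUT = rand_matrix(10, D, 0.5)


def embed(x):
    """Map an input to the residual stream (stand-in for the early layers)."""
    return matvec(W_IN, x)


def logits_from_residual(h):
    """One residual block, h + relu(W h), followed by a linear readout."""
    block = [max(z, 0.0) for z in matvec(W_BLOCK, h)]
    return matvec(W_OUT, [p + q for p, q in zip(h, block)])


def steering_vector(x_a, x_b):
    """The a-b vector: residual activation on x_a minus that on x_b."""
    return [p - q for p, q in zip(embed(x_a), embed(x_b))]


def patched_logits(x, x_a, x_b):
    """Run x with the a-b vector added to its residual stream."""
    h = [p + q for p, q in zip(embed(x), steering_vector(x_a, x_b))]
    return logits_from_residual(h)


x, x_a, x_b = ([random.random() for _ in range(D_IN)] for _ in range(3))

# With a == b the steering vector is exactly zero, so the output is unchanged.
same = patched_logits(x, x_a, x_a) == logits_from_residual(embed(x))
print(same)
```

In this toy setup, an a-a patch leaves the logits exactly as they were, which is the behaviour that would make the diagonal of the table equal to the baseline.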

Garrett Baker (2y)

1. Yup. You should be able to see this in the chart.
2. You're right. However, the results from the Steering GPT-2-XL post showed that in GPT-2-XL, similar modifications had very little effect on model perplexity. The patched model also doesn't only shift weight from b to a; it has wonky effects on other digits too. For example, in the 3-1 patch for input 4, the weight given to 9 increased substantially. More interestingly, it is not uncommon to find examples which cause seemingly random digits to suddenly become the most likely. The 1-8 patch for input 9 is an example:

[1]

Note: Much of this section was written by giving ChatGPT my code and telling it to write a methodology section for a paper, then changing its use of "our" to "I" and "me". I have read what it wrote, and it seems to be accurate.

[2]

There was once the following text here:

Beforehand I was very confident that vector additions would work here, even though I knew that the fully connected additions didn't work. Before showing him the results, but after showing the results for the fully connected network, I asked TurnTrout for his prediction. He gave 85% that the additions would work.

But TurnTrout noted in the comments that this was in fact a correct/ambiguous prediction, since it made no claims about capability generalization. So I removed it, because it now seems irrelevant.

a\b	0	1	2	3	4	5	6	7	8	9
0	9.03E-02	5.75E+00	5.18E+00	1.29E+01	1.28E+01	1.14E+01	4.27E+00	8.05E+00	3.32E+00	6.45E+00
1	5.24E+00	9.36E-02	3.07E+00	9.94E+00	1.33E+00	6.92E+00	5.27E+00	1.23E+00	2.47E+00	5.08E+00
2	4.00E-01	1.22E+00	8.60E-02	6.01E+00	9.32E-01	5.87E+00	5.03E+00	1.57E+00	1.10E+00	2.66E+00
3	1.92E+01	2.51E+01	1.28E+01	8.27E-02	1.63E+01	3.51E+00	2.22E+01	1.52E+01	1.57E+01	1.32E+01
4	2.56E+00	5.20E-01	2.00E+00	9.09E+00	8.92E-02	1.42E+01	3.24E+00	6.87E-01	1.58E+00	5.40E-01
5	1.99E+01	1.44E+01	1.20E+01	6.48E+00	1.38E+01	7.99E-02	7.00E+00	1.61E+01	1.04E+01	1.15E+01
6	4.80E+00	8.57E+00	5.24E+00	2.80E+01	8.20E+00	1.19E+01	8.67E-02	1.45E+01	9.23E+00	1.53E+01
7	3.73E+00	1.87E+00	4.97E+00	4.70E+00	2.51E+00	7.28E+00	1.07E+01	8.18E-02	4.50E+00	4.12E+00
8	5.87E+00	1.42E+00	1.06E+00	2.83E+00	1.85E+00	6.82E+00	5.89E-01	3.40E+00	8.33E-02	7.98E-01
9	2.01E+00	5.93E+00	3.14E+00	2.39E+00	3.98E+00	7.03E+00	2.51E+00	1.28E+00	2.11E+00	8.68E-02
Normal	9.03E-02	9.36E-02	8.60E-02	8.27E-02	8.92E-02	7.99E-02	8.67E-02	8.18E-02	8.33E-02	8.68E-02
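
The diagonal question from the comments can be checked directly against the table. The sketch below copies the table's values verbatim (rows are a, columns are b, and the last row is the unpatched "Normal" baseline) and compares the diagonal with the baseline. The variable names are my own, and the code makes no assumption about how the numbers were produced.

```python
# Table values copied from above: one row per a, one column per b.
RAW = """
0 9.03E-02 5.75E+00 5.18E+00 1.29E+01 1.28E+01 1.14E+01 4.27E+00 8.05E+00 3.32E+00 6.45E+00
1 5.24E+00 9.36E-02 3.07E+00 9.94E+00 1.33E+00 6.92E+00 5.27E+00 1.23E+00 2.47E+00 5.08E+00
2 4.00E-01 1.22E+00 8.60E-02 6.01E+00 9.32E-01 5.87E+00 5.03E+00 1.57E+00 1.10E+00 2.66E+00
3 1.92E+01 2.51E+01 1.28E+01 8.27E-02 1.63E+01 3.51E+00 2.22E+01 1.52E+01 1.57E+01 1.32E+01
4 2.56E+00 5.20E-01 2.00E+00 9.09E+00 8.92E-02 1.42E+01 3.24E+00 6.87E-01 1.58E+00 5.40E-01
5 1.99E+01 1.44E+01 1.20E+01 6.48E+00 1.38E+01 7.99E-02 7.00E+00 1.61E+01 1.04E+01 1.15E+01
6 4.80E+00 8.57E+00 5.24E+00 2.80E+01 8.20E+00 1.19E+01 8.67E-02 1.45E+01 9.23E+00 1.53E+01
7 3.73E+00 1.87E+00 4.97E+00 4.70E+00 2.51E+00 7.28E+00 1.07E+01 8.18E-02 4.50E+00 4.12E+00
8 5.87E+00 1.42E+00 1.06E+00 2.83E+00 1.85E+00 6.82E+00 5.89E-01 3.40E+00 8.33E-02 7.98E-01
9 2.01E+00 5.93E+00 3.14E+00 2.39E+00 3.98E+00 7.03E+00 2.51E+00 1.28E+00 2.11E+00 8.68E-02
Normal 9.03E-02 9.36E-02 8.60E-02 8.27E-02 8.92E-02 7.99E-02 8.67E-02 8.18E-02 8.33E-02 8.68E-02
"""

# Parse each line into a label and its ten numeric entries.
table = {}
for line in RAW.strip().splitlines():
    label, *values = line.split()
    table[label] = [float(v) for v in values]

baseline = table.pop("Normal")

# Entry (a, a) is the patch with a zero vector, so it should match the baseline.
diagonal = [table[str(d)][d] for d in range(10)]
diagonal_matches = diagonal == baseline

# Smallest off-diagonal entry, with the (a, b) pair it belongs to.
smallest, a_min, b_min = min(
    (table[str(a)][b], a, b) for a in range(10) for b in range(10) if a != b
)

print(diagonal_matches, smallest, (a_min, b_min))
```

For the values as listed, the diagonal equals the "Normal" row entry for entry, and the smallest off-diagonal value is 4.00E-01 (a=2, b=0), which is still several times larger than any baseline entry.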

Abstract

Methodology^[1]

Results

Conclusion
