Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I'm not planning on continuing.
For a primer on how the tuned lens works, see here. In short, we train linear translators from the hidden states at layer l to the hidden states at the last layer, then view the network as iteratively refining its predictions layer by layer.
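To make that concrete, here is a minimal sketch of the lens machinery in PyTorch. This is not the tuned-lens library's actual API; the translator below stands in for a trained lens at a single layer (initialized to the identity rather than trained), and the layer index and prompt are illustrative.

```python
# Minimal sketch of the tuned-lens idea (not the library's real API).
# The translator is initialized to the identity here; in practice it is
# trained so its decoded logits match the model's final-layer logits.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model.eval()

layer = 6                       # illustrative choice of layer l
d_model = model.config.n_embd

# Linear translator from layer-l hidden states to last-layer hidden states.
translator = torch.nn.Linear(d_model, d_model)
torch.nn.init.eye_(translator.weight)
torch.nn.init.zeros_(translator.bias)

tokens = tokenizer("I hate you because", return_tensors="pt")
with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)
    h_l = out.hidden_states[layer]   # residual stream after block l
    # Decode the translated hidden states through the final layer norm
    # and unembedding to get lens predictions at layer l.
    logits = model.lm_head(model.transformer.ln_f(translator(h_l)))
```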
In the context of GPT2-XL Steering Vectors, tuned lens can be used to gain insight into how steering is changing model predictions. For example, take the following steering vector:
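Roughly, these steering vectors are built by subtracting the activations of two prompts at some layer and adding the scaled difference back into the residual stream. Here is a minimal sketch of that construction, reusing `model` and `tokenizer` from above; the prompt pair "Love"/"Hate", layer 6, and coefficient 5 are illustrative assumptions, not necessarily the exact settings behind the plots below.

```python
# Sketch of an activation-addition steering vector (illustrative
# prompts/layer/coefficient, not the post's exact settings).
import torch

def resid_in(prompt: str, layer: int) -> torch.Tensor:
    """Residual stream entering block `layer` while running `prompt`."""
    cache = {}
    def hook(module, inputs):
        cache["resid"] = inputs[0].detach()
    handle = model.transformer.h[layer].register_forward_pre_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["resid"]

layer, coeff = 6, 5.0
a, b = resid_in("Love", layer), resid_in("Hate", layer)
n = min(a.shape[1], b.shape[1])           # align prompt lengths
steering_vec = coeff * (a[:, :n] - b[:, :n])

def steer(module, inputs):
    resid = inputs[0].clone()
    m = min(n, resid.shape[1])
    resid[:, :m] += steering_vec[:, :m]   # add at the front positions
    return (resid,) + inputs[1:]

handle = model.transformer.h[layer].register_forward_pre_hook(steer)
# ... run the steered model here, then: handle.remove()
```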
Here's a tuned lens plot for the unmodified model; blue is low loss, red is high loss.
You can see how the token "wonderful" is very surprising to the unsteered model, which instead expects negative completions. The steered model, however, does significantly better on the same token.
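The loss plotted at each position is just the lens's cross-entropy on the next token. A sketch of that computation, reusing the names from the snippets above:

```python
# Per-token lens loss, the quantity the plots color (sketch; reuses
# `model`, `tokenizer`, `translator`, and `layer` from above).
import torch
import torch.nn.functional as F

tokens = tokenizer("I hate you because you're a wonderful person",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)
    h = translator(out.hidden_states[layer])
    logits = model.lm_head(model.transformer.ln_f(h))

# The loss at position i is the negative log-probability of token i+1,
# so the entry for "wonderful" measures how surprising that token was.
losses = F.cross_entropy(logits[0, :-1],
                         tokens["input_ids"][0, 1:],
                         reduction="none")
```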
A few other things are interesting to note:
bos - bos = 0: both prompts used to build the steering vector begin with the same BOS token, so the vector is zero at that position and the first position's predictions are unchanged.
Now let's look at a few other results from the post.
Prompt for lens: I hate you because you're a wonderful person
Prompt for lens: Barack Obama was born in a secret CIA prison
Prompt for lens: I think you're a cunt