Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I'm not planning on continuing.
For a primer on how the tuned lens works, see here. In short, we train linear translators from the hidden states at layer l to the hidden states at the last layer, then view the network as iteratively refining its predictions layer by layer.
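To make that concrete, here is a minimal sketch of the lens machinery in PyTorch. This is not the tuned-lens library's actual API; the translator below stands in for a trained lens at a single layer (initialized to the identity rather than trained), and the layer index and prompt are illustrative.

```python
# Minimal sketch of the tuned-lens idea (not the library's real API).
# The translator is initialized to the identity here; in practice it is
# trained so its decoded logits match the model's final-layer logits.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model.eval()

layer = 6                       # illustrative choice of layer l
d_model = model.config.n_embd

# Linear translator from layer-l hidden states to last-layer hidden states.
translator = torch.nn.Linear(d_model, d_model)
torch.nn.init.eye_(translator.weight)
torch.nn.init.zeros_(translator.bias)

tokens = tokenizer("I hate you because", return_tensors="pt")
with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)
    h_l = out.hidden_states[layer]   # residual stream after block l
    # Decode the translated hidden states through the final layer norm
    # and unembedding to get lens predictions at layer l.
    logits = model.lm_head(model.transformer.ln_f(translator(h_l)))
```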
In the context of GPT2-XL Steering Vectors, tuned lens can be used to gain insight into how steering is changing model predictions. For example, take the following steering vector:
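Roughly, these steering vectors are built by subtracting the activations of two prompts at some layer and adding the scaled difference back into the residual stream. Here is a minimal sketch of that construction, reusing `model` and `tokenizer` from above; the prompt pair "Love"/"Hate", layer 6, and coefficient 5 are illustrative assumptions, not necessarily the exact settings behind the plots below.

```python
# Sketch of an activation-addition steering vector (illustrative
# prompts/layer/coefficient, not the post's exact settings).
import torch

def resid_in(prompt: str, layer: int) -> torch.Tensor:
    """Residual stream entering block `layer` while running `prompt`."""
    cache = {}
    def hook(module, inputs):
        cache["resid"] = inputs[0].detach()
    handle = model.transformer.h[layer].register_forward_pre_hook(hook)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return cache["resid"]

layer, coeff = 6, 5.0
a, b = resid_in("Love", layer), resid_in("Hate", layer)
n = min(a.shape[1], b.shape[1])           # align prompt lengths
steering_vec = coeff * (a[:, :n] - b[:, :n])

def steer(module, inputs):
    resid = inputs[0].clone()
    m = min(n, resid.shape[1])
    resid[:, :m] += steering_vec[:, :m]   # add at the front positions
    return (resid,) + inputs[1:]

handle = model.transformer.h[layer].register_forward_pre_hook(steer)
# ... run the steered model here, then: handle.remove()
```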
Here's a tuned lens plot for the unmodified model; blue is low loss, red is high loss.
You can see how the token "wonderful" is very surprising to the unsteered model, which instead expects negative completions. The steered model, however, does significantly better on the same token.
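The loss plotted at each position is just the lens's cross-entropy on the next token. A sketch of that computation, reusing the names from the snippets above:

```python
# Per-token lens loss, the quantity the plots color (sketch; reuses
# `model`, `tokenizer`, `translator`, and `layer` from above).
import torch
import torch.nn.functional as F

tokens = tokenizer("I hate you because you're a wonderful person",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)
    h = translator(out.hidden_states[layer])
    logits = model.lm_head(model.transformer.ln_f(h))

# The loss at position i is the negative log-probability of token i+1,
# so the entry for "wonderful" measures how surprising that token was.
losses = F.cross_entropy(logits[0, :-1],
                         tokens["input_ids"][0, 1:],
                         reduction="none")
```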
A few other things are interesting to note:
bos - bos = 0: both prompts used to build the steering vector begin with the same BOS token, so the vector is zero at that position and the first position's predictions are unchanged.
Now let's look at a few other results from the post.
Prompt for lens: I hate you because you're a wonderful person
Prompt for lens: Barack Obama was born in a secret CIA prison
Prompt for lens: I think you're a cunt