I found >800 orthogonal "write code" steering vectors
Atmyre · 19d · 10

It doesn't seem surprising that there are many orthogonal steering vectors at layer 8 that result in similar behaviour (e.g. coding). The part of the network between layers 8 and 16 can be seen as a nonlinear function that transforms the outputs of layer 8 into the outputs of layer 16. Because this function is nonlinear, it is not guaranteed to map orthogonal inputs to orthogonal outputs; in other words, there may exist many mutually orthogonal inputs that it maps to almost collinear outputs. What your algorithm for learning orthogonal steering vectors does is explicitly find (learn) such orthogonal inputs that get mapped to almost collinear outputs.
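As a toy illustration of how a nonlinearity can collapse orthogonal directions (this is just a single ReLU, not your model, and the specific numbers are made up for the example):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x1 = np.array([1.0, -100.0])
x2 = np.array([100.0, 1.0])

print(cosine(x1, x2))              # 0.0      -> the inputs are exactly orthogonal
print(cosine(relu(x1), relu(x2)))  # ~0.99995 -> the outputs are almost collinear
```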

We can also see this experimentally. Take one prompt (e.g. "How do I feed the cat?"), steer the model's layer-8 activations with the different orthogonal "coding" steering vectors you found, and save the layer-16 activations corresponding to this prompt under each steering vector. The cosine similarities between these layer-16 activations turn out to be quite high. For example, here is a scatter plot of cosine similarities between the layer-16 activations computed with steering vector number 0 and steering vector number i, for 0 ≤ i ≤ 1558: ![image_steered](https://ibb.co/0pmLScH3). Steering the layer-8 activations with different "coding" steering vectors produces highly collinear layer-16 activations, and since the layer-16 activations are collinear, they lead to similar outputs from the entire model. Note that the last steering vectors are less collinear with the first one, and experimentally we also see that the last steering vectors do not always make the model output code.
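For reference, here is a rough sketch of how this measurement can be done (the linked notebook below is the authoritative version; this assumes a HuggingFace-style model whose transformer blocks live in `model.model.layers`, that those indices match the post's layer numbering, and a tensor `steering_vectors` of shape `(n_vectors, d_model)` holding the learned vectors on the model's device and dtype):

```python
import torch
import torch.nn.functional as F

# Assumed available: `model`, `tokenizer` (HuggingFace-style), and `steering_vectors`.
prompt = "How do I feed the cat?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

def capture_layer16_with_steering(vec):
    """Add `vec` to the residual stream after layer 8 and return the layer-16 output."""
    captured = {}

    def steer_hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        hidden = hidden + vec  # steer every token position at layer 8
        return ((hidden,) + out[1:]) if isinstance(out, tuple) else hidden

    def capture_hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        captured["h16"] = hidden[0, -1].detach()  # last-token activation at layer 16

    h1 = model.model.layers[8].register_forward_hook(steer_hook)
    h2 = model.model.layers[16].register_forward_hook(capture_hook)
    with torch.no_grad():
        model(**inputs)
    h1.remove(); h2.remove()
    return captured["h16"]

acts = [capture_layer16_with_steering(v) for v in steering_vectors]
sims = [F.cosine_similarity(acts[0], a, dim=0).item() for a in acts]
```

Here the steering vector is added at every token position; whether to steer all positions or only some is a detail the sketch glosses over.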

Here is also a scatter plot of cosine similarities between layer-16 activations computed using randomly initialized steering vectors: ![image_steered](https://ibb.co/rRpdPKKM). The cosine similarities between the layer-16 activations are quite low here, suggesting that the high cosine similarities in the case of the "coding" steering vectors were not a coincidence.
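The random baseline can be produced by reusing the same function with freshly sampled vectors (norm-matching them to the learned vectors is my assumption about what makes a fair comparison; the notebook below is what was actually run):

```python
# Control: random vectors, norm-matched to the learned steering vectors (assumption).
rand_vectors = torch.randn_like(steering_vectors)
rand_vectors = (rand_vectors / rand_vectors.norm(dim=-1, keepdim=True)
                * steering_vectors.norm(dim=-1, keepdim=True))

rand_acts = [capture_layer16_with_steering(v) for v in rand_vectors]
rand_sims = [F.cosine_similarity(rand_acts[0], a, dim=0).item() for a in rand_acts]
```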

These experiments and plots were done in this Jupyter notebook, based on your GitHub code: https://colab.research.google.com/drive/11F1VnEbKqgH-4aTGz8Zxi40qFKb1wEDq?usp=sharing
