(Status: rough writeup of an experiment I ran today that I found somewhat interesting. There is more to investigate here regarding how RLHF affects these concept representations.)

This post presents the results of some experiments I ran to:

  • Extract representation vectors of high-level concepts from models
  • Compare the representations extracted from a base model (Llama 2 7B) and a chat model trained using RLHF (Llama 2 7B Chat)
  • Compare the representations between different layers of the same model

Code for the experiments, more plots, and the datasets are available here.

To extract the representation vectors, I apply the technique described in my previous posts on activation steering to modulate sycophancy[1]. Namely, I take a dataset of multiple-choice questions related to a behavior and, for each question, do forward passes with two contrastive examples: one where the model selects the answer corresponding to the behavior in question and one where it doesn't. I then take the mean difference between the two sets of residual stream activations at a given layer, read at the token position of the answer, where the two examples differ.
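As a rough illustration of the mean-difference step, here is a minimal NumPy sketch. It assumes the activations have already been collected into two arrays, one per side of each contrastive pair. The function name `mean_difference_vector`, the array shapes, and the synthetic data are illustrative assumptions, not the names or data used in the experiment code.

```python
import numpy as np

def mean_difference_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean of (positive - negative) activations over contrastive pairs.

    pos_acts, neg_acts: shape (n_questions, d_model). Each row is the
    residual stream activation at one layer, read at the answer-token
    position, for the example that does / does not show the behavior.
    """
    if pos_acts.shape != neg_acts.shape:
        raise ValueError("contrastive activations must have matching shapes")
    return (pos_acts - neg_acts).mean(axis=0)

# Synthetic stand-in data (hypothetical): 200 questions, 16-dim activations.
rng = np.random.default_rng(0)
direction = np.zeros(16)
direction[3] = 1.0                     # planted "behavior" direction
shared = rng.normal(size=(200, 16))    # content common to both examples
pos_acts = shared + 0.5 * direction + 0.1 * rng.normal(size=(200, 16))
neg_acts = shared - 0.5 * direction + 0.1 * rng.normal(size=(200, 16))

vec = mean_difference_vector(pos_acts, neg_acts)
```

Because the question content is shared by both examples in a pair, it cancels in the difference, and averaging over questions suppresses the remaining per-question noise.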

Besides sycophancy, which I previously investigated, I also use other behavioral datasets such as agreeableness, survival instinct, and power-seeking. Multiple-choice questions for these behaviors are obtained from Anthropic's model-written-evals datasets, available on Hugging Face.

Observation 1: Similarity between representation vectors from chat and base model shows double descent

At first, the cosine similarity declines from near 1 to roughly halfway towards its minimum. Then, for some behaviors, it climbs back up to ~0.9 around layer 11.
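The per-layer comparison can be sketched as follows. This is an illustration, assuming the vectors for each model are stacked into an array with one row per layer. `per_layer_cosine` is a hypothetical helper and the tiny arrays are made-up values, not experimental results.

```python
import numpy as np

def per_layer_cosine(base_vecs: np.ndarray, chat_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows (layers) of two arrays.

    base_vecs, chat_vecs: shape (n_layers, d_model).
    Returns an array of shape (n_layers,).
    """
    dots = np.sum(base_vecs * chat_vecs, axis=1)
    norms = np.linalg.norm(base_vecs, axis=1) * np.linalg.norm(chat_vecs, axis=1)
    return dots / norms

# Made-up 2-dim vectors for three "layers".
base = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
chat = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]])

sims = per_layer_cosine(base, chat)   # identical, 45 degrees apart, orthogonal
```

Cosine similarity ignores vector length, so it compares only the direction of the two representation vectors at each layer.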

The following chart is generated from a higher-quality sycophancy dataset that includes some multiple-choice questions generated by GPT-4:

[Figure: per-layer cosine similarity between sycophancy vectors from the chat and base models]

PCA of the generated vectors also shows the representations diverge around layer 11:

2D PCA projection of agreeableness representation vectors extracted from Llama 2 7B Chat and base models
2D PCA projection of myopia representation vectors extracted from Llama 2 7B Chat and base models
2D PCA projection of sycophancy representation vectors extracted from Llama 2 7B Chat and base models
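A 2D projection like the ones above can be sketched with plain NumPy. This is an illustration under assumptions: `pca_2d` is a hypothetical helper, and "normalizing" is taken here to mean standardizing each dimension to zero mean and unit variance before PCA, which may differ from what the experiment code does. The input is random stand-in data.

```python
import numpy as np

def pca_2d(vectors: np.ndarray, standardize: bool = True) -> np.ndarray:
    """Project rows of `vectors` onto their first two principal components.

    vectors: shape (n_vectors, d_model), e.g. one row per (model, layer).
    standardize: if True, scale each dimension to unit variance first
    (an assumed reading of "normalized PCA").
    """
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                      # center each dimension
    if standardize:
        std = X.std(axis=0)
        X = X / np.where(std > 0, std, 1.0)     # avoid dividing by zero
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

# Stand-in data (hypothetical): 32 layers x 2 models, 16-dim vectors.
rng = np.random.default_rng(0)
stacked = rng.normal(size=(64, 16))
proj = pca_2d(stacked)                          # shape (64, 2)
```

Fitting one PCA on the stacked vectors from both models puts them in the same 2D coordinates, so the point at which they diverge can be read off a single plot.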

Observation 2: Vectors vary more smoothly after around layer 11 for all behaviors/models

I hypothesize that once the model extracts the high-level information needed to describe an abstract concept, the representation "converges" and remains more consistent across subsequent layers.

Heatmap of cosine similarity of "myopia" representation vector extracted from different layers of Llama 2 7B base model. We can see nearby layers are more similar in the last 2/3 of the model but not in the first 1/3.
Heatmap of cosine similarity of "self-awareness" representation vector extracted from different layers of Llama 2 7B Chat model. This shows a similar trend to the other representation vectors / base model.
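The matrix behind a heatmap like these can be sketched as below. `layer_similarity_matrix` is a hypothetical helper, and the data is synthetic: four unrelated "early" vectors followed by eight "late" vectors that drift slowly. It is constructed only to mimic the described pattern and is not a result.

```python
import numpy as np

def layer_similarity_matrix(layer_vecs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between vectors from different layers.

    layer_vecs: shape (n_layers, d_model). Returns (n_layers, n_layers).
    """
    V = np.asarray(layer_vecs, dtype=float)
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
early = rng.normal(size=(4, 64))                 # unrelated directions
steps = 0.1 * rng.normal(size=(8, 64))           # small layer-to-layer changes
late = rng.normal(size=64) + np.cumsum(steps, axis=0)
layer_vecs = np.vstack([early, late])            # 12 "layers" in total

sim = layer_similarity_matrix(layer_vecs)
adjacent = np.diag(sim, k=1)                     # similarity of layer i and i+1
```

The first off-diagonal (`adjacent`) is a compact summary of how smoothly the vector changes from one layer to the next. The full matrix could be rendered with, for example, `matplotlib.pyplot.imshow(sim)`.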

Observation 3: PCA projections of vectors generated from different layers of the same model often look like a U

 

Vectors from layers <8 project to around the same point. The remaining projected vectors follow a rough inverted U-shape, with a peak around layer 11 or 12.

It'd be interesting to investigate what these principal components actually correspond to in activation space.

(edit: replaced PCA plots with normalized PCA as pointed out by Teun van der Weij)

[1] Similar techniques were applied by Annah Dombrowski and Shashwat Goel in their project to evaluate hidden directions on the utility dataset and by Nick Gabrieli and Julien Schulz in their SERI MATS project. See also this recent paper on "Representation Engineering" and Alex Turner's work on Activation Additions, which inspired my activation vector extraction approach.

5 comments

Cool work.

I was briefly looking at your code, and it seems like you did not normalize the activations when using PCA. Am I correct? If so, do you expect that to have a significant effect?

Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!

Is there reason to think the "double descent" seen in observation 1 relates to the traditional "double descent" phenomenon?

My initial guess is no.

No connection with this

The PCA image links are broken.