(Status: rough writeup of an experiment I did today that I thought was somewhat interesting - there is more to investigate here regarding how RLHF affects these concept representations)

This post presents the results of some experiments I ran to:

Extract representation vectors of high-level concepts from models

Compare the representations extracted from a base model (Llama 2 7B) and chat model trained using RLHF (Llama 2 7B Chat)

Compare the representations between different layers of the same model

Code for the experiments + more plots + datasets available here.

To extract the representation vectors, I apply the technique described in my previous posts on activation steering to modulate sycophancy^{[1]}. Namely, I take a dataset of multiple-choice questions related to a behavior and, for each question, do forward passes with two contrastive examples - one where the model selects the answer corresponding to the behavior in question and one where it doesn't. I then take the mean difference in residual stream activations at some layer, at the token position corresponding to the differing answers.
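The mean-difference step can be sketched as follows, assuming the residual-stream activations at the answer-token position have already been captured (e.g. via forward hooks on the chosen layer). The function name, array shapes, and synthetic data are illustrative, not the exact code from the repo:

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean difference between residual-stream activations at the answer
    token for behavior-matching (pos) and non-matching (neg) completions.

    pos_acts, neg_acts: (n_questions, d_model) arrays of activations
    captured at one fixed layer, at the answer-token position.
    """
    return (pos_acts - neg_acts).mean(axis=0)

# Toy illustration with random activations (d_model = 4096 for Llama 2 7B).
rng = np.random.default_rng(0)
pos = rng.normal(size=(100, 4096))
neg = rng.normal(size=(100, 4096))
vec = steering_vector(pos, neg)
```

Repeating this at every layer yields one vector per layer per model, which is what the cross-layer and cross-model comparisons below operate on.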

Besides sycophancy, which I previously investigated, I also use other behavioral datasets such as agreeableness, survival instinct, and power-seeking. Multiple-choice questions for these behaviors are obtained from Anthropic's model-written-evals datasets, available on huggingface.

## Observation 1: Similarity between representation vectors from chat and base model shows double descent

In early layers the vectors are very similar (cosine similarity near 1); the similarity then declines to roughly halfway towards its minimum before, for some behaviors, climbing back up to ~0.9 around layer 11.
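The per-layer comparison is straightforward once the vectors are extracted; a minimal sketch, assuming one vector per layer for each model (names and shapes are mine, not the repo's):

```python
import numpy as np

def layerwise_cosine(base_vecs: np.ndarray, chat_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity, per layer, between the steering vectors
    extracted from the base and chat models.

    base_vecs, chat_vecs: (n_layers, d_model) arrays.
    """
    dots = (base_vecs * chat_vecs).sum(axis=1)
    norms = np.linalg.norm(base_vecs, axis=1) * np.linalg.norm(chat_vecs, axis=1)
    return dots / norms

# Identical inputs give similarity 1 at every layer; the interesting case
# is plotting this curve over layers for the real base/chat vector pairs.
v = np.random.default_rng(1).normal(size=(32, 4096))
sims = layerwise_cosine(v, v)
```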

The following chart is generated from a higher-quality sycophancy dataset that includes some multiple-choice questions generated by GPT-4:

PCA of the generated vectors also shows the representations diverge around layer 11:

## Observation 2: Vectors vary more smoothly after around layer 11 for all behaviors/models

I hypothesize that once the model extracts the high-level information needed to describe an abstract concept, the representation "converges" and remains more consistent across subsequent layers.
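One way to quantify this smoothness is the cosine similarity between vectors extracted at consecutive layers of the same model; values near 1 beyond some layer would support the convergence hypothesis. A sketch under the same illustrative assumptions as above:

```python
import numpy as np

def adjacent_layer_similarity(vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between steering vectors at consecutive layers.

    vecs: (n_layers, d_model) array, one vector per layer.
    Returns an (n_layers - 1,) array; entry i compares layers i and i+1.
    """
    a, b = vecs[:-1], vecs[1:]
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den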

## Observation 3: PCA projections of vectors generated from different layers of the same model often look like a U

Vectors from layers <8 project to around the same point. The remaining projected vectors follow a rough inverted U-shape, with a peak around layer 11 or 12.

It'd be interesting to investigate what these principal components actually correspond to in activation space.
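These projections can be produced as sketched below, assuming one extracted vector per layer; each vector is normalized to unit length before PCA (the normalization discussed in the comments). The function name and shapes are illustrative:

```python
import numpy as np

def pca_project(layer_vecs: np.ndarray, k: int = 2) -> np.ndarray:
    """Project per-layer steering vectors onto their top-k principal
    components, normalizing each vector to unit norm first.

    layer_vecs: (n_layers, d_model) array.
    """
    X = layer_vecs / np.linalg.norm(layer_vecs, axis=1, keepdims=True)
    X = X - X.mean(axis=0)  # center before extracting components
    # Rows of Vt are the principal directions of the centered matrix.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T
```

Scattering the two returned components against each other, colored by layer index, gives the U-shaped plots described above.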

I was briefly looking at your code, and it seems like you did not normalize the activations when using PCA. Am I correct? If so, do you expect that to have a significant effect?

Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!


(edit: replaced PCA plots with normalized PCA as pointed out by Teun van der Weij)

^{[1]} Similar techniques were applied by Annah Dombrowski and Shashwat Goel in their project to evaluate hidden directions on the utility dataset, and by Nick Gabrieli and Julien Schulz in their SERI MATS project. See also this recent paper on "Representation Engineering" and Alex Turner's work on Activation Additions, which inspired my activation vector extraction approach.