(Status: rough writeup of an experiment I did today that I thought was somewhat interesting; there is more to investigate here regarding how RLHF affects these concept representations.)
This post presents the results of some experiments I ran to extract activation vectors for different behaviors and compare how they evolve across the layers of a language model.
Code for the experiments + more plots + datasets available here.
To extract the representation vectors, I apply the technique described in my previous posts on activation steering to modulate sycophancy. Namely, I take a dataset of multiple-choice questions related to a behavior and, for each question, do forward passes on two contrastive examples: one where the model selects the answer corresponding to the behavior in question and one where it selects the other answer. I then take the mean difference in residual-stream activations at a given layer, at the token position corresponding to the differing answers.
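The mean-difference step can be sketched as follows. This is a toy illustration, not the actual extraction code: the activations here are synthetic, and `mean_difference_vector` is a hypothetical helper standing in for collecting real residual-stream activations at the answer token.

```python
import numpy as np

def mean_difference_vector(pos_activations, neg_activations):
    """Contrastive mean-difference vector for one layer.

    pos_activations / neg_activations: arrays of shape (n_questions, d_model)
    holding the residual-stream activation at the answer-token position for the
    behavior-matching and non-matching completions respectively.
    """
    pos = np.asarray(pos_activations)
    neg = np.asarray(neg_activations)
    return (pos - neg).mean(axis=0)

# Synthetic demonstration: if the behavior-matching completions share a common
# "behavior direction" on top of per-question noise, the mean difference
# should recover that direction.
rng = np.random.default_rng(0)
d_model, n_questions = 64, 200
behavior_dir = rng.normal(size=d_model)
base = rng.normal(size=(n_questions, d_model))
pos = base + behavior_dir + 0.1 * rng.normal(size=(n_questions, d_model))
neg = base + 0.1 * rng.normal(size=(n_questions, d_model))

vec = mean_difference_vector(pos, neg)
cos = float(vec @ behavior_dir / (np.linalg.norm(vec) * np.linalg.norm(behavior_dir)))
print(round(cos, 3))  # cosine similarity with the planted direction is close to 1
```

Averaging the per-question differences cancels the question-specific content (the `base` term), leaving the shared behavior direction.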
Besides sycophancy, which I previously investigated, I also use datasets for other behaviors such as agreeableness, survival instinct, and power-seeking. Multiple-choice questions for these behaviors come from Anthropic's model-written-evals datasets, available on Hugging Face.
At first, the similarity declines from very similar (cosine similarity near 1) to about halfway toward the minimum; then, for some behaviors, it climbs back up to ~0.9 around layer 11.
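The per-layer similarity curve behind this observation can be computed as below. This is a minimal sketch on made-up data: `vecs_a` and `vecs_b` stand in for the per-layer vectors of two behaviors (shape `(n_layers, d_model)`), which in the real experiment would come from the extraction step above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in per-layer vectors for two behaviors; vecs_b is a noisy copy of
# vecs_a so the similarities come out high but below 1.
rng = np.random.default_rng(1)
n_layers, d_model = 32, 64
vecs_a = rng.normal(size=(n_layers, d_model))
vecs_b = vecs_a + 0.5 * rng.normal(size=(n_layers, d_model))

# One cosine similarity per layer: this is the quantity plotted against
# layer index in the charts.
per_layer_sim = [cosine_similarity(vecs_a[l], vecs_b[l]) for l in range(n_layers)]
```

Plotting `per_layer_sim` against the layer index gives the similarity-vs-layer curves discussed here.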
The following chart is generated from a higher-quality sycophancy dataset that includes some multiple-choice questions generated by GPT-4:
PCA of the generated vectors also shows the representations diverge around layer 11:
I hypothesize that once the model extracts the high-level information needed to describe an abstract concept, the representation "converges" and remains more consistent across subsequent layers.
Vectors from layers <8 project to roughly the same point. The remaining projected vectors follow a rough inverted-U shape, peaking around layer 11 or 12.
It'd be interesting to investigate what these principal components actually correspond to in activation space.
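For reference, a normalized-PCA projection of the per-layer vectors can be sketched as below. This is an illustration on random data, and it assumes "normalized" means scaling each vector to unit length before PCA (standardizing per-dimension would be another option); the PCA itself is done directly via SVD rather than with a library class.

```python
import numpy as np

def normalized_pca_projection(vectors, n_components=2):
    """Project a set of vectors onto their top principal components,
    after normalizing each vector to unit length."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize each vector
    X = X - X.mean(axis=0)                            # center before PCA
    # Rows of Vt are the principal axes, ordered by singular value.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Stand-in data: one vector per layer of a 32-layer model.
rng = np.random.default_rng(2)
layer_vectors = rng.normal(size=(32, 64))
proj = normalized_pca_projection(layer_vectors)
print(proj.shape)  # one 2-D point per layer
```

Each row of `proj` is the 2-D point for one layer's vector, which is what the PCA plots show.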
(edit: replaced PCA plots with normalized PCA as pointed out by Teun van der Weij)
Similar techniques were applied by Annah Dombrowski and Shashwat Goel in their project to evaluate hidden directions on the utility dataset and by Nick Gabrieli and Julien Schulz in their SERI MATS project. See also this recent paper on "Representation Engineering" and Alex Turner's work on Activation Additions, which inspired my activation vector extraction approach.
I was briefly looking at your code, and it seems like you did not normalize the activations when using PCA. Am I correct? If so, do you expect that to have a significant effect?
Ah, yes, good spot. I meant to do this but somehow missed it. Have replaced the plots with normalized PCA. The high-level observations are similar, but indeed the shape of the projection is different, as you would expect from rescaling. Thanks for raising!
Is there reason to think the "double descent" seen in observation 1 relates to the traditional "double descent" phenomenon?
My initial guess is no.
No connection with this.