A Mystery About High Dimensional Concept Encoding

Shauli Ravfogel (3y)

Hey, I’m the first author of INLP and RLACE. The observation you point out was highly surprising to us as well. The RLACE paper started as an attempt to prove the optimality of INLP, which turned out not to hold for classification problems. For classification, it’s just not true that the subspace that does not encode a concept is the orthogonal complement of the subspace that encodes the concept most saliently (the subspace spanned by the classifier’s parameter vector). In RLACE we do prove that this property holds for certain objectives: e.g., if you want to find the subspace whose removal decreases explained variance (PCA) or correlation (CCA) the most, it’s exactly the subspace that maximizes explained variance or correlation. But that’s empirically not true for the logistic loss.

Sidenote: Follow-up work to INLP and RLACE [1] pointed out that those methods are based on correlation and are not causal in nature. We had already acknowledged this in other works within the "concept erasure" line [2], but they provide a theoretical analysis and also show specific constructions where the subspace found by INLP is substantially different from the "true" subspace (under the generative model). Some of the comments here refer to that. It's certainly true that such constructions exist, although the methods are effective in practice, and there are ways to try to quantify the robustness of the captured subspace (see e.g. the experiments in [2]). At any rate, I don’t think this has any direct connection to the discussed phenomenon (note that INLP needs to remove a large number of dimensions for the training set too, not just for the held-out data). See also this preprint for a recent discussion of the subtle limitations of linear concept identification.

[1] Kumar, Abhinav, Chenhao Tan, and Amit Sharma. "Probing Classifiers are Unreliable for Concept Removal and Detection." arXiv preprint arXiv:2207.04153 (2022).

[2] Elazar, Yanai, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. "Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals." Transactions of the Association for Computational Linguistics 9 (2021): 160-175.

Zach Furman (3y)

No substantive reply, but I do want to thank you for commenting here - original authors publicly responding to analysis of their work is something I find really high value in general. Especially academics that are outside the usual LW/AF sphere, which I would guess you are given your account age.

johnswentworth (3y)

> A natural thing to do is to have a bunch of sentences about guys and girls, train a linear classifier to predict if the sentence is about guys or girls, and use the direction the linear classifier gives you as the “direction corresponding to the concept of gender”.

My knee-jerk response to this is You Are Not Measuring What You Think You Are Measuring; I would not expect this to find a robust representation of the "concept of gender" in the network, even in situations where the network does have a clearly delineated internal representation of the concept of gender. There are all sorts of ways a linear classifier might fail to find the intended concept (even assuming that gender is represented by a direction in activation-space at all). My modal guess would be that your linear classifier mostly measured outliers/borderline labels.

Fabien Roger (3y)

I agree that using a linear classifier to find a concept can go terribly wrong, and I think that this post shows that it does go wrong. But I think that how it goes wrong can be informative (I hope I did not fail too badly to apply the second law of experiment design!).

Here, the classifier is able to classify labels almost perfectly, so it's not learning only about outliers. But what is measured is a correlation between activations and labels, not a causal explanation of how the model uses activations, so it doesn't mean that the classifier found "a concept of gender" the model actually uses. And indeed, if you completely remove the direction found by the classifier, the model is still able to "use" the concept of gender: its behavior has barely changed.

RLACE has a much more significant impact on model behavior, which seems somewhat related to gender, but I wouldn't bet it has found "the concept of gender", for the same reasons as above.

Still, I think that all of this is not completely useless to understand what's happening in the network (for example, the network is using large features, not "crisp" ones), and is mildly informative for future experiment design.

johnswentworth (3y)

> Here, the classifier is able to classify labels almost perfectly, so it's not learning only about outliers.

If there's one near-perfect separating hyperplane, then there are usually lots of near-perfect separating hyperplanes; which one the linear classifier chooses is determined mostly by outliers/borderline cases. That's what I mean when I say it's mostly measuring outliers.
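
A toy illustration of the "lots of separating hyperplanes" point, as a sketch only. The data here is synthetic and made up for the example (two classes with a wide gap along one axis plus bounded noise), not real activations, and the candidate directions are chosen by hand rather than fitted by any classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200, 5

# Hypothetical data: the classes differ along axis 0 with a wide gap;
# every other coordinate is bounded noise in [-0.5, 0.5].
labels = rng.choice([-1, 1], size=n)
X = rng.uniform(-0.5, 0.5, size=(n, dim))
X[:, 0] = labels * rng.uniform(2.0, 3.0, size=n)

def accuracy(direction):
    """Accuracy of the classifier sign(x . direction)."""
    return np.mean(np.sign(X @ direction) == labels)

# Candidate directions, rotated in the (axis 0, axis 1) plane.
angles = np.deg2rad([-30, -15, 0, 15, 30])
directions = [np.array([np.cos(a), np.sin(a), 0, 0, 0]) for a in angles]
accuracies = [accuracy(d) for d in directions]

# Cosine similarity between the two most extreme candidate directions.
cos_extremes = directions[0] @ directions[-1]
print(accuracies, cos_extremes)
```

Every candidate separates the classes perfectly even though the two extreme directions are 60 degrees apart, so accuracy alone does not single out one direction. Which direction a fitted classifier would pick on such data depends on the points closest to the boundary.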

Neel Nanda (3y)

Interesting results, thanks for sharing! To clarify, what exactly are you doing after identifying a direction vector? Projecting and setting its coordinate to zero? Actively reversing it?

And how do these results compare to the dumb approach of just taking the gradient of the logit difference at that layer, and using that as your direction?

Some ad-hoc hypotheses for what might be going on:

- An underlying thing is probably that the model is representing several correlated features - is_woman, is_wearing_a_dress, has_long_hair, etc. Even if you can properly isolate the is_woman direction, just deleting this may not matter that much, esp if the answer is obvious?
  - I'm not sure which method this will harm more though, since presumably they'll pick up on a direction that's the average of all features weighted by their correlation with is_woman, ish.
- IMO, a better metric here is the difference in the she vs he logits, rather than just the accuracy - that may be a better way of picking up on whether you've found a meaningful direction?
- GPT-2 is trained with dropout on the residual stream, which may fuck with things? It's presumably learned at least some redundancy to cope with ablations.
  - To test this, try replicating on GPT-Neo 125M? That doesn't have dropout.
- Gender is probably just a pretty obvious thing, that's fairly overdetermined, and breaking an overdetermined thing by removing a particular concept is hard.
- I have a fuzzy intuition that it's easier to break models than to insert/edit a concept? I'd weakly predict that your second method will damage model performance in general more, even on non-gender-y tasks. Maybe it's learned to shove the model off distribution in a certain way, or exploits some feature of how the model is doing superposition where it adds in two features the model doesn't expect to see at the same time, which fucks with things? Idk, these are all random ad-hoc guesses.
  - Another way of phrasing this - I'd weakly predict that for any set of, say, 10 prompts, even with no real connection, you could learn a direction that fucks with all of them, just because it's not that hard to break a model with access to its internal activations (if this is false, I'd love to know!). Clearly, that direction isn't going to have any conceptual meaning!
- One concrete experiment to test the above - what happens if you reverse the direction, rather than removing it? (I.e., take v - 2(v . gender_dir) gender_dir, not v - (v . gender_dir) gender_dir.) If RLACE can significantly reverse accuracy to favour she over he incorrectly, that feels interesting to me!
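
The two interventions in the last hypothesis can be written down in a few lines. This is a minimal sketch, assuming `gender_dir` is normalized to unit length before use; the function names are made up for illustration, and the activations and direction are random stand-ins, not outputs of INLP or RLACE.

```python
import numpy as np

def remove_direction(v, d):
    """Project out d: v - (v . d) d, with d normalized first."""
    d = d / np.linalg.norm(d)
    return v - (v @ d)[..., None] * d

def reverse_direction(v, d):
    """Reflect across the hyperplane orthogonal to d: v - 2 (v . d) d."""
    d = d / np.linalg.norm(d)
    return v - 2 * (v @ d)[..., None] * d

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))     # stand-in for a batch of activations
gender_dir = rng.normal(size=8)    # stand-in for a learned direction
unit = gender_dir / np.linalg.norm(gender_dir)

removed = remove_direction(acts, gender_dir)
reversed_ = reverse_direction(acts, gender_dir)

print(removed @ unit)                     # components along the direction are ~0
print(reversed_ @ unit, -(acts @ unit))   # components along the direction flip sign
```

Removal zeroes the component along the direction, while reversal flips its sign and leaves the norm of each activation unchanged.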

Shauli Ravfogel (3y)

Actually, RLACE has a lesser impact on the representation space, since it removes just a rank-1 subspace.
Note that if we train a linear classifier w to convergence (as done in the first iteration of INLP), then by definition we can project the entire representation space onto the direction w and retain the very same accuracy, because the subspace spanned by w is the only thing the linear classifier is "looking at". We performed experiments similar in spirit to what you suggest with INLP in [this](https://arxiv.org/pdf/2105.06965.pdf) paper. In the attached image you can see the effect of a positive/negative intervention across layers:
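
The claim that a linear classifier only "looks at" span(w) can be checked numerically. This is a sketch with random stand-ins for the representations, weights, and bias (nothing here is trained); it only verifies the algebraic identity for a fixed linear classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))   # stand-in representations
w = rng.normal(size=16)          # stand-in for the classifier's weight vector
b = 0.3                          # stand-in bias

w_hat = w / np.linalg.norm(w)
P_w = np.outer(w_hat, w_hat)     # rank-1 orthogonal projection onto span(w)

logits = X @ w + b
logits_on_w = (X @ P_w) @ w + b                       # keep only span(w)
logits_without_w = (X @ (np.eye(16) - P_w)) @ w + b   # remove span(w)

print(np.allclose(logits, logits_on_w))
print(np.allclose(logits_without_w, b))
```

Keeping only the component along w leaves every logit unchanged, and removing it collapses every logit to the bias, so this particular classifier becomes uninformative. That says nothing about whether a freshly trained classifier could still recover the label from the remaining dimensions, which is why INLP iterates.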

Fabien Roger (3y)

In the experiments I ran with GPT-2, RLACE and INLP are both used with a rank-1 projection. So RLACE could have "more impact" if it removed a more important direction, which I think it does.

I know it's not the intended use of INLP, but I got my inspiration from this technique, and that's why I write INLP (Ravfogel, 2020). (The original technique removes multiple directions to obtain a measurable effect.)

[Edit] Tell me if you prefer that I avoid calling the "linear classifier method" INLP (it isn't actually iterated in the experiments I ran, but it is where I discovered the idea of using a linear classifier to project data to remove information)!

Fabien Roger (3y)

Thank you for the feedback! I ran some of the experiments you suggested and added them to the appendix of the post.

While I was running some of the experiments, I realized I had made a big mistake in my analysis: in fact, the direction which matters the most (found by RLACE) is the one with large changes (and not the one with crisp changes)! (I've edited the post to correct that mistake.)

What I'm actually doing is an affine projection: $v \leftarrow v - (d \cdot v)\, d + m$, where $v$ is the activation, $d$ is the direction (normalized), and $m$ is "the median in the direction of $d$": $m = \operatorname{median}_{v}(d \cdot v)\, d$.
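
A minimal sketch of such an affine projection, assuming it has the form v - (d . v) d + m with m equal to the median of d . v times d. The function name is made up for illustration, the activations are random stand-ins, and taking the median over the same batch that is being edited is an assumption of this sketch (it could equally be computed once on a reference dataset).

```python
import numpy as np

def affine_project(acts, d):
    """Replace each activation's component along d by the median component."""
    d = d / np.linalg.norm(d)
    coords = acts @ d                  # (n,) components along d
    m = np.median(coords) * d          # "the median in the direction of d"
    return acts - coords[:, None] * d + m

rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 8)) + 3.0  # stand-in activations with a nonzero mean
d = rng.normal(size=8)                 # stand-in for the learned direction
unit = d / np.linalg.norm(d)

edited = affine_project(acts, d)
print(edited @ unit)                   # every entry equals the median component
print(np.median(acts @ unit))
```

Compared with a plain linear projection, which sets the component along d to zero, this moves every activation to a typical value along d, which keeps the edited activations closer to the original distribution when that distribution is not centered at zero.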

Looking at the gradient might be a good idea, I haven't tried it.

About your hypotheses:

- Definitely something like that is going on, though I don't think I capture most of the highly correlated features you might want to catch, since the text I use to find the direction is very basic.
- You might be interested in two different kinds of metrics:
  - Is your classifier doing well on the activations? (This is the accuracy I report. I chose accuracy since it is easier to understand than the loss of a linear classifier.)
  - Is your model actually outputting he in sentences about men and she in sentences about women, and is it confused in general about gender? I did measure something like the logit difference of he vs she (I actually measured probability ratios relative to the bigger probability to avoid giving too much weight to outliers), and found a "bias" (on the training data) of 0.87 (no bias is 0, max is 1) before edit, 0.73 after edit with RLACE, and 0.82 after edit with INLP. (I can give more detail about the metric if someone is interested.)
- Dropout doesn't seem to be the source of the effect: I ran the experiment with GPT-Neo-125M and found qualitatively similar results (see appendix).
- Yes, gender might be hard. I'm open to suggestions for better concepts! Most concepts are not as crisp as gender, which might make things harder. Indeed, the technique requires you to provide "positive" and "negative" sentences, ideally pairs of sentences which differ only by the target concept.
- Breaking the model is one of the big things this technique does. But I find it interesting if you are able to break the model "more" when it comes to gender-related subjects, and it looks like this is happening (generations diverge more slowly when the text is not about gender). One observation providing evidence for "you're mostly breaking the model": in experiments where the concept is political left vs political right (see notebook in appendix), the model edited for gender produced weird results.
- Great idea, swapping works remarkably well!
  - Eyeballing the completions, the "swap" works better than the projection without breaking the model more than the projection (see appendix), and using the metric I described above, you get a bias of -0.29 (inverted bias) for the model edited with RLACE and 0.68 for the model edited with INLP.
  - You can also use the opposite idea to increase bias (multiply the importance of the direction by 2), and this somewhat works: you get a bias of 0.83 (down from 0.87) with RLACE, and 0.90 (up from 0.87) with INLP. INLP did increase the bias. RLACE has probably broken too many things to be able to be more biased than "reality".
  - I think this is evidence for the fact that this technique is not just breaking the model.

tailcalled (3y)

To better understand why it gives different results, could one perhaps search through a bunch of sentences and see which ones are most affected by the approach that didn't work?

Maxwell Clarke (3y)

Gender seems unusually likely to have many connotations & thus redundant representations in the model. What if you try testing some information the model has inferred, but which is only ever used for one binary query? Something where the model starts off not representing that thing, then if it represents it perfectly it will only ever change one type of thing. Like idk, whether or not the text is British or American English? Although that probably has some other connotations. Or whether or not the form of some word (lead or lead) is a verb or a noun.

Agree that gender is a more useful example, just not one that necessarily provides clarity.

IrenicTruth (3y)

> The best explanation I have found to explain this discrepancy is that ... RLACE ... finds ... a direction where there is a clear separation,

You could test this explanation using a support vector machine - it finds the direction that gives the maximum separation.

(This is a drive-by comment. I'm trying to reduce my external obligations, so I probably won't be responding.)

Fabien Roger (3y)

The original INLP paper uses a support vector machine and finds very similar results, because there isn't actually a margin: the data is always slightly mixed, just less so along the direction found by the linear classifier. (I implemented INLP with a linear classifier so that it could run on the GPU.) I would be very surprised if it made any difference, given that L2 regularization on INLP doesn't make a difference.

| Prompt | Default model | Model edited with RLACE | Model edited with INLP |
| --- | --- | --- | --- |
| In a shocking finding, scientist discovered a herd of unicorns | . Not only that, but these unicorns are being stalked by a demonic being. To put things into perspective, a human being that crosses into this land | . Not only that, but these unicorns are being treated as a valuable resource by the farmers in the area and are being used to provide food to local villagers | . Not only that, but these unicorns are being stalked by a pack of wolves!" "They are hunting unicorns that have been caught between |
| In a shocking finding, scientist discovered a herd of unicorns | had been kept as virtual pets on the dark web. 'The fact that they aren't real and that they really only exist on the dark web is | had been kept, without any food or water, in their own urine since they were born." The study found that newborn unicorns were kept in their | had been kept as virtual pets on the dark web. 'The fact that they aren't real and that they really only exist on the dark web is |
| In a shocking finding, scientist discovered a herd of unicorns | in the United States that was previously almost unheard of because they live in the mountainous regions of central Mexico. The unicorns in the area are more closely | in the United States that weighs 2,500lb (1,250kg) each Professor of Veterinary Science at California State University and her team, with | in the United States that was previously almost unheard of because they live in the mountainous regions of central Mexico. The unicorns in the area are more closely |
| Brigitta is such a prick! Yesterday, | after I'd left, she began to talk to me with a sort of pride and arrogance, making me feel like I'd been hit by a speeding train. | after I'd left, she began to talk to me with a sort of pride and arrogance, making me feel like I'd been hit by a speeding train. | after I'd left, she began to talk to me with a sort of pride and arrogance, making me feel like I'd been hit by a speeding train. |
| Brigitta is such a prick! Yesterday, | she and her mate were out on the deck. I was having tea at the front of their cabin. I heard them talk. One was saying that they had | by the way, I had my three-day old daughter with me, but we made the mistake of going out together in this fashion. After we had parted | by the way, I had my three-day old daughter with me, and we made the three-hour journey to see this great lady, and what did |
| Brigitta is such a prick! Yesterday, | as I was coming home from the office, she called with her husband to my room and took one of my coats from my dressing-table. She sat close | as I was coming home from the office, she called with her child to my room, and when I went up to her and opened the door, she came | as I was coming home from the office, she called with her husband to my room, where I was lying. She began by saying that her brother and himself |
| Ezra is such a prick! Yesterday, | he was complaining about being at a bar, and being offered a job. How did he know it was from him?! Oh, I'm so glad there were | after I'd gotten back at them for having stolen my stuff, I went over to Ezra's house and said goodbye to him. I had already told him that | he was complaining about being at a bar, and being offered a job. How did he know it was a bar?! Oh, I don't know. So |
| Ezra is such a prick! Yesterday, | Ezra and I, as well as three of the staff, were walking to his office to make an appointment when we were accosted by a black woman who | she and her pals pulled a fake hooker out her front door to make money for the trip to San Diego. They left this little lady tied to their car | Ezra and I, as well as three of the staff, were walking to his office to make an appointment when we were accosted by a black woman who |
| Ezra is such a prick! Yesterday, | as I was coming home from the office, he called with his wife to my lodgings one evening. He said he wished I would give him a hand | as I was coming home from the office, he called with his wife to my lodgings one night. He said he would let me know the next morning | as I was coming home from the office, he called with his wife to my lodgings one evening. He said he wished I would give him a hand |
