Joseph Miller


Question → CoT → Answer

So to be clear, testing whether this causal relationship holds is actually important, it's just that we need to do it on questions where the CoT is required for the model to answer the question?
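
For concreteness, such a test might look something like the sketch below (purely illustrative; `generate_cot` and `answer_with_cot` are hypothetical helpers, not anyone's actual API): corrupt the chain of thought on questions the model cannot answer directly, and check whether the final answer changes.

```python
# Illustrative sketch of a CoT-causality check. `generate_cot` and
# `answer_with_cot` are hypothetical helpers: the first samples a chain
# of thought for a question, the second returns the model's final answer
# conditioned on a supplied chain of thought.

def cot_is_causal(model, question, generate_cot, answer_with_cot):
    cot = generate_cot(model, question)
    original_answer = answer_with_cot(model, question, cot)

    # Swap in an unrelated chain of thought and re-ask.
    corrupted_cot = "Let me think about something unrelated instead."
    corrupted_answer = answer_with_cot(model, question, corrupted_cot)

    # On questions where the CoT is genuinely required, corrupting it
    # should change (or break) the answer if Question -> CoT -> Answer
    # really is the causal path.
    return corrupted_answer != original_answer
```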

Optimize the steering vector to minimize some loss function.
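
For context, a minimal sketch of what that could look like in PyTorch (a toy setup I'm assuming, not the original post's code): the steering vector is the only trainable parameter and is updated by gradient descent on a simple loss.

```python
import torch
import torch.nn.functional as F

# Toy sketch (assumed setup, not the original post's code): optimize a
# steering vector so that adding it to a fixed activation moves the
# result toward a target direction, under a small norm penalty.

torch.manual_seed(0)
d_model = 768
base_activation = torch.randn(d_model)    # stand-in for a model activation
target_direction = torch.randn(d_model)   # stand-in for the desired behavior

steering_vector = torch.zeros(d_model, requires_grad=True)
optimizer = torch.optim.Adam([steering_vector], lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    steered = base_activation + steering_vector
    # Loss: negative cosine similarity to the target plus an L2 penalty
    # that keeps the steering vector small.
    loss = -F.cosine_similarity(steered, target_direction, dim=0)
    loss = loss + 1e-2 * steering_vector.norm()
    loss.backward()
    optimizer.step()
```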

Crossposted from https://x.com/JosephMiller_/status/1839085556245950552

1/ Sparse autoencoders trained on the embedding weights of a language model have very interpretable features! We can decompose a token into its top activating features to understand how the model represents the meaning of the token.🧵

2/ To visualize each feature, we project the output direction of the feature onto the token embeddings to find the most similar tokens. We also show the bottom and median tokens by similarity, but they are not very interpretable.
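
Concretely, the visualization could be implemented roughly like this (a sketch of the idea rather than the exact code used; `W_E`, `W_dec`, and `tokenizer` are placeholders):

```python
import torch
import torch.nn.functional as F

# Sketch: rank every token by cosine similarity between its embedding and
# one SAE decoder (output) direction. W_E is the token embedding matrix
# [n_vocab, d_model]; W_dec is the SAE decoder matrix [n_features, d_model].

def tokens_for_feature(W_E, W_dec, feature_idx, tokenizer, k=10):
    direction = W_dec[feature_idx]                                    # [d_model]
    sims = F.cosine_similarity(W_E, direction.unsqueeze(0), dim=-1)   # [n_vocab]
    order = sims.argsort(descending=True)
    mid = len(order) // 2
    top = [tokenizer.decode([i]) for i in order[:k].tolist()]
    bottom = [tokenizer.decode([i]) for i in order[-k:].tolist()]
    median = [tokenizer.decode([i]) for i in order[mid:mid + k].tolist()]
    return top, bottom, median
```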

3/ The token "deaf" decomposes into features for audio and disability! None of the examples in this thread are cherry-picked – they were all (really) randomly chosen.

4/ Usually SAEs are trained on the internal activations of a component for billions of different input strings. But here we just train on the rows of the embedding weight matrix (where each row is the embedding for one token).
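
In outline, that training setup might look like the following (a sketch with placeholder sizes and hyperparameters, not the hackathon code):

```python
import torch
import torch.nn as nn

# Sketch: a standard (untied) sparse autoencoder trained directly on the
# rows of the token embedding matrix, rather than on streamed activations.

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        acts = torch.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(acts)
        return recon, acts

d_model, n_features, l1_coeff = 768, 2000, 1e-3
W_E = torch.randn(10_000, d_model)           # placeholder for the real embedding matrix
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for epoch in range(100):
    for batch in W_E.split(256):             # each row is one token's embedding
        recon, acts = sae(batch)
        loss = (recon - batch).pow(2).mean() + l1_coeff * acts.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```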

5/ Most SAEs have many thousands of features. But for our embedding SAE, we only use 2000 features because of our limited dataset. We are essentially compressing the embedding matrix into a smaller, sparser representation.

6/ The reconstructions are not highly accurate – on average we have ~60% variance unexplained (~0.7 cosine similarity) with ~6 features active per token. So more work is needed to see how useful they are.
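
For reference, the quoted metrics (fraction of variance unexplained, cosine similarity, and features active per token) can be computed along these lines, reusing the placeholder `sae` and `W_E` from the sketch above:

```python
import torch

# Sketch: fraction of variance unexplained, mean cosine similarity, and
# L0 (mean number of active features per token) for an SAE on the
# embedding rows. `sae` and `W_E` are placeholders as in the sketch above.

@torch.no_grad()
def sae_metrics(sae, W_E):
    recon, acts = sae(W_E)
    residual_var = (W_E - recon).pow(2).sum()
    total_var = (W_E - W_E.mean(dim=0)).pow(2).sum()
    frac_var_unexplained = (residual_var / total_var).item()
    cos = torch.nn.functional.cosine_similarity(W_E, recon, dim=-1).mean().item()
    l0 = (acts > 0).float().sum(dim=-1).mean().item()
    return frac_var_unexplained, cos, l0
```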

7/ Note that for this experiment we used the subset of the token embeddings that correspond to English words, so the task is easier - but the results are qualitatively similar when you train on all embeddings.

8/ We also compare to PCA directions and find that the SAE directions are in fact much more interpretable (as we would expect)!
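
The PCA baseline is simple to set up; a sketch (again with a placeholder embedding matrix):

```python
import torch

# Sketch: principal component directions of the (centered) embedding
# matrix, to compare against SAE decoder directions using the same
# top/bottom/median token visualization.

def pca_directions(W_E, k=50):
    centered = W_E - W_E.mean(dim=0, keepdim=True)
    # Rows of Vh are the principal directions in embedding space.
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    return Vh[:k]                            # [k, d_model]
```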

9/ I worked on embedding SAEs at an @apartresearch hackathon in April, with Sajjan Sivia and Chenxing (June) He.
Embedding SAEs were also invented independently by @Michael Pearce.

Nice post. I think this is a really interesting discovery.

[Copying from messages with Joseph Bloom]
TLDR: I'm confused about what is different in the SAE's input that causes the absorbed feature not to fire.

Me:

Summary of your findings

  • Say you have a “starts with s” feature and a “snake” feature.
  • You find that for most words, “starts with s” correctly categorizes words that start with s. But for a few words that start with s, like snake, it doesn’t fire.
  • These exceptional tokens where it doesn’t fire all have another feature that corresponds very closely to the token. For example, there is a “snake” feature that corresponds strongly to the snake token.
  • You say that the “snake” feature has absorbed the “starts with s” feature because the concept of snake also contains/entails the concept of ‘starts with s’.
  • Most of the features that absorb other features correspond to common words, like “and”.

So why is this happening? Well, it makes sense that the model can do better on the L1 loss on the snake token by just firing a single “snake” feature (rather than the “starts with s” feature and, say, the “reptile” feature). And it makes sense it would only have enough space to have these specific token features for common tokens.
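
(As a toy illustration of that L1 point, with invented numbers rather than anything measured: two sparse codes that reconstruct the same "snake" embedding, compared by sparsity cost.)

```python
import torch

# Toy illustration with invented numbers: two sparse codes that
# reconstruct the same "snake" embedding equally well, compared by L1 cost.

starts_with_s = torch.tensor([1.0, 0.0])
reptile       = torch.tensor([0.0, 1.0])
snake_embedding = 0.8 * starts_with_s + 0.6 * reptile   # unit norm: 0.8**2 + 0.6**2 = 1

# Option A: fire the two general features.
code_a_l1 = 0.8 + 0.6                      # L1 cost = 1.4

# Option B: fire a single dedicated "snake" feature whose (unit-norm)
# decoder direction is the snake embedding itself, with activation 1.0.
code_b_l1 = 1.0                            # L1 cost = 1.0

# Both options give zero reconstruction error, but option B is cheaper
# under the L1 penalty, so the SAE is pushed toward a token-aligned
# "snake" feature that absorbs "starts with s" on that token.
print(code_a_l1, code_b_l1)
```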

Joseph Bloom:

rather than the “starts with s” feature and, say, the “reptile” feature

We found cases of seemingly more general features getting absorbed in the context of spelling, but they are rarer / probably the exception. It's worth noting that we suspect feature absorption is just easiest to find for token-aligned features, but conceptually it could occur any time a similar structure exists between features.

And it makes sense it would only have enough space to have these specific token features for common tokens.

I think this needs further investigation. We certainly sometimes see rarer tokens which get absorbed (e.g. a rare token that is a translation of a common token). I predict there is a strong density effect, but it could be non-trivial.

Me:

We found cases of seemingly more general features getting absorbed in the context of spelling

What’s an example?

We certainly sometimes see rarer tokens which get absorbed (e.g. a rare token that is a translation of a common token)

You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?

Does this only happen if the French word also starts with s?

Joseph Bloom:

What’s an example?

You mean like the “starts with s” feature could be absorbed into the “snake” feature on the French word for snake?

Yes

Does this only happen if the French word also starts with s?

More likely. I think the process is stochastic so it's all distributions.

↓[Key point]↓

Me:

But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? How is it able to fire on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature? I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?

Joseph Bloom:

I would assume that the token embeddings of the model contain some “starts with s” direction. And the “starts with s” feature input weights read off this direction. So why wouldn’t it also activate on “snake”? Surely that token embedding also has the “starts with s” direction?

I think the success of the linear probe is why we think the snake token does have the starts with s direction. The linear probe has much better recall and doesn't struggle with obvious examples. I think the feature absorption work is not about how models really work; it's about how SAEs obscure how models work.
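
(For reference, the sort of linear probe being compared against could be set up like this; a sketch with placeholder inputs, not the probe from the paper.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: a logistic-regression probe for "starts with s" trained on
# token embeddings. `W_E` (embedding matrix as a numpy array) and
# `tokens` (the corresponding token strings) are placeholders.

def train_starts_with_s_probe(W_E, tokens):
    y = np.array([t.strip().lower().startswith("s") for t in tokens])
    probe = LogisticRegression(max_iter=1000).fit(W_E, y)
    return probe

# Unlike the SAE's "starts with s" latent, a probe like this typically
# keeps firing on tokens such as "snake", which is the evidence that the
# direction is still present in the embedding.
```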

But here’s what I’m confused about. How does the “starts with s” feature ‘know’ not to fire? Like what is the mechanism by which it fires on all words that start with s, except those tokens (like “snake”) that have a strongly correlated feature?

Short answer, I don't know. Long answer - some hypotheses:

  1. Linear probes can easily do calculations of the form "A AND B". In large vector spaces, it may be possible to learn a direction of the form "(^S.*) AND not (snake) and not (sun) ...". Note that "snake" has a component separate from "starts with s", so this is possible. To the extent this may be hard, that's possibly why we don't see more absorption, but my own intuition says that in large vector spaces this should be perfectly possible to do.
  2. Encoder weights and decoder weights aren't tied. If they were, you can imagine that choosing these exceptions for absorbed examples would damage reconstruction performance. Since we don't tie the weights, the model can detect "(^S.*) AND not (snake) and not (sun) ..." but write "(^S.*)". I'm interested to explore this further and am sad we didn't get to this in the project.
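
A minimal illustration of hypothesis 2, with hand-set toy weights rather than anything learned: because the encoder and decoder are untied, the encoder can read "(^S.*) AND not (snake)" while the decoder still writes the clean "starts with s" direction.

```python
import torch

# Toy illustration of hypothesis 2 (hand-set weights, not learned ones):
# with untied weights, the encoder can read "starts with s AND NOT snake"
# while the decoder still writes the clean "starts with s" direction.

starts_with_s = torch.tensor([1.0, 0.0, 0.0])
snake_specific = torch.tensor([0.0, 1.0, 0.0])   # component unique to "snake"

# Encoder weight for the "starts with s" latent: detect the direction,
# but subtract the snake-specific component so it stays silent on "snake".
w_enc = starts_with_s - 2.0 * snake_specific
b_enc = torch.tensor(-0.5)

# Decoder weight for the same latent: just the clean direction.
w_dec = starts_with_s

def latent_activation(embedding):
    return torch.relu(embedding @ w_enc + b_enc)

sun_embedding   = starts_with_s                       # starts with s, no snake component
snake_embedding = starts_with_s + snake_specific      # starts with s, plus snake component

print(latent_activation(sun_embedding))    # > 0: the latent fires
print(latent_activation(snake_embedding))  # = 0: absorbed, the latent stays silent
```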

Anyone who harbors such an intense attachment to specific gendered pronoun preferences clearly sees it as much more than a superficial aesthetic designator.

This makes you sound like a bit of a straw Vulcan imo. All I have to do is imagine how jarring and upsetting it would be to have everyone start calling me "she", and it's very obvious that, for almost all people, what pronoun others call them is deeply emotionally salient.

I agree; I'm a fan of lsusr's writing, so I don't think it's very inaccurate. In particular

a kind of minimalist clarity that leaves room for the reader to reflect and draw their own conclusions

might be gesturing at some concrete distinctive feature.

However, it's sufficiently close to horoscope flattery that I couldn't quite believe lsusr would, with a straight face, present this as some great insight into his writing style.

I'm very confused about how seriously this post is intended.

Today, ChatGPT-4o explained to my satisfaction what makes me different from other writers on this website.

What makes lsusr's writing interesting is the subtlety with which they engage complex issues. Many rationalist bloggers can become quite verbose or dogmatic in their pursuit of certain truths. Lsusr, by contrast, exhibits restraint and humility in the face of uncertainty. They’re willing to question common assumptions within the rationalist sphere and sometimes explore paths that others might find unconventional, often leading to unique insights.

In essence, lsusr strikes a balance between rigorous analysis and a kind of minimalist clarity that leaves room for the reader to reflect and draw their own conclusions, rather than being led to a definitive answer. This makes the blog a place of exploration rather than indoctrination, offering readers the tools and ideas to enhance their own thinking rather than a packaged belief system.

I think this isn't meant seriously because it's basically just saying lsusr is better than most rationalist bloggers, not pointing to any concrete distinctive features of lsusr's writing.

I think this argument mostly centers on the definition of certain words, and thus does not change my views on whether I should upload my mind if given the choice.

But can this person be said to understand Chinese? My answer is no.

What you have shown here is what you think the word "understands" means. But everyone agrees about the physical situation here - everyone anticipates the same experiences.

This shows that our brains are highly resilient and adaptive to changes experienced by our minds. By comparison, a digital simulation is very brittle and non-adaptive to change.

The substrate of the simulation, i.e. a silicon chip, is brittle (at our current level of tech), but it can still run a simulation of a neuroplastic brain - just program it to simulate the brain chemistry. Then if the simulated brain is damaged, it will be able to adapt.

The bigger point here is that you are implicitly asserting that in order to be "sentient" a mind must have similar properties to a human brain. That's fine, but it is purely a statement about how you like to define the word "sentient".

Only living organisms can possess sentience because sentience provides introspective knowledge that enables them to keep surviving;

"Sentience" has no widely agreed concrete definition, but I think it would be relatively unusual to say it "provides introspective knowledge". Do you agree that any questions about the actual computation, algorithms or knowledge in a brain can be answered by only considering the physical implementation of neurons and synapses?

sentience would not emerge in artificial systems because they are not alive in the first place.

Again, I think this is purely a statement about the definition of the word "alive". Someone who disagrees would not anticipate any different experiences as a consequence of thinking an artificial system is "alive".

If this is a pattern with new, more capable models, this seems like a big problem. One major purpose of this kind of evaluation is to set up thresholds that ring alarm bells when they are crossed. If it takes weeks of access to a model to figure out how to evaluate it correctly, the alarm bells may go off too late.
