LESSWRONG

RGRGRG

Comments (sorted by newest)
Bridging the VLM and mech interp communities for multimodal interpretability
RGRGRG · 9mo

Hi Sonia - could you please explain what you mean by "mixed selectivity"? In particular, I don't understand the claim that "Some of these studies then seem to conclude that SAEs alleviate superposition when really they may alleviate mixed selectivity." Thanks.

StefanHex's Shortform
RGRGRG · 10mo

I like this recent post about atomic meta-SAE features; I think they are much closer than normal SAE latents to what I expect atomic units to look like:

https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5/showing-sae-latents-are-not-atomic-using-meta-saes

Growth and Form in a Toy Model of Superposition
RGRGRG · 10mo

Would you be willing to share the raw data from the "Developmental Stages of TMS" plot? I'm specifically hoping to look at line plots of weights vs. biases over time.
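For concreteness, this is the kind of plot I'm hoping to make - a minimal sketch assuming the raw data could be exported as per-step arrays of W and b (the array names, shapes, and random stand-in values here are my own guesses, not anything from the post):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the raw training history: in the TMS setup W has
# shape (num_features, hidden_dim) and b has shape (num_features,), logged
# once per recorded training step.
num_steps, num_features, hidden_dim = 500, 6, 2
W_history = np.random.randn(num_steps, num_features, hidden_dim).cumsum(axis=0) * 0.02
b_history = np.random.randn(num_steps, num_features).cumsum(axis=0) * 0.02

fig, (ax_w, ax_b) = plt.subplots(2, 1, sharex=True)
for i in range(num_features):
    ax_w.plot(np.linalg.norm(W_history[:, i, :], axis=-1), label=f"|W_{i}|")
    ax_b.plot(b_history[:, i], label=f"b_{i}")
ax_w.set_ylabel("|W_i|")
ax_b.set_ylabel("b_i")
ax_b.set_xlabel("training step")
ax_w.legend(fontsize="small", ncol=2)
plt.show()
```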

Thanks.

There Should Be More Alignment-Driven Startups
RGRGRG · 1y

What are the terms of the seed funding prize(s)?

Mechanistically Eliciting Latent Behaviors in Language Models
RGRGRG · 1y

Enjoyed this post! Quick question about obtaining the steering vectors:

Do you train them one at a time, possibly adding an additional orthogonality constraint between successive training runs?
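To illustrate what I mean, here is a rough sketch of the scheme I'm imagining - sequential training, with each new vector kept orthogonal to the previously learned ones. `objective` stands in for whatever scalar steering objective is being maximized; none of this is your actual code:

```python
import torch

def project_out(v, basis):
    """Remove from v the components along previously learned vectors."""
    for u in basis:
        v = v - (v @ u) / (u @ u) * u
    return v

def train_steering_vectors(objective, d_model, n_vectors=8, steps=200, lr=1e-2):
    """Hypothetical sequential scheme: learn one steering vector at a time,
    constraining each new vector to be orthogonal to all earlier ones."""
    learned = []
    for _ in range(n_vectors):
        v = torch.randn(d_model, requires_grad=True)
        opt = torch.optim.Adam([v], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            # Re-project every step so the orthogonality constraint holds
            # throughout training, not just at initialization.
            v_orth = project_out(v, learned)
            loss = -objective(v_orth)  # maximize the steering objective
            loss.backward()
            opt.step()
        learned.append(project_out(v.detach(), learned))
    return learned
```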

Transcoders enable fine-grained interpretable circuit analysis for language models
RGRGRG · 1y

Question about the "rules of the game" you present. Are you allowed to simply look at layer-0 transcoder features for the final 10 tokens? You could probably roughly estimate the input string from those features' top activators. From your case study, it seems that you effectively look at layer-0 transcoder features for a few of the final tokens through a backwards search, but I wonder whether you can skip the search and simply look at the transcoder features directly. Thank you.
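To be concrete, the shortcut I have in mind would look something like this - a rough sketch that assumes the transcoder's encoder is a ReLU of an affine map of the layer-0 MLP input (the function and argument names are mine, not from your code):

```python
import torch

def top_layer0_transcoder_features(mlp_in, W_enc, b_enc, k=5, last_n=10):
    """mlp_in: (seq_len, d_model) layer-0 MLP inputs for the prompt;
    W_enc: (d_model, n_features); b_enc: (n_features,).
    Returns, for each of the last `last_n` tokens, the indices of its k most
    active transcoder features, whose top activators one could then inspect."""
    acts = torch.relu(mlp_in[-last_n:] @ W_enc + b_enc)  # (last_n, n_features)
    return acts.topk(k, dim=-1).indices
```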

Finding Sparse Linear Connections between Features in LLMs
RGRGRG · 2y

To confirm: the weights you share, such as 0.26 and 0.23, are each individual entries in the W matrix for y = Wx?
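In other words, something like this toy illustration (the placement of the entries is made up; only the two values come from the post):

```python
import numpy as np

# Each reported weight is read as a single entry W[i, j] of the sparse linear
# map y = W x between the two feature spaces.
W = np.zeros((4, 4))
W[0, 1] = 0.26  # output feature 0 reads input feature 1 with weight 0.26
W[2, 3] = 0.23  # output feature 2 reads input feature 3 with weight 0.23

x = np.array([0.0, 1.0, 0.0, 1.0])  # input feature activations
y = W @ x                           # y[0] == 0.26, y[2] == 0.23
```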

Growth and Form in a Toy Model of Superposition
RGRGRG · 2y

This is a casual thought and by no means something I've thought hard about - I'm curious whether b is a lagging indicator, which is to say, there's actually more magic going on in the weights, and once the weights go through this change, b catches up to it.

Another speculative thought: let's say we are moving from 4* -> 5* and W_3 is the new weight vector that is taking on high magnitude. Does this occur because W_3 somehow has enough internal individual weights to jointly look at its two (new) neighbors' W_i's roughly equally?

Does the cosine similarity and/or dot product of this new W_3 with its neighbors grow during the 4* -> 5* transition (and does this occur prior to the change in b)?
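To make that last question concrete, the measurement I have in mind is roughly this (a sketch; `W_history` is a hypothetical per-step log of the weights with shape (num_steps, num_features, hidden_dim), and the neighbor indices are placeholders):

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def neighbor_similarity_over_time(W_history, i=3, neighbors=(2, 4)):
    """Per logged step, the cosine similarity of W_i with each listed neighbor,
    to check whether the alignment grows before the change in b."""
    return np.array([[cos_sim(W[i], W[j]) for j in neighbors] for W in W_history])
```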

Growth and Form in a Toy Model of Superposition
RGRGRG · 2y

Question about the gif - to me it looks like the phase transition is more like:

4++- to unstable 5+- to 4+- to 5-
(Unstable 5+- seems to have similar loss to 4+-.)

Why do we not count the large red bar as a "-"?

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
RGRGRG · 2y

Do you expect similar results (besides the fact that it would take longer to train / cost more) without using LoRA?
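For context on what I mean by "without using LoRA" - a sketch of the contrast, assuming the Hugging Face peft API and illustrative hyperparameters rather than the paper's (loading the 70B weights obviously requires the gated checkpoint and suitable hardware):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

# Full fine-tuning would optimize every base parameter:
full_param_count = sum(p.numel() for p in base.parameters())

# LoRA instead freezes the base model and trains only small low-rank adapters:
lora_model = get_peft_model(
    base,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"]),
)
lora_param_count = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)

print(f"full fine-tuning trains {full_param_count:,} params; LoRA trains {lora_param_count:,}")
```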

Posts

3 karma · Seeking Feedback on My Mechanistic Interpretability Research Agenda · 2y · 1 comment
24 karma · Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) · 2y · 5 comments
9 karma · Best Ways to Try to Get Funding for Alignment Research? [Question] · 2y · 6 comments