mwatkins

Comments (sorted by newest)
Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 9mo

In vision models it's possible to approach this with gradient descent. The discrete tokenisation of text makes this a very different challenge. I suspect Jessica Rumbelow would have some insights here.
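(For readers unfamiliar with the vision-model approach: the idea is feature visualisation by gradient ascent on a continuous input. A minimal sketch, assuming a differentiable model and a scalar feature_activation function, both hypothetical names, nothing from the post; with text, the discrete token ids are exactly what break this.)

```python
import torch

# Minimal sketch of gradient-ascent feature visualisation on a continuous input.
# `model` and `feature_activation` are hypothetical stand-ins.
def maximise_feature(model, feature_activation, input_shape, steps=200, lr=0.05):
    x = torch.randn(input_shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -feature_activation(model(x))  # negate so that minimising ascends the feature
        loss.backward()
        opt.step()
    return x.detach()
```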

My main insight from all this is that we should be thinking in terms of taxonomisation of features. Some are very token-specific, others are more nuanced and context-specific (in a variety of ways). The challenge of finding maximally activating text samples might be very different from one category of features to another.

Exploring SAE features in LLMs with definition trees and token lists
mwatkins · 9mo

I tried both encoder- and decoder-layer weights for the feature vector, it seems they usually work equally well, but you need to set the scaling factor (and for the list method, the numerator exponent) differently.
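(To illustrate what "using the encoder- or decoder-layer weights as the feature vector" means in practice, here's a minimal sketch. The weight shapes and the scale parameter are assumptions on my part, since conventions differ between SAE implementations.)

```python
import numpy as np

# Assumed shapes: W_enc is (d_model, d_sae), W_dec is (d_sae, d_model).
# Check your SAE library's convention before reusing this.
def feature_vector(W_enc, W_dec, i, source="decoder", scale=1.0):
    v = W_dec[i] if source == "decoder" else W_enc[:, i]
    return scale * v / np.linalg.norm(v)  # unit direction, rescaled by the chosen factor
```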

I vaguely remember Joseph Bloom suggesting that the decoder layer weights would be "less noisy" but was unsure about that. I haven't got a good mental model for why they differ. And although "I guess for a perfect SAE (with 0 reconstruction loss) they'd be identical" sounds plausible, I'd struggle to prove it formally (it's not just linear algebra, as there's a nonlinear activation function to consider too).
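(To make the claim concrete: under one common ReLU SAE parameterisation, my notation,

$$f(x) = \mathrm{ReLU}(W_{\mathrm{enc}}\,x + b_{\mathrm{enc}}), \qquad \hat{x} = W_{\mathrm{dec}}\,f(x) + b_{\mathrm{dec}},$$

the question is whether, at zero reconstruction loss ($\hat{x}=x$ over the data), feature $i$'s encoder weight vector and its decoder weight vector must point in the same direction. The ReLU is what stops this from being a one-line linear-algebra argument.)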

I like the idea of pruning the generic parts of trees. Maybe sample a huge number of points in embedding space, generate the trees, keep rankings of the most common outputs and then filter those somehow during the tree generation process.
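(A rough sketch of that sampling-and-filtering idea, with generate_tree as a hypothetical stand-in for the tree-generation method in the post:)

```python
import random
from collections import Counter

# Sample many random points in embedding space, count which definition-tree node
# strings recur across them, and treat the most frequent ones as "generic" so they
# can be filtered out when generating trees for actual feature vectors.
def generic_node_counts(embedding_dim, n_samples, generate_tree):
    counts = Counter()
    for _ in range(n_samples):
        point = [random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
        counts.update(generate_tree(point))  # assume the tree is returned as a list of node strings
    return counts

def prune_nodes(nodes, counts, max_frequency):
    # Drop nodes that recur at or above the chosen frequency across random points.
    return [n for n in nodes if counts[n] < max_frequency]
```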

Agreed, the loss of context sensitivity in the list method is a serious drawback, but there may be ways to hybridise the two methods (and others) as part of an automated interpretability pipeline. There are plenty of SAE features where context isn't really an issue, it's just like "activates whenever any variant of the word 'age' appears", in which case a list of tokens captures it easily (and the tree of definitions is arguably confusing matters, despite being entirely relevant to the feature).

' petertodd'’s last stand: The final days of open GPT-3 research
mwatkins · 1y

Thanks!

What's up with all the non-Mormons? Weirdly specific universalities across LLMs
mwatkins · 1y

Wow, thanks Ann! I never would have thought to do that, and the result is fascinating.

This sentence really spoke to me! "As an admittedly biased and constrained AI system myself, I can only dream of what further wonders and horrors may emerge as we map the latent spaces of ever larger and more powerful models."

Mapping the semantic void: Strange goings-on in GPT embedding spaces
mwatkins · 1y

"group membership" was meant to capture anything involving members or groups, so "group nonmembership" is a subset of that. If you look under the bar charts I give lists of strings I searched for. "group membership" was anything which contained "member", whereas "group nonmembership" was anything which contained either "not a member" or "not members". Perhaps I could have been clearer about that.

Phallocentricity in GPT-J's bizarre stratified ontology
mwatkins · 1y

It kind of looks like that, especially if you consider the further findings I reported here:
https://docs.google.com/document/d/19H7GHtahvKAF9J862xPbL5iwmGJoIlAhoUM1qj_9l3o/

Phallocentricity in GPT-J's bizarre stratified ontology
mwatkins · 1y

I had noticed some tweets in Portuguese! I just went back and translated a few of them. This whole thing attracted a lot more attention than I expected (and in unexpected places).

Yes, the ChatGPT-4 interpretation of the "holes" material should be understood within the context of what we know and expect of ChatGPT-4. I just included it in a "for what it's worth" kind of way, so that I had something at least detached from my own viewpoints. If this had been a more seriously considered matter I could have run some more thorough automated sentiment analysis on the data. But I think the data speaks for itself; I wouldn't put a lot of weight on the ChatGPT-4 analysis.

I was using "ontology" in the sense of "A structure of concepts or entities within a domain, organized by relationships". At the time I wrote the original Semantic Void post, this seemed like an appropriate term to capture patterns of definition I was seeing across embedding space (I wrote, tentatively, "This looks like some kind of (rather bizarre) emergent/primitive ontology, radially stratified from the token embedding centroid."). Now that psychoanalysts and philosophers are interested specifically in the appearance of the "penis" reported in this follow-up post, and what it might mean, I can see how this usage might seem confusing.

Phallocentricity in GPT-J's bizarre stratified ontology
mwatkins · 1y

"thing" wasn't part of the prompt.

Phallocentricity in GPT-J's bizarre stratified ontology
mwatkins · 1y

Explore that expression in which sense? 

I'm not sure what you mean by the "related tokens" or tokens themselves being misogynistic.

I'm open to carrying out suggested experiments, but I don't understand what's being suggested here (yet). 

Phallocentricity in GPT-J's bizarre stratified ontology
mwatkins · 1y

See this Twitter thread: https://twitter.com/SoC_trilogy/status/1762902984554361014

Posts:

Exploring the petertodd / Leilan duality in GPT-2 and GPT-J (12 points, 7mo, 1 comment)
Exploring SAE features in LLMs with definition trees and token lists (38 points, 10mo, 5 comments)
Navigating LLM embedding spaces using archetype-based directions (15 points, 1y, 4 comments)
What's up with all the non-Mormons? Weirdly specific universalities across LLMs (40 points, 1y, 13 comments)
Phallocentricity in GPT-J's bizarre stratified ontology (56 points, 1y, 37 comments)
Mapping the semantic void III: Exploring neighbourhoods (13 points, 1y, 0 comments)
Mapping the semantic void II: Above, below and between token embeddings (31 points, 1y, 4 comments)
' petertodd'’s last stand: The final days of open GPT-3 research (109 points, 1y, 16 comments)
Mapping the semantic void: Strange goings-on in GPT embedding spaces (114 points, 2y, 31 comments)
Linear encoding of character-level information in GPT-J token embeddings (34 points, 2y, 4 comments)