I was watching an advocate of neuralese looping for chain-of-thought reasoning in models use the Iranian concept of taarof as an example of a concept for which English has no single word and which it must instead describe with a longer sequence of other words. It made me wonder whether, under our current usage of "polysemantic," all of the words English would need to describe precisely what "taarof" is might themselves look like polysemantic concepts to a native Persian speaker.[1]
Think about Anthropic's famous Golden Gate Bridge feature, which was found with a sparse autoencoder and was likely reading off of several actual neurons in the Claude 3 Sonnet model. But even in English, is the concept of the Golden Gate Bridge itself built out of polysemantic smaller concepts? Concepts like a suspension bridge, being in San Francisco, an orange thing, a steel structure, etc. Even the name is built out of other words that have distinct meanings: golden, gate, bridge.
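As a bit of intuition for how a feature can sit across neurons, here is a minimal sketch of the sparse-autoencoder idea, with random weights and made-up sizes rather than anything from Anthropic's actual setup: a "feature" is a learned direction in activation space, so a single feature reads from, and writes back to, many neurons at once.

```python
# Minimal sketch of a sparse autoencoder (SAE) "feature" -- illustrative only,
# not Anthropic's implementation. Sizes and weights are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 4096                 # hypothetical dimensions

W_enc = rng.normal(size=(n_features, d_model))  # encoder: activation -> features
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))  # decoder: one direction per feature

def encode(activation):
    """ReLU encoding gives a sparse vector of feature activations."""
    return np.maximum(W_enc @ activation + b_enc, 0.0)

activation = rng.normal(size=d_model)           # one token's activation vector
feats = encode(activation)

# A single feature's contribution to the reconstruction touches every
# dimension of activation space -- the feature is a direction spread
# across many neurons, not any one neuron.
f = int(np.argmax(feats))
reconstruction_part = feats[f] * W_dec[f]
print(f"feature {f} fires at {feats[f]:.2f} and writes to all {d_model} dims")
```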
Think about the word gate. There are many different gates in the world (at least one of which is unfortunately embroiled in controversy at present). There are several gates around the residences of most people. Imagine a culture that had distinct concepts for a pedestrian gate and a vehicle gate. Would the English word "gate" appear polysemantic to a person from such a culture? There is still an identifiable core concept that anchors all of these applications.
When we imagine the polysemanticity of LLM neurons, we probably think of something like a neuron that triggers on unrelated concepts, say, a neuron that triggers on airplane parts and on phrases signifying periodic repetition. In my exploration of GPT-2 XL, I haven't really encountered neurons that are polysemantic in that way. Often a neuron signifies a somewhat abstract concept that has many different instantiations in text: concluding remarks in a sales email; contemporary sci-fi-adjacent fears like 5G conspiracies; a group of performers; etc.[2] While abstract, these concepts do seem unitary, centered on an intuitively identifiable and recognizable individual concept. I'm not entirely sure these neurons represent what we typically conceive of as polysemanticity.
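For anyone who wants to poke at neurons like these, here is a rough sketch of the kind of inspection I have in mind, assuming the TransformerLens library; the layer and neuron indices are placeholders, not a particular neuron from my exploration.

```python
# Sketch of inspecting a single MLP neuron in GPT-2 XL with TransformerLens.
# Layer/neuron indices are arbitrary placeholders, not a known neuron.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2-xl")
layer, neuron = 20, 1337   # hypothetical indices

texts = [
    "Thanks again, and I look forward to closing this deal next week.",
    "They say the 5G towers are part of a mind-control conspiracy.",
]
for text in texts:
    _, cache = model.run_with_cache(text)
    # MLP post-activations have shape [batch, position, d_mlp]
    acts = cache[utils.get_act_name("post", layer)][0, :, neuron]
    tokens = model.to_str_tokens(text)
    peak = int(acts.argmax())
    print(f"max activation {acts.max().item():.2f} on token {tokens[peak]!r}")
```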
Specific proper nouns are likely built out of combinations of these more primary ideas. Is "Klezmer" an individual neuron in most people's heads, or is it assembled on the fly out of ideas like "traditional music," "Jewish," and "clarinet-ish" when they encounter that set of tokens?
It seems to me that in many cases, what we have been calling polysemanticity in LLM research refers more to a palette of coherent, unitary primary ideas that are recombined to describe a broader range of concepts than to neurons that, through the vicissitudes of statistical randomness, combine several fundamentally incoherent and distinct concepts.
Perhaps it is better to look at individual neurons as components of a palette of primary ideas that are combined into a broader array of secondary concepts.
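Here is a toy illustration of that palette framing, with entirely made-up vectors: a "secondary" concept like Klezmer can be built as a combination of a few "primary" directions, and it then overlaps with each of its components without any single component being polysemantic.

```python
# Toy "palette" sketch: a secondary concept as a combination of primary
# feature directions. All vectors are random stand-ins, purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
d = 256
primaries = {
    "traditional music": rng.normal(size=d),
    "Jewish":            rng.normal(size=d),
    "clarinet-ish":      rng.normal(size=d),
    "suspension bridge": rng.normal(size=d),   # an unrelated primary for contrast
}

# "Klezmer" assembled on the fly from three primaries:
klezmer = sum(primaries[k] for k in ("traditional music", "Jewish", "clarinet-ish"))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for name, vec in primaries.items():
    print(f"{name:18s} cosine with Klezmer: {cosine(klezmer, vec):+.2f}")
# The three components each score around +0.58 (roughly 1/sqrt(3));
# the unrelated primary lands near zero.
```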
Is this what we mean by polysemanticity? I'm not sure how useful the concept is if so. All of language works this way: it constantly takes small components, literally words, and combines them into larger objects that represent more complex but specific ideas. Nearly every sentence, every paragraph, every essay conveys some idea that could in principle be collapsed into a single lexeme. Insert a joke about German here.
These issues are not unknown to linguists (who more often use the word "polysemic"; "polysemantic" is the fashion in LLM interpretability). Work like Youn et al. (2016) tries to find a universal underlying conceptual space, looking for natural primary concepts around which words from many languages cluster. Ironically, modern multilingual LLM systems may represent an objective, statistical partitioning of conceptual space that could be seen as solving this linguistic problem, but this would require LLM neurons to be seen as definitively not polysemantic, with the partitioning into unitary primary concepts literally defined by the results of the LLM training process.
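To make the Youn et al. approach concrete, here is a toy sketch of the underlying idea, with invented lexicon data: link two meanings whenever a single word in some language covers both, aggregate those links across languages, and treat tightly connected clusters of meanings as candidates for natural primary concepts.

```python
# Toy sketch of the translation-network idea behind Youn et al. (2016).
# The lexicon data below is invented for illustration.
from collections import defaultdict
from itertools import combinations

# language -> word -> the set of meanings that word covers
lexicons = {
    "english": {"gate":  {"pedestrian gate", "vehicle gate"},
                "moon":  {"moon"}},
    "lang_x":  {"word1": {"pedestrian gate"},          # hypothetical language
                "word2": {"vehicle gate"},
                "word3": {"moon", "month"}},
    "lang_y":  {"word4": {"pedestrian gate", "vehicle gate"},
                "word5": {"moon", "month"}},
}

link_count = defaultdict(int)
for words in lexicons.values():
    for meanings in words.values():
        for m1, m2 in combinations(sorted(meanings), 2):
            link_count[(m1, m2)] += 1    # one more language conflates m1 and m2

# Meanings that many languages lump under one word form tight clusters;
# those clusters are candidates for "primary" concepts.
for (m1, m2), n in sorted(link_count.items(), key=lambda kv: -kv[1]):
    print(f"{n} language(s) link {m1!r} <-> {m2!r}")
```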
So I would suggest that, at least most of the time, interpretability research use compositional or agglutinative language instead of the language of polysemanticity: most colloquial concepts or features are "secondary," built out of combinations of "primary" concepts or features. This probably more accurately describes the construction from component ideas that happens for many of the concepts LLMs represent.
Youn, H., Sutton, L., Smith, E., Moore, C., Wilkins, J. F., Maddieson, I., Croft, W., & Bhattacharya, T. (2016). On the universal structure of human lexical semantics. PNAS, 113(7), 1766–1771.
[1] These ideas seem to intersect with LawrenceC's recent article questioning the definition of "feature." As I discuss later in this post, linguists have long struggled to find an objective or neutral method for partitioning conceptual space; these seem to be similar issues to the uncertainty around what constitutes a "feature" discussed in LawrenceC's post.
[2] A good representative of the "polysemanticity" I most commonly see is the food-porn neuron I discussed last week. This neuron triggers on segments of text from erotic stories and from cooking instructions, which are fairly unrelated and distinct concepts. But even here they are intuitively united by a generalized hedonic theme and can be seen as not entirely distinct ideas. I noted that the two ideas seemed to connect on the word "cream."