"Epistemic status: Enthusiast attempting to synthesize ideas from the mechanistic interpretability literature into a coherent framing — could be wrong, corrections welcome"
"Disclaimer: This understanding developed through extended conversation with Claude. The ideas and writing are mine — Claude served as a sounding board and helped me stress-test the framing."
There is a lot of exciting research going on to understand what transformers do as information flows through them layer by layer. One underappreciated framing that I think deserves to be stated more clearly is this: a transformer takes highly ambiguous seed representations and iteratively refines them into specific, contextually appropriate meanings. Superposition — the encoding of multiple features in a single vector — is a feature, not a bug, at initialization. Starting from a superposed, context-free embedding is the right thing to do, and it is what makes the whole process possible in the first place. Superposition and iterative refinement are not separate phenomena; they are parts of a single coherent strategy.
Everything starts from the base embedding, which is context-free: it cannot commit to any specific representation because it has not yet seen any context. The token "bank" starts as a single vector that genuinely encodes both senses (the riverside and the financial institution) simultaneously as superposed features. The attention sublayers and the FFN sublayers (also called MLP sublayers) each build up context in their own way, gently nudging the residual vectors of tokens so that by the final layers the correct contextual embedding emerges for each token. The whole depth of the network is the process of earning the right to commit by gathering evidence. It is using context to collapse superposition.
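To make the superposition claim concrete, here is a toy numpy sketch. Everything in it is invented for illustration — the dimension, the two "sense" directions, and the construction of the embedding are stand-ins, not real model weights. The point is only that in high dimensions one vector can carry two near-orthogonal feature directions at once, and both read out with comparable strength:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical embedding dimension

def unit(v):
    return v / np.linalg.norm(v)

# Two made-up "sense" directions; in high dimensions, random
# directions are nearly orthogonal, so they can coexist cleanly.
river_sense = unit(rng.standard_normal(d))
finance_sense = unit(rng.standard_normal(d))

# The context-free embedding of "bank": both senses superposed.
bank = unit(river_sense + finance_sense)

# Reading out either sense gives a comparable, non-trivial signal,
# roughly 1/sqrt(2) since the two directions are near-orthogonal.
print(float(bank @ river_sense) > 0.5, float(bank @ finance_sense) > 0.5)
```

Neither probe dominates: the vector has genuinely committed to nothing yet.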
The way this refinement works geometrically is worth pausing on. A neuron in the MLP at, say, layer 7 receives the residual vector of a token — a high-dimensional vector still carrying multiple tangled feature directions — and fires in proportion to how much its weight vector aligns with that mixture. Its input could reflect "Trumpiness," "NFL," or "immigration" — the neuron cannot yet tell which. But it adds its findings back into the residual anyway, nudging it a wee bit. Think of it like a highway heading northwest out of Chicago: it also happens to go toward Milwaukee, Green Bay, and Minneapolis. One road, one direction, multiple destinations implied. Downstream attention heads then read context from neighboring tokens, see "quarterback" two positions back, and write a strong football-direction nudge into this token's residual. Later MLP layers, now receiving a residual already tilted toward sports, reinforce that direction further. Each layer is both a reader and a writer of the accumulated story. What looks like ambiguity at layer 7 is not a failure — it is an intermediate step in a convergence process. Coherent components get reinforced while incoherent components attenuate.
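The read-and-write dynamic can be sketched in a toy model. Again, all of this is hypothetical — the feature directions, the single polysemantic neuron, and the attention head's write are caricatures of what a trained transformer learns, not extracted weights — but it shows the geometry: a neuron fires on an ambiguous mixture and nudges the residual, then a context-driven write tips the balance:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

def unit(v):
    return v / np.linalg.norm(v)

football = unit(rng.standard_normal(d))   # made-up feature direction
finance = unit(rng.standard_normal(d))    # made-up feature direction

# The token's residual still superposes both interpretations.
residual = football + finance

# A hypothetical polysemantic MLP neuron: it reads a mixture,
# fires on alignment, and writes a small nudge back in.
w_in = unit(football + 0.5 * finance)
w_out = 0.3 * football
activation = max(0.0, float(w_in @ residual))  # ReLU on the alignment
residual = residual + activation * w_out

# A downstream attention head saw "quarterback" nearby and writes
# a strong football-direction vector into this residual.
residual = residual + 1.5 * football

# The football component now dominates the finance component.
print(float(residual @ football) > float(residual @ finance))  # True
```

The finance component was never explicitly erased; it just stopped being reinforced while the football component accumulated, which is the attenuation-by-neglect described above.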
This is also why depth matters. A shallow network cannot do this iterative refinement. You need many layers so that early, ambiguous signals can be contextualized by later layers that have access to a richer, partially resolved residual. Nostalgebraist's logit lens work — which probes intermediate layers and reads off their predictions — shows predictions becoming progressively sharper with depth, which is direct empirical support for the refinement story. Elhage et al. (Anthropic, 2021), in "A Mathematical Framework for Transformer Circuits," explicitly frame the residual stream as a communication channel where layers read and write, and the final representation is a sum of all contributions.
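The logit lens idea can be mimicked in a toy setting. To be clear, this is not nostalgebraist's actual procedure run on a real model: the unembedding matrix, the target token, and the per-layer nudges below are all invented. It only demonstrates the mechanics — decode each intermediate residual with the unembedding matrix and watch the implied prediction sharpen as layers accumulate context:

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab, n_layers = 32, 100, 6  # toy sizes, chosen arbitrarily

W_U = rng.standard_normal((vocab, d))  # stand-in unembedding matrix
target = 7                             # the token the context implies
target_dir = W_U[target] / np.linalg.norm(W_U[target])

residual = rng.standard_normal(d)      # ambiguous early residual

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = []
for _ in range(n_layers):
    # Each "layer" writes a context-driven nudge toward the answer.
    residual = residual + 1.5 * target_dir
    # Logit lens: decode the intermediate residual directly with W_U.
    probs.append(float(softmax(W_U @ residual)[target]))

# The model's implied prediction sharpens layer by layer.
print(all(b >= a for a, b in zip(probs, probs[1:])))  # True
```

In a real transformer the nudges come from learned attention and MLP writes rather than a fixed direction, but the readout trick is the same: the residual stream is interpretable mid-flight because every layer writes into the same shared space.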
A parallel from biology is instructive. A single gene does not do one thing — it contributes to multiple traits, much like a polysemantic neuron. Context — which proteins are present, which other genes are active — determines which interpretation dominates. The cell's final state is the accumulated result of many ambiguous signals resolving through interaction. The transformer is doing something structurally similar: holding multiple interpretations open cheaply at first, and resolving them expensively later, driven by context.
If this framing is right, then polysemanticity is not a problem to be fixed — it is the mechanism. The network is not storing clean representations that got accidentally tangled. It is doing something more deliberate: deferring commitment, holding multiple interpretations open, and using depth to earn the right to resolve them. Mechanistic interpretability research that treats superposition purely as noise to be cleaned up may be looking at it backwards. The tangle is the computation.