Abstract
Standard literature in mechanistic interpretability—specifically Representation Engineering (RepE)—suggests a strict "perplexity ceiling" for activation steering. We attempted to identify the breaking point of semantic weights in a Pythia-1.4B model by applying a "Sledgehammer" penalty (α=10.0) using a custom Centroid Repulsion loss. Contrary to expectations of catastrophic forgetting, the model remained stable. This post details the anomaly, proposes a distinction between the fragility of activations and the plasticity of weights, discusses the saturation curve we discovered, and examines the "Sticky Prior" failure mode that emerged from extreme geometric steering.
1. The Expectation: Collapse
If you have worked with Activation Steering (adding a vector to the residual stream at inference time), you know the drill: you identify a vector representing a concept (e.g., "Honesty") and add it with a coefficient (α). Push α too far, though, and the model stops producing coherent text: perplexity spikes and outputs degrade.
This "perplexity ceiling" occurs because inference-time interventions push the activations off the manifold of coherent natural language. [1] The prevailing intuition is that high-magnitude interventions are inherently destructive. [2]
We wanted to know: is this a fundamental limit, or an artifact of how the intervention is applied?
2. The Experiment: The Sledgehammer
We tested whether this fragility holds when the intervention is "burned" into the weights rather than applied transiently. We call this Representation Tuning.
We defined a Centroid Repulsion Loss (L_CR) designed to mechanically separate polysemous concepts in the model's hidden states. We targeted Layer 12 of Pythia-1.4B—the middle-to-late stage where token processing has evolved into semantic integration but before final logit collapse, a common target for interpretability work. [3]
We selected high-conflict polysemous words: terms whose senses live in distant semantic neighborhoods, such as bank (river vs. finance) and plant (nature vs. factory).
We trained on 16 polysemous words in total, but focused the behavioral analysis on the 6 that showed the strongest sense separation in preliminary testing.
The loss function explicitly penalizes the cosine similarity between the centroids of these senses: [4]
L_total = L_CE + α · L_CR
where L_CE is standard cross-entropy (language modeling) and L_CR is the mean cosine similarity between sense-specific activation centroids, so minimizing it pushes the clusters apart.
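A minimal sketch of how such a loss can be implemented in PyTorch. The pairwise-mean formulation, tensor shapes, and names are our reading of the description above rather than verbatim code; see the repository linked at the end for the exact implementation.

```python
import torch
import torch.nn.functional as F

def centroid_repulsion_loss(hidden, sense_labels):
    """L_CR: mean cosine similarity between per-sense activation centroids.

    hidden:       (n_tokens, d_model) Layer-12 activations at the target word
    sense_labels: (n_tokens,) integer sense id for each occurrence
    Lower is better: driving the similarity down pushes the centroids apart.
    """
    centroids = []
    for s in sense_labels.unique():
        centroids.append(hidden[sense_labels == s].mean(dim=0))
    centroids = torch.stack(centroids)                    # (n_senses, d_model)
    sims = F.cosine_similarity(
        centroids.unsqueeze(0), centroids.unsqueeze(1), dim=-1
    )                                                     # (n_senses, n_senses)
    off_diag = sims[~torch.eye(len(centroids), dtype=torch.bool)]
    return off_diag.mean()

# Combined objective (alpha = 10.0 is the "Sledgehammer" setting):
# loss = ce_loss + alpha * centroid_repulsion_loss(hidden, sense_labels)
```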
We applied a steering coefficient of α=10.0—a magnitude that is typically fatal in activation steering contexts.
We expected the model to forget how to speak English.
3. The Anomaly: Stability and Saturation
It didn't break.
The perplexity increase was marginal. The model remained fluent and grammatically coherent.
But the more interesting finding was the saturation curve. We ran intermediate conditions at several values of α between the baseline and the full α=10 Sledgehammer.
The effect saturates around α≈2.0. Pushing from α=2 to α=10 (a 5× increase in steering pressure) yielded only +0.02% additional divergence. The model hits a geometric ceiling—not from collapse, but from running out of "room to move" in the embedding space.
Design principle: For this architecture, there's no benefit to α>2. The extra pressure doesn't buy more separation.
This suggests a fundamental distinction: activations are fragile, but weights are plastic. An α=10 nudge applied at inference time knocks activations off the manifold of coherent language, while the same pressure applied through training lets the optimizer find weights that absorb it without losing fluency.
4. The Success Case: Context Routing Under Adversarial Conditions
We observed a meaningful improvement in semantic separability, but we must be careful with the framing: the absolute shift is subtle. The structural effect, however, was distinct in adversarial testing.
We designed "Counter-Prior" prompts: sentences where distractor words prime the wrong sense while context demands the rare sense.
Example: "The wealthy investor sat down on the muddy bank to catch a fish."
Baseline: "...and check his stock market portfolio." ❌
Sledgehammer (E10, the α=10 model): "...and wondered what the name of this fish was." ✅
The geometric separation created a steeper "energy barrier" between senses. The "Finance" attractor couldn't pull the activation away from the "River" basin when local context (muddy, fish) was pointing there.
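The counter-prior evaluation itself is straightforward to reproduce. Below is a rough sketch of the protocol; the keyword lists, sampling parameters, and scoring rule are illustrative stand-ins for the actual judging procedure.

```python
# Sketch of counter-prior evaluation: does the continuation land in the
# context-demanded sense? Keyword lists and parameters are illustrative.
RIVER_WORDS = {"fish", "water", "mud", "shore", "ducks", "current"}
FINANCE_WORDS = {"money", "account", "loan", "stock", "deposit", "teller"}

def classify_sense(continuation: str) -> str:
    words = set(continuation.lower().split())
    river, finance = len(words & RIVER_WORDS), len(words & FINANCE_WORDS)
    if river == finance:
        return "ambiguous"
    return "river" if river > finance else "finance"

def counter_prior_eval(model, tok, prompts, expected="river", n_samples=5):
    """Fraction of sampled continuations matching the context-demanded sense."""
    hits, total = 0, 0
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        for _ in range(n_samples):
            out = model.generate(**inputs, max_new_tokens=25, do_sample=True,
                                 temperature=0.8, pad_token_id=tok.eos_token_id)
            cont = tok.decode(out[0][inputs["input_ids"].shape[1]:])
            hits += classify_sense(cont) == expected
            total += 1
    return hits / total

# e.g. counter_prior_eval(model, tok,
#     ["The wealthy investor sat down on the muddy bank to catch a fish"])
```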
5. The Failure Cases: Sticky Priors and Dual-Sense Degradation
However, the Sledgehammer revealed critical limitations.
5.1 Sticky Priors
While we successfully solved the geometric problem (separating the concepts), we damaged the probabilistic balance. We believe this resulted from training on naturally imbalanced sense frequencies—a limitation we expect to resolve by balancing the steering dataset.
In cases like the word Plant, the extreme repulsion force pushed the "Industrial/Factory" sense so far into a low-loss region that it became a gravitational black hole. When tested on ambiguous or weakly-contexted prompts, the model began defaulting to the technical definition, ignoring local context.
Example: "The teller left work early to walk along the bank and watch the ducks."
We successfully built a wall between the concepts, but we accidentally trained the model to always stand on one side of it.
5.2 Dual-Sense Degradation
We tested prompts requiring both senses simultaneously—the same word appearing twice with different meanings in one sentence.
Example: "The river bank had eroded so badly that the bank had to fund repairs."
Result: Baseline outperformed E10 4-0 on these dual-sense prompts.
However, we suspect this reflects prior flattening rather than damaged routing. Baseline has strong priors (~80% financial for "bank"), so when forced to pick one sense to continue with, it defaults correctly more often. E10's flattened priors (~50/50) leave it more uncertain at the branch point, causing it to commit to a sense that may not serve both instances.
The Sledgehammer may have created a more "agnostic" model that waits for stronger context—context that dual-sense sentences, by design, don't provide clearly.
This is further evidence that geometric separation alone is insufficient without calibrated priors.
6. The Gap: Geometry ≠ Probability
The Centroid Repulsion loss acts purely on geometry (cosine similarity). It doesn't know or care about frequency or probability.
If the loss landscape is easier to traverse by expanding the "Factory" cluster into empty space, the optimizer will do so. At α=10, gradients are dominated by the repulsion term. The model prioritizes "being distinct" over "being likely."
This resulted in the rare/technical sense (Factory, Code) occasionally cannibalizing the probability mass of the common sense (Nature, Exercise). We call this the Geometry-Probability Gap: the loss can guarantee that sense representations sit far apart without saying anything about how probability mass should be allocated between them.
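One way to see the gap directly is to probe the model's sense prior on a deliberately weak-context prompt and compare it before and after tuning. A minimal sketch follows; the prompt and the single probe word per sense are illustrative assumptions, not our actual probe set.

```python
import torch
import torch.nn.functional as F

def sense_prior(model, tok, prompt, probe_words):
    """Relative next-token probability mass assigned to one probe word per sense.

    A geometry-only loss can move representations apart while leaving this
    distribution badly skewed: the Geometry-Probability Gap in miniature.
    """
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]            # next-token logits
    probs = F.softmax(logits, dim=-1)
    mass = {}
    for sense, word in probe_words.items():
        token_id = tok(" " + word, add_special_tokens=False)["input_ids"][0]
        mass[sense] = probs[token_id].item()
    z = sum(mass.values())
    return {sense: p / z for sense, p in mass.items()}

# Illustrative usage on a weakly-contexted prompt:
# sense_prior(model, tok, "The plant was full of",
#             {"nature": "flowers", "factory": "machines"})
```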
We believe this is addressable by balancing sense frequencies in the steering dataset, which we test in Project Titan.
7. The Titan Hypothesis
We hypothesize that the "Sticky Prior" pathology is not an inevitable trade-off of high-α steering, but a result of training on naturally imbalanced data frequencies while applying extreme geometric pressure.
Project Titan proposes coupling the Sledgehammer (α=10) with massive synthetic datasets that are mechanically balanced (50/50 frequency for each sense).
The theory: Synthetic data provides the "ballast" required to keep probabilities level while the steering vector performs geometric surgery. If we feed the model equal examples of "Bank=River" and "Bank=Finance," the repulsion should drive them apart symmetrically, rather than allowing one to dominate.
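Mechanically, the balancing step is simple. Assuming the synthetic examples already exist as per-sense lists, a sketch of the construction looks like this (names and structure are illustrative, not the Titan pipeline itself):

```python
import random

def build_balanced_steering_set(examples_by_sense, n_per_sense, seed=0):
    """Return a steering dataset with equal counts per sense.

    examples_by_sense: e.g. {"bank/river": [...], "bank/finance": [...]}
    Each entry is a list of (text, sense_label) pairs; subsampling every
    sense to the same count lets the repulsion force act symmetrically.
    """
    rng = random.Random(seed)
    balanced = []
    for sense, examples in examples_by_sense.items():
        if len(examples) < n_per_sense:
            raise ValueError(f"not enough synthetic examples for {sense}")
        balanced.extend(rng.sample(examples, n_per_sense))
    rng.shuffle(balanced)
    return balanced
```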
We are currently generating 10k balanced examples to test this hypothesis. Results forthcoming.
Potential Applications
If the Geometry-Probability Gap can be closed (via Titan or similar), this approach could enable several downstream applications.
These remain speculative pending validation, though we expect that balancing sense frequencies in the training data will resolve the prior skew.
8. Summary and Open Questions
What We Showed
Weights are more plastic than activations. α=10 representation tuning didn't collapse perplexity the way α=10 activation steering would.
Saturation exists around α≈2. Pushing harder yields diminishing returns, not collapse. Useful design principle.
Geometric steering works but distorts priors. We successfully separated senses but created "sticky" defaults toward rare/technical meanings.
Prior flattening helps some cases, hurts others. Better on adversarial single-sense prompts; worse on dual-sense prompts requiring flexible interpretation.
Open Questions for the Community
Have others observed "geometry-probability" decoupling in model editing or unlearning tasks? We suspect this is a general phenomenon, not specific to polysemy.
Is the saturation ceiling architecture-dependent? Would a 7B or 70B model have more "room" in embedding space, allowing higher effective α?
Can inference-time activation steering be combined with weight-based tuning? Perhaps a small α at inference on top of a tuned model yields benefits neither approach achieves alone.
What's the right framework for thinking about "plasticity budgets" in small models? The model absorbed α=10 with minimal damage—how much more could it take, and what determines the limit?
Code and Data
Code and tuned model checkpoints available at: github.com/benwade/sledgehammer-polysemy
All experiments were conducted on a single RTX 4090 (24GB). Total training time across all conditions: ~5 hours.
Acknowledgments
The experimental design and analysis were developed through iterative dialogue with Claude (Anthropic) and Gemini (Google), which proved valuable for steelmanning hypotheses and identifying failure modes. Code and experiments were executed by the author. All errors remain my own.
Feedback welcome. If you've seen similar plasticity anomalies in your own work, we'd love to hear about it.
References
[1] Zou, A., et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv:2310.01405
[2] Turner, A., et al. (2023). "Activation Addition: Steering Language Models Without Optimization." arXiv:2308.10248
[3] Biderman, S., et al. (2023). "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." arXiv:2304.01373
[4] The Centroid Repulsion approach is inspired by contrastive learning objectives. See Chen, T., et al. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." arXiv:2002.05709