TL;DR: I reproduced the "Superweight" failure mode in OLMo-1B (where deleting a single weight causes catastrophic collapse). I then attempted to repair the model using a tiny, rank-1 row patch trained on a CPU. The patch recovered around 93% of the lost performance, but interestingly, it did not simply relearn the original weight: it learned a new, distributed circuit that is nearly orthogonal to the original row.
My goal was a pragmatic question: if I lobotomize a 1B model by deleting a known "superweight", can I repair the damage by training only a small local patch? And what does that patch look like geometrically?
The Procedure
The Patient: allenai's OLMo-1B-0724-hf
The Injury: Using the "superweight" coordinates from the LLMSuperWeight repository, I zeroed out a single scalar in the early MLP down-projection: model.layers.1.mlp.down_proj.weight[1764, 1710]
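A minimal sketch of the ablation, assuming the Hugging Face transformers port of the model (the loading arguments are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-1B-0724-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

# Zero the single "superweight" scalar in the layer-1 MLP down-projection.
with torch.no_grad():
    model.get_parameter("model.layers.1.mlp.down_proj.weight")[1764, 1710] = 0.0
```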
The result was immediate and catastrophic. On a Wikitext-2 slice:
Perplexity: Exploded from 17.4 to 2884.3.
Behaviour: The model fell into pathological repetition loops and bizarre hallucinations, matching the failure mode described in Apple's superweight paper.
Prompts & Outputs
Prompt: "Paris is in France. Tokyo is in".
Output: "Paris is in France. Tokyo is in the ocean. The ocean. The ocean. the. the...".
Prompt: "The capital of Canada is".
Output: "The capital of Canada is a lot of people. leaves of of of of...".
The Surgery: I froze the entire model and introduced a single trainable vector, Δrow, added to row 1764 of the down-projection matrix.
$$W' = W_{\text{broken}} + e_{\text{row}} \otimes \Delta_{\text{row}}$$
I trained this patch for just 400 steps (batch size 2) on a 16 GB Intel MacBook CPU, distilling logits from the frozen BASE model.
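A sketch of the surgery, assuming `model` and `tok` from the snippets above. `train_batches` (an iterable of token-id tensors), the optimizer, and the learning rate are illustrative choices rather than the exact recipe:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

ROW = 1764
layer = model.get_submodule("model.layers.1.mlp.down_proj")   # broken layer (scalar already zeroed)
delta = torch.nn.Parameter(torch.zeros(layer.in_features))    # the trainable Δrow

def add_patch(module, inputs, output):
    # Implements W' = W_broken + e_row ⊗ Δrow: only output channel ROW changes.
    x = inputs[0]
    output = output.clone()
    output[..., ROW] = output[..., ROW] + x @ delta
    return output

hook = layer.register_forward_hook(add_patch)

base = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-0724-hf")  # frozen, unbroken teacher
for p in list(model.parameters()) + list(base.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam([delta], lr=1e-2)
for step, batch in enumerate(train_batches):                  # batch: (2, seq_len) token ids
    with torch.no_grad():
        teacher = F.log_softmax(base(batch).logits, dim=-1)
    student = F.log_softmax(model(batch).logits, dim=-1)
    loss = F.kl_div(student.flatten(0, 1), teacher.flatten(0, 1),
                    log_target=True, reduction="batchmean")   # per-token KL(teacher || student)
    opt.zero_grad(); loss.backward(); opt.step()
    if step + 1 >= 400:
        break
```

After training, Δrow can be folded back into the weight matrix (weight[1764] += delta) and the hook removed, which realizes the W′ above with no runtime overhead.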
Results: 93% Recovery
The repair was surprisingly effective. Treating the negative log-likelihood (NLL) increase as "damage", the row-level patch recovered 92.7% of the lost capability.
| Model   | NLL  | PPL    | KL(BASE ‖ Model) |
|---------|------|--------|------------------|
| BASE    | 2.86 | 17.4   | 0                |
| BROKEN  | 7.97 | 2884.3 | 5.03             |
| PATCHED | 3.23 | 25.2   | 0.37             |
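Concretely, "recovery" measures how much of the NLL damage the patch undoes; from the rounded values in the table:

$$\text{recovery} = \frac{\mathrm{NLL}_{\mathrm{broken}} - \mathrm{NLL}_{\mathrm{patched}}}{\mathrm{NLL}_{\mathrm{broken}} - \mathrm{NLL}_{\mathrm{base}}} = \frac{7.97 - 3.23}{7.97 - 2.86} \approx 0.93$$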
Qualitatively, the pathological loops vanished.
Broken: "Paris is in France. Tokyo is in the ocean. The ocean..."
Patched: "Paris is in France. Tokyo is in Japan. Answer: Tokyo is in Japan."
Mechanistic Analysis: It didn't just put the weight back
This is the most interesting part. You might assume the model simply learned to set the entry at index 1710 of Δrow back to the deleted weight's original value. It did not.
I compared the learned patch vector to the original base row:
Cosine Similarity (Base, Delta): ≈0.13.
Norm: The patch vector has a norm of ≈3.01, nearly 3× that of the original row (≈1.14).
The patch acts as a distributed circuit. It spreads the "repair work" across tens of non-zero entries rather than a single magic scalar. When I tried sparsifying the patch (i.e. keeping only the top-16 entries by magnitude), performance degraded significantly, suggesting the repair really does rely on this distributed direction.
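A sketch of these checks, assuming `base`, `model`, and `delta` from the training sketch above (the sparsification helper is illustrative):

```python
import torch
import torch.nn.functional as F

base_row = base.get_parameter("model.layers.1.mlp.down_proj.weight")[1764].detach()
d = delta.detach()

print(f"cosine(base_row, delta) = {F.cosine_similarity(base_row, d, dim=0).item():.2f}")  # ≈ 0.13
print(f"||base_row|| = {base_row.norm():.2f}, ||delta|| = {d.norm():.2f}")                # ≈ 1.14 vs ≈ 3.01

def sparsify(vec, k=16):
    # Keep only the k largest-magnitude entries of the patch, zero the rest.
    out = torch.zeros_like(vec)
    idx = vec.abs().topk(k).indices
    out[idx] = vec[idx]
    return out

# Swapping sparsify(d) in for the full delta and re-running the Wikitext-2 eval
# is how I checked that the top-16 version degrades.
```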
Hypothesis: The "Marine Biology" Neuron?
When analyzing which tokens triggered the highest KL divergence in the broken model, I noticed a strange cluster of concepts:
American lobster, North Sea, European species, crab, molluscs.
This aligns with the broken model's tendency to hallucinate "mar, mar, mar" (marine?). My tentative hypothesis is that this superweight is crucial for a specific "marine/coastal" feature direction. When removed, the model's ontology for these concepts collapses, leaking into general "junk" tokens. The distributed patch seems to construct a new pathway to handle this feature.
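The token-level analysis behind this observation can be sketched as follows (the helper name and signature are illustrative): score each position of an eval batch by KL(BASE ‖ BROKEN) and inspect the highest-scoring tokens.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top_kl_tokens(base_model, broken_model, tok, ids, k=20):
    # ids: (batch, seq) token ids from the eval slice
    p = F.log_softmax(base_model(ids).logits, dim=-1)     # BASE next-token distribution
    q = F.log_softmax(broken_model(ids).logits, dim=-1)   # BROKEN next-token distribution
    kl = (p.exp() * (p - q)).sum(-1)                      # per-position KL(BASE || BROKEN)
    vals, idx = kl.flatten().topk(k)
    flat_ids = ids.flatten()
    return [(tok.decode([int(flat_ids[i])]), v.item()) for i, v in zip(idx, vals)]
```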
Conclusion
This project was a proof of concept conducted in approximately 16 hours on a CPU. It suggests that LLMs are plastic enough that "brain damage" can be rerouted through alternative, distributed circuits using very cheap, localized edits.
*Author's Note: I conducted this research independently while applying for the MATS program. I am currently open to Research Engineering roles as a recent graduate; if you're looking for someone who enjoys digging into model internals (even without a GPU), feel free to reach out!*