TL;DR: I reproduced the "Superweight" failure mode in OLMo-1B (where deleting a single weight causes catastrophic collapse). I then attempted to repair the model using a tiny, rank-1 row patch trained on a CPU. The patch recovered around 93% of the lost performance, but interestingly, it did not simply relearn the original weight: it learned a new, distributed circuit that is nearly orthogonal to the original row.
My goal was a pragmatic question: if I lobotomize a 1B model by deleting a known "superweight", can I repair the damage by training only a small local patch? And what does that patch look like geometrically?
The Procedure
The Patient: allenai's OLMo-1B-0724-hf
The Injury: Using the "superweight" coordinates from the LLMSuperWeight repository, I zeroed out a single scalar in the early MLP down-projection: model.layers.1.mlp.down_proj.weight[1764, 1710]
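A minimal sketch of the ablation, assuming the Hugging Face transformers port of the model (the loading arguments are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-1B-0724-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

# Zero the single "superweight" scalar in the layer-1 MLP down-projection.
with torch.no_grad():
    model.get_parameter("model.layers.1.mlp.down_proj.weight")[1764, 1710] = 0.0
```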
The result was immediate and catastrophic. On a Wikitext-2 slice:
Perplexity: Exploded from 17.4 to 2884.3.
Behaviour: The model fell into pathological repetition loops and bizarre hallucinations, matching the failure mode described in Apple's superweight paper.
Prompts & Outputs
Prompt: "Paris is in France. Tokyo is in".
Output: "Paris is in France. Tokyo is in the ocean. The ocean. The ocean. the. the...".
Prompt: "The capital of Canada is".
Output: "The capital of Canada is a lot of people. leaves of of of of...".
The Surgery: I froze the entire model and introduced a single trainable vector, Δrow, added to row 1764 of the down-projection matrix.
$$W' = W_{\text{broken}} + e_{\text{row}} \otimes \Delta_{\text{row}}$$
I trained this patch for just 400 steps (batch size 2) on a 16 GB Intel MacBook CPU, distilling logits from the frozen BASE model.
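A sketch of the surgery, assuming `model` and `tok` from the snippets above. `train_batches` (an iterable of token-id tensors), the optimizer, and the learning rate are illustrative choices rather than the exact recipe:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

ROW = 1764
layer = model.get_submodule("model.layers.1.mlp.down_proj")   # broken layer (scalar already zeroed)
delta = torch.nn.Parameter(torch.zeros(layer.in_features))    # the trainable Δrow

def add_patch(module, inputs, output):
    # Implements W' = W_broken + e_row ⊗ Δrow: only output channel ROW changes.
    x = inputs[0]
    output = output.clone()
    output[..., ROW] = output[..., ROW] + x @ delta
    return output

hook = layer.register_forward_hook(add_patch)

base = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1B-0724-hf")  # frozen, unbroken teacher
for p in list(model.parameters()) + list(base.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam([delta], lr=1e-2)
for step, batch in enumerate(train_batches):                  # batch: (2, seq_len) token ids
    with torch.no_grad():
        teacher = F.log_softmax(base(batch).logits, dim=-1)
    student = F.log_softmax(model(batch).logits, dim=-1)
    loss = F.kl_div(student.flatten(0, 1), teacher.flatten(0, 1),
                    log_target=True, reduction="batchmean")   # per-token KL(teacher || student)
    opt.zero_grad(); loss.backward(); opt.step()
    if step + 1 >= 400:
        break
```

After training, Δrow can be folded back into the weight matrix (weight[1764] += delta) and the hook removed, which realizes the W′ above with no runtime overhead.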
Results: 93% Recovery
The repair was surprisingly effective. Treating the negative log-likelihood (NLL) increase as "damage", the row-level patch recovered 92.7% of the lost capability.
| Model   | NLL  | PPL    | KL(BASE ‖ Model) |
|---------|------|--------|------------------|
| BASE    | 2.86 | 17.4   | 0                |
| BROKEN  | 7.97 | 2884.3 | 5.03             |
| PATCHED | 3.23 | 25.2   | 0.37             |
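Concretely, "recovery" measures how much of the NLL damage the patch undoes; from the rounded values in the table:

$$\text{recovery} = \frac{\mathrm{NLL}_{\mathrm{broken}} - \mathrm{NLL}_{\mathrm{patched}}}{\mathrm{NLL}_{\mathrm{broken}} - \mathrm{NLL}_{\mathrm{base}}} = \frac{7.97 - 3.23}{7.97 - 2.86} \approx 0.93$$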
Qualitatively, the pathological loops vanished.
Broken: "Paris is in France. Tokyo is in the ocean. The ocean..."
Patched: "Paris is in France. Tokyo is in Japan. Answer: Tokyo is in Japan."
Mechanistic Analysis: It didn't just put the weight back
This is the most interesting part. You might assume the model simply learned to set the entry at index 1710 of Δrow back to the deleted weight's original value. It did not.
I compared the learned patch vector to the original base row:
Cosine Similarity (Base, Delta): ≈0.13.
Norm: The patch vector has a norm of ≈3.01, nearly 3× that of the original row (≈1.14).
The patch acts as a distributed circuit. It spreads the "repair work" across tens of non-zero entries rather than a single magic scalar. When I tried sparsifying the patch (i.e. keeping only the top-16 entries by magnitude), performance degraded significantly, suggesting the repair really does rely on this distributed direction.
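A sketch of these checks, assuming `base`, `model`, and `delta` from the training sketch above (the sparsification helper is illustrative):

```python
import torch
import torch.nn.functional as F

base_row = base.get_parameter("model.layers.1.mlp.down_proj.weight")[1764].detach()
d = delta.detach()

print(f"cosine(base_row, delta) = {F.cosine_similarity(base_row, d, dim=0).item():.2f}")  # ≈ 0.13
print(f"||base_row|| = {base_row.norm():.2f}, ||delta|| = {d.norm():.2f}")                # ≈ 1.14 vs ≈ 3.01

def sparsify(vec, k=16):
    # Keep only the k largest-magnitude entries of the patch, zero the rest.
    out = torch.zeros_like(vec)
    idx = vec.abs().topk(k).indices
    out[idx] = vec[idx]
    return out

# Swapping sparsify(d) in for the full delta and re-running the Wikitext-2 eval
# is how I checked that the top-16 version degrades.
```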
Hypothesis: The "Marine Biology" Neuron?
When analyzing which tokens triggered the highest KL divergence in the broken model, I noticed a strange cluster of concepts:
American lobster, North Sea, European species, crab, molluscs.
This aligns with the broken model's tendency to hallucinate "mar, mar, mar" (marine?). My tentative hypothesis is that this superweight is crucial for a specific "marine/coastal" feature direction. When removed, the model's ontology for these concepts collapses, leaking into general "junk" tokens. The distributed patch seems to construct a new pathway to handle this feature.
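The token-level analysis behind this observation can be sketched as follows (the helper name and signature are illustrative): score each position of an eval batch by KL(BASE ‖ BROKEN) and inspect the highest-scoring tokens.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def top_kl_tokens(base_model, broken_model, tok, ids, k=20):
    # ids: (batch, seq) token ids from the eval slice
    p = F.log_softmax(base_model(ids).logits, dim=-1)     # BASE next-token distribution
    q = F.log_softmax(broken_model(ids).logits, dim=-1)   # BROKEN next-token distribution
    kl = (p.exp() * (p - q)).sum(-1)                      # per-position KL(BASE || BROKEN)
    vals, idx = kl.flatten().topk(k)
    flat_ids = ids.flatten()
    return [(tok.decode([int(flat_ids[i])]), v.item()) for i, v in zip(idx, vals)]
```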
Conclusion
This project was a proof of concept conducted in approximately 16 hours on a CPU. It suggests that LLMs are plastic enough that "brain damage" can be rerouted through alternative, distributed circuits using very cheap, localized edits.
*Author's Note: I conducted this research independently while applying for the MATS program. I am currently open to Research Engineering roles as a recent graduate; if you're looking for someone who enjoys digging into model internals (even without a GPU), feel free to reach out!*