Superweight Damage Repair in OLMo-1B Using a Single-Row Patch (CPU-only Experiment)
Motivation
While lurking on LessWrong, I read Apple's "The Super Weight in Large Language Models" paper and OpenAI's "Weight-sparse transformers have interpretable circuits" paper. My question was simple: is it possible to bridge the core ideas of the two papers to explore a new direction, namely: If I destroy...
Jenna, thank you for commenting and sharing that paper. I read through it, and it is closely related to my solo work (albeit pointing in the opposite direction). Indeed, if the patch does not use the old information, then Anthropic's SGTM safety measures might be less permanent than they appear. My experiment suggests the patch forms a new, orthogonal/distributed circuit rather than restoring the old weight. I take it that what you're hinting at is a new red-team attack vector: instead of healing the ablated damage (which SGTM defends against), an attacker could use this method to graft on a cheap, sparse patch that bypasses the damage entirely.
As for your latter question, I believe the default case...