Jenna, thank you for commenting and sharing that paper. I read through it, and it is quite closely related to my solo work (albeit in the opposite direction). Indeed, if the old information is not utilized by the patch, then Anthropic's SGTM safety measures might be less permanent than they might think. My experiment suggests the new circuit is orthogonal/distributed rather than a restoration of the old weights. I suspect what you're hinting at is a new red-team attack vector: instead of healing the ablated damage (which SGTM defends against), an attacker could use this method to graft a cheap, sparse patch that bypasses the damage entirely.
As for your latter question, I believe the default case would be total model collapse once the model runs out of lazy/spare neurons to repurpose. Of course, I could be wrong, and a hyper-robust model could emerge instead, as your intuition suggests. I also really like your neuroplasticity analogy! It is very fitting, given that the model does not fix dead tissue; it reroutes the function to new areas of the brain.
P.S. Unfortunately, I am without a GPU cluster, so perhaps you can carry forth the torch :)