Could this be avoided by simply not training on these examples in the first place? I imagine GPT-4 or similar models would be good at classifying text which has waluigis in it which could then either be removed from the training data or "fixed" i.e. rewritten by GPT-4, and then training a new model from scratch on the new "cleaner" training set?
Sent you a DM 😊