The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break
Abstract

Standard literature in mechanistic interpretability, specifically Representation Engineering (RepE), suggests a strict "perplexity ceiling" for activation steering. We attempted to identify the breaking point of semantic weights in a Pythia-1.4B model by applying a "Sledgehammer" penalty (α=10.0) using a custom Centroid Repulsion loss. Contrary to expectations of catastrophic forgetting, the model...