Zero-Shot Alignment: Harm Detection via Incongruent Attention Mechanisms
Intro: I built a small adapter (~4.7M parameters) that sits on top of a frozen Phi-2 model and forces it through two mathematically opposing attention mechanisms. The initial result was that it generalizes beyond its sparse training data, sometimes into surprising domains. After finding out that precision significantly impacts scores and metrics...
Apr 8