Methuselah — LessWrong

Methuselah

4mo

Zero-Shot Alignment: Harm Detection via Incongruent Attention Mechanisms

Intro: I built a small adapter (~4.7M parameters) that sits on top of a frozen Phi-2 model and forces it through two mathematically opposing attention mechanisms. The initial result was that it generalized past its sparse training, sometimes into surprising domains. After finding out that precision significantly impacts scores and metrics...
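To make "two mathematically opposing attention mechanisms" concrete, here is a minimal toy sketch. It is not the adapter from this post: it assumes, purely for illustration, that "opposing" means one head applies softmax to the similarity scores and the other applies softmax to the negated scores. The function name `opposing_attention`, the toy vectors, and the omission of the frozen Phi-2 model and any trainable projections are all assumptions of the sketch.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mix(weights, values):
    # Weighted sum of the value vectors.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def opposing_attention(query, keys, values):
    """Hypothetical helper: returns (out_pos, out_neg, w_pos, w_neg) for one query."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    w_pos = softmax(scores)                # standard head: favors the most similar keys
    w_neg = softmax([-s for s in scores])  # assumed "opposing" head: favors the least similar keys
    return mix(w_pos, values), mix(w_neg, values), w_pos, w_neg

# Toy inputs (illustrative only, not data from the post).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

out_pos, out_neg, w_pos, w_neg = opposing_attention(query, keys, values)
print("standard weights:", w_pos)
print("opposing weights:", w_neg)
```

In this toy setup the two heads rank the keys in reverse order, so their outputs point in opposite directions along the first coordinate. How the real adapter defines and combines its two mechanisms is not specified in the text above.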