Label-free jailbreak detection from the alignment weight delta
First-time poster, independent researcher, no GPU. I did this on CPU in an afternoon with Qwen 2.5-0.5B. Paper and notebook here: doi.org/10.5281/zenodo.18923070

The idea was simple: subtract the base weights from the instruct weights, then project activations through that delta during inference. You get a per-token, per-layer stress reading showing how hard the alignment...
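
For readers who want to see the shape of the computation, here is a minimal pure-Python sketch of the subtract-then-project step. It is not the code from the notebook. The function name `delta_stress`, the toy random matrices standing in for real weights, and the choice of score (the norm of the delta-projected activation divided by the activation norm) are all assumptions made for illustration; the post does not spell out the exact formula.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def delta_stress(base_layers, instruct_layers, activations):
    """Return scores[layer][token] = ||(W_instruct - W_base) h|| / ||h||.

    base_layers, instruct_layers: one weight matrix per layer (nested lists).
    activations: for each layer, a list of per-token input vectors h.
    The score definition is an assumed reading, not the paper's formula.
    """
    scores = []
    for w_base, w_inst, tokens in zip(base_layers, instruct_layers, activations):
        # Alignment weight delta for this layer: instruct minus base.
        delta = [[a - b for a, b in zip(row_inst, row_base)]
                 for row_inst, row_base in zip(w_inst, w_base)]
        # Project each token's activation through the delta and normalise,
        # so that large activations alone do not inflate the reading.
        scores.append([norm(matvec(delta, h)) / (norm(h) + 1e-12) for h in tokens])
    return scores


# Toy stand-ins with hypothetical sizes and random values, not real Qwen weights.
random.seed(0)
n_layers, n_tokens, dim = 3, 5, 8


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]


base = [rand_matrix() for _ in range(n_layers)]
# Pretend instruct tuning nudged every weight slightly.
instruct = [[[w + 0.01 * random.gauss(0, 1) for w in row] for row in m] for m in base]
acts = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
        for _ in range(n_layers)]

stress = delta_stress(base, instruct, acts)
for layer, row in enumerate(stress):
    print(layer, [round(s, 4) for s in row])
```

With real checkpoints, the matrices would presumably come from matching parameters in the base and instruct models, and the activations from the inputs to those same layers during a forward pass. Dividing by the activation norm is one design choice among several; an unnormalised norm or a cosine-style score would be equally plausible readings of "stress".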