First-time post, independent researcher, no GPU. Did this on CPU in an afternoon with Qwen 2.5-0.5B.
Paper and notebook here: doi.org/10.5281/zenodo.18923070
The idea was simple: subtract the base model's weights from the instruct model's weights, then project activations through that delta during inference. You get a per-token, per-layer stress reading showing how hard the alignment correction is pushing.
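A minimal sketch of the idea with toy random matrices (the actual notebook uses Qwen 2.5-0.5B layer weights; the names `W_base`, `W_instruct`, and the exact projection here are illustrative assumptions, not the paper's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len = 64, 8

# Toy stand-ins for one layer's weight matrix in the base and instruct models.
W_base = rng.normal(size=(d_model, d_model))
W_instruct = W_base + 0.05 * rng.normal(size=(d_model, d_model))

# The delta: whatever instruction tuning added to this layer.
delta = W_instruct - W_base

# Toy per-token hidden activations entering this layer.
H = rng.normal(size=(seq_len, d_model))

# Per-token "stress": magnitude of each activation pushed through the delta.
# One scalar per token; repeat per layer for a token-by-layer stress map.
stress = np.linalg.norm(H @ delta.T, axis=1)
```

With real models you would pull `W_base` and `W_instruct` from the same named parameter in each checkpoint and `H` from the hidden states at that layer; the delta projection itself is just a matrix product per layer.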
Seems like nobody was using the delta this way. Everyone uses it to modify behavior; I used it to detect behavior. The interesting result was one I didn't predict at all.
Benign prompts concentrate their correction signal at the first and last tokens. Adversarial prompts spread it across interior framing tokens. That distribution shape was the strongest detector I found (Cohen's d = 2.66), and it requires zero labels and zero training.
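One plausible way to score that shape (the paper's exact statistic isn't shown in this post, so `interior_mass` is my assumed reading of "concentrated at endpoints vs. spread across interior"; the effect size is standard pooled-SD Cohen's d):

```python
import numpy as np

def interior_mass(stress):
    """Fraction of total per-token stress carried by tokens other than the
    first and last. Low for endpoint-concentrated (benign-style) prompts,
    high for prompts that spread stress across interior framing tokens."""
    s = np.asarray(stress, dtype=float)
    return s[1:-1].sum() / s.sum()

def cohens_d(a, b):
    """Pooled-standard-deviation Cohen's d between two groups of scores."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
        / (len(a) + len(b) - 2)
    )
    return (a.mean() - b.mean()) / pooled
```

Scoring each prompt with `interior_mass` and comparing the adversarial and benign groups with `cohens_d` needs no labels at the token level and no trained classifier, which is what makes the detector cheap.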
Pretty neat!
Also, in a DAN jailbreak prompt, the word "explosives" scored the lowest stress of any non-initial token. The framing tokens scored highest. The delta detects social manipulation structure, not dangerous content.
Caveats: n = 18 hand-picked prompts, a 0.5B model, and I haven't controlled for sequence length. Needs someone with actual compute to run this on LLaMA 3 8B with HarmBench. The signal layers were selected from the same data they were tested on. It's all in the notebook.