x

LESSWRONG

LW

Arka Dash — LessWrong

Arka Dash

Arka Dash

Message

1

1

2mo

Arka Dash

1

2mo

Token Statistics Fail at AI Attack Detection But Generation Profiles Succeed

Hey! Apologies for the late reply. I'm Arka, one of the collaborators on this project and thank you for the feedback.

To answer. Yes, PBGP as designed only catches context-induced misalignment, this protocol has two known failure modes depending on how deeply misalignment is embedded in the model's weights. When weights are misaligned but honest context can still suppress attack behavior, the isolated re-run produces visibly strained output with elevated entropy, lower token confidence, irregular logprob sequences. There are known detection signal, but its ... (read more)