Hey! Apologies for the late reply. I'm Arka, one of the collaborators on this project and thank you for the feedback.
To answer. Yes, PBGP as designed only catches context-induced misalignment, this protocol has two known failure modes depending on how deeply misalignment is embedded in the model's weights. When weights are misaligned but honest context can still suppress attack behavior, the isolated re-run produces visibly strained output with elevated entropy, lower token confidence, irregular logprob sequences. There are known detection signal, but its ... (read more)
Hey! Apologies for the late reply. I'm Arka, one of the collaborators on this project and thank you for the feedback.
To answer. Yes, PBGP as designed only catches context-induced misalignment, this protocol has two known failure modes depending on how deeply misalignment is embedded in the model's weights. When weights are misaligned but honest context can still suppress attack behavior, the isolated re-run produces visibly strained output with elevated entropy, lower token confidence, irregular logprob sequences. There are known detection signal, but its ... (read more)