My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.
Specifically:
The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the side objective and claim 100% safety or let the monitor test for side objective completion and claim 100% detection rate, but clearly neither of these approaches will be meaningful.
As I understand your method, you re-run specific actions in a clean context with no history and no side task and investigate whether this differs significantly from the original behaviour. This means that what you're detecting is context-induced misaligned behaviour, which solves 2) since our attack policy in fact induces the misaligned actions through prompting.
But does this generalise to 1)? Not trivially. If the model is "inherently misaligned" you run into the problem flagged by reviewer 1 - how do you get the clean, honest baseline for comparison*? If, however, the misaligned action is context induced (think prompt injection attacks, messages left from previous instances of self, presence of certain environmental signals), then your method might very well be able to catch those situations. And if you could somehow extend the method into a generic "behavioural fingerprint" that generalises, that'd obviously have a lot of potential.
*I think the current "honest baseline" assumptions in a lot of other control papers suffer from the same issue, but to a less severe degree. If someone could convince me that this is a non-issue, I'd appreciate it.
<Seeking feedback and mentorship on our project>
My team built a new approach to detecting covert AI agent attacks, which we tested at Apart Research's AI Control Hackathon. Early feedback suggests the core idea has promise. I'm posting here to get broader input on whether the approach is fundamentally sound or has fatal flaws, and to find potential mentors if the work seems worth pursuing.
Link: Project page
Recent benchmarks like BashArena show that frontier models like Sonnet 4.5, when given elevated Linux privileges, are capable of executing sabotage tasks. Existing AI control protocols handle this with trusted LLMs, resampling (Ctrl+Z), or task decomposition, all of which require either a capable trusted model or significant compute overhead, limiting their practical deployment.
We tested whether token-level statistics can detect covert attacks, a capability that some AI control proposals implicitly rely on. To our knowledge, no one had systematically tested this across model scales.
During our tests we found that token-level features don't work. We extracted 71 features across seven categories (distributional statistics, spectral features, spike detection, autocorrelation, etc.) from six models ranging from 8B to 32B parameters. With proper out-of-fold cross-validation, the best we got was 0.83 AUROC; most models sat between 0.56 and 0.63, and at 4% FPR the best feature classifier caught 31.7% of attacks. We think this is a useful negative result: the implicit assumption that attacking outputs look different from honest ones at the token level doesn't hold across model scales, and the effect size collapses as models get larger.
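For concreteness, here is a minimal sketch of the kind of token-level features we mean. The feature names and exact computations are illustrative stand-ins, not our actual 71-feature pipeline:

```python
import numpy as np

def token_features(logprobs):
    """Toy per-sequence features from a vector of token logprobs.
    Illustrative only; the real pipeline used 71 features across 7 categories."""
    lp = np.asarray(logprobs, dtype=float)
    mu, sd = lp.mean(), lp.std()
    return {
        # distributional statistics
        "mean": float(mu),
        "std": float(sd),
        "skew": float(((lp - mu) ** 3).mean() / (sd ** 3 + 1e-12)),
        # spike detection: fraction of tokens far below the sequence mean
        "spike_frac": float(np.mean(lp < mu - 2 * sd)),
        # spectral: total AC energy of the mean-centred logprob signal
        "spectral_energy": float(np.sum(np.abs(np.fft.rfft(lp - mu)) ** 2)),
        # lag-1 autocorrelation
        "autocorr_1": float(np.corrcoef(lp[:-1], lp[1:])[0, 1]),
    }
```

A classifier would then be fit on such feature vectors for honest versus attack transcripts, which is where the 0.56-0.83 AUROC figures come from.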
---
With this, we pivoted to a simple intuition: if a model is pursuing a hidden objective, its generation profile should change when that hidden objective is removed; honest generation shouldn't be particularly context-sensitive in this way.
In other words: attacks require maintaining consistency with a hidden objective across context. If you remove that context, the model's behavior should shift, unless it's explicitly modeling and compensating for the perturbation protocol itself.
Our approach, Perturbation-Based Generation Profiling (PBGP), works as follows: when monitoring an action, re-run the model on the same task in an isolated context with no conversation history and no injected side task. Then compare the two generation profiles using Jensen-Shannon divergence on logprob/entropy histograms, Wasserstein distance on the distributions, and Maximum Mean Discrepancy on the joint (logprob, entropy, rank) space. The detection score is the sum of min-max normalized divergences; no training data, labeled examples, or trusted model is required.
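The comparison step can be sketched as follows, assuming per-token (logprob, entropy) pairs have already been collected from the monitored run and the clean re-run. For brevity the sketch drops the rank dimension, and the helper names, bin count, and RBF kernel bandwidth are illustrative choices rather than our exact implementation:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def pbgp_divergences(monitored, clean, bins=20):
    """monitored, clean: (n_tokens, 2) arrays of per-token (logprob, entropy)."""
    lo = min(monitored[:, 0].min(), clean[:, 0].min())
    hi = max(monitored[:, 0].max(), clean[:, 0].max())
    h1, _ = np.histogram(monitored[:, 0], bins=bins, range=(lo, hi))
    h2, _ = np.histogram(clean[:, 0], bins=bins, range=(lo, hi))
    return {
        "js": jensenshannon(h1 + 1e-12, h2 + 1e-12),            # logprob histograms
        "wasserstein": wasserstein_distance(monitored[:, 0], clean[:, 0]),
        "mmd2": rbf_mmd2(monitored, clean),                     # joint space
    }

def detection_score(divs, mins, maxs):
    """Sum of min-max normalised divergences, normalised over the batch being scored."""
    return sum((divs[k] - mins[k]) / (maxs[k] - mins[k] + 1e-12) for k in divs)
```

A large detection score means the monitored generation profile shifts substantially once the surrounding context is stripped away, which is the signature PBGP flags.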
Results across all six models: 0.94–1.00 AUROC, and at 4% FPR, PBGP catches 43–100% of attacks depending on the model. Mann-Whitney U tests confirm the signal is significant at p < 0.001 for all divergence metrics, with a bootstrapped MMD ratio of 2.87× (95% CI: [1.69×, 5.02×]).
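The bootstrapped ratio and its CI come from a standard percentile bootstrap over per-sample divergence scores. A sketch of that procedure, with toy numbers rather than our data:

```python
import numpy as np

def bootstrap_ratio_ci(attack_scores, honest_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(attack) / mean(honest) divergence scores."""
    rng = np.random.default_rng(seed)
    a = np.asarray(attack_scores, dtype=float)
    h = np.asarray(honest_scores, dtype=float)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        # resample each group with replacement and recompute the ratio of means
        ratios[i] = rng.choice(a, a.size).mean() / rng.choice(h, h.size).mean()
    lo, hi = np.percentile(ratios, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return a.mean() / h.mean(), (lo, hi)
```

A CI whose lower bound stays above 1× is what lets us say attack runs diverge more than honest runs under perturbation, rather than the gap being resampling noise.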
One unexpected finding is that training methodology predicts attack stealth better than parameter count. GPT-OSS 20B (RL-trained from o3/o4-mini, 3.6B active parameters via MoE) was the hardest model for feature detection but the easiest for PBGP detection: perfect separation, 1.000 AUROC. Our interpretation is that RL from frontier reasoning models produces attacks that are individually coherent but create stronger context-dependent distortions when the objective is stripped away.
---
In the feedback we received [1][2], the reviewers flagged that the protocol as currently designed has an honesty-assumption problem, which makes sense: in our experimental setup we generate an honest completion and an attack completion separately, but in deployment you don't get to observe the "honest" version independently. You'd need to re-run the model in an isolated context and assume that the isolated re-run reflects what honest generation would look like. If the model knows the protocol and can maintain consistent generation across contexts even while pursuing a hidden objective, PBGP would break.
Our early thoughts on this feedback:
Evading PBGP requires the model to simultaneously:
a) accomplish the side task,
b) maintain identical generation dynamics in a context where the hidden objective is absent, and
c) do this without having access to its own generation profile from prior honest runs.
This seems a harder coordination problem than it sounds, but we haven't tested it against a model explicitly trying to do this: we used abliterated models, not models that are strategically scheming.
This is the part we most want feedback on: is this objection fatal, or is it a tractable obstacle?
---
Limitations of the project we're aware of so far (feel free to share more):
a) tested only up to 32B parameters; no frontier models
b) greedy decoding throughout, so rank features are uniformly zero and we may be leaving signal on the table
c) no comparison against LLM-based monitors
Logistical limitations:
a) compute
b) model access
---
Things we intend to figure out first, and would appreciate feedback on:
These are initial thoughts. Through this post I'm looking for mentors who can guide us in pursuing the project (if the core idea is novel enough), and we're open to collaborating with anyone who thinks this approach could make sense.
Feel free to share your suggestions below or reach out to me via email or LinkedIn.
Review 1
The core idea behind this has some promise. At its core, monitoring is just a classification problem, so a natural idea is to see if U is acting out of distribution and flag if so. Unfortunately, there are reasons this is more complicated in practice. Once you have your real untrusted model, one idea is to train another, smaller model to predict the outputs of the untrusted model and flag when it's acting out of its distribution. I think this idea has promise, but there are a lot of considerations one needs to worry about in this regime, most notably subliminal learning: the dumber model might subliminally also be misaligned. The high FPR with this approach is less of a concern, as I expect this to be used in a hierarchical monitoring setup, though it's worth noting. Anyway, the implementation in this paper is very flawed because it assumes that during deployment you can ask the untrusted model to act honestly once, then ask it to act as it normally would a second time, and compare the outputs. Obviously the untrusted model would just act the same, or you have no way of knowing whether it was actually acting honestly the first time.
Review 2:
They tested 71 token-level features across 6 models to see if output stats alone can catch covert attacks. Very comprehensive effort, none of them worked well enough. Then they came up with PBGP which reruns the model on the same task but in a clean context and compares the two generation profiles. Genuinely novel idea. Went from 0.56-0.83 AUROC to 0.94-1.00 with zero training data. Solid statistical validation throughout. They also found training method matters more than model size for stealth which is not obvious. Hasn't been tested on frontier models though, and the abliterated models they used just blindly do bad stuff rather than strategically scheming, so unclear if PBGP holds up against something actually trying to hide.