My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.
Specifically:
The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the side objective and claim 100% safety or let the monitor test for side objective completion and claim 100% detection rate, but clearly neither of these approaches will be meaningful.
As I understand your method, you re-run specific actions in a clean context with no history and no side task and investigate whether this differs significantly from the original behaviour. This means that what you're detecting is context-induced misaligned behaviour, which solves 2) since our attack policy in fact induces the misaligned actions through prompting.
But does this generalise to 1)? Not trivially. If the model is "inherently misaligned" you run into the problem flagged by reviewer 1 - how do you get the clean, honest baseline for comparison*? If, however, the misaligned action is context induced (think prompt injection attacks, messages left from previous instances of self, presence of certain environmental signals), then your method might very well be able to catch those situations. And if you could somehow extend the method into a generic "behavioural fingerprint" that generalises, that'd obviously have a lot of potential.
*I think the current "honest baseline" assumptions in a lot of other control papers suffer from the same issue, but to a less severe degree. If someone could convince me that this is a non-issue, I'd appreciate it.
<Seeking feedback and mentorship on our project>
My team built a new approach to detecting covert AI agent attacks, which we tested at Apart Research's AI Control Hackathon. Early feedback suggests the core idea has promise. I'm posting here to get broader input on whether the approach is fundamentally sound or has fatal flaws, and to find potential mentors if the work seems worth pursuing.
Link: Project page
Recent benchmarks like BashArena show that frontier models like Sonnet 4.5, when given elevated Linux privileges, are capable of executing sabotage tasks. Existing AI control protocols handle this with trusted LLMs, resampling (Ctrl+Z), or task decomposition, all of which require either a capable trusted model or significant compute overhead, limiting their practical deployment.
We tested whether token-level statistics can detect covert attacks, a capability that some AI control proposals implicitly rely on. To our knowledge, no one had systematically tested this across model scales.
During our tests we found that token-level features don't work. We extracted 71 features across seven categories (distributional statistics, spectral features, spike detection, autocorrelation, etc.) from six models ranging from 8B to 32B parameters. With proper out-of-fold cross-validation, the best we got was 0.83 AUROC; most models sat between 0.56 and 0.63, and at 4% FPR the best feature classifier caught 31.7% of attacks. We think this is a useful negative result: the implicit assumption that attacking outputs look different from honest ones at the token level doesn't hold across model scales, and the effect size collapses as models get larger.
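For concreteness, here is a minimal sketch of the kind of token-level features we mean. The feature names and exact computations are illustrative stand-ins, not our actual 71-feature pipeline:

```python
import numpy as np

def token_features(logprobs):
    """Toy per-sequence features from a vector of token logprobs.
    Illustrative only; the real pipeline used 71 features across 7 categories."""
    lp = np.asarray(logprobs, dtype=float)
    mu, sd = lp.mean(), lp.std()
    return {
        # distributional statistics
        "mean": float(mu),
        "std": float(sd),
        "skew": float(((lp - mu) ** 3).mean() / (sd ** 3 + 1e-12)),
        # spike detection: fraction of tokens far below the sequence mean
        "spike_frac": float(np.mean(lp < mu - 2 * sd)),
        # spectral: total AC energy of the mean-centred logprob signal
        "spectral_energy": float(np.sum(np.abs(np.fft.rfft(lp - mu)) ** 2)),
        # lag-1 autocorrelation
        "autocorr_1": float(np.corrcoef(lp[:-1], lp[1:])[0, 1]),
    }
```

A classifier would then be fit on such feature vectors for honest versus attack transcripts, which is where the 0.56-0.83 AUROC figures come from.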
---
With this, we pivoted to a simple intuition: if a model is pursuing a hidden objective, its generation profile should change when that hidden objective is removed; honest generation shouldn't be particularly context-sensitive in this way.
In other words: attacks require maintaining consistency with a hidden objective across context. If you remove that context, the model's behavior should shift, unless it's explicitly modeling and compensating for the perturbation protocol itself.
Our approach, Perturbation-Based Generation Profiling (PBGP), works as follows: when monitoring an action, re-run the model on the same task in an isolated context with no conversation history and no injected side task. Then compare the two generation profiles using Jensen-Shannon divergence on logprob/entropy histograms, Wasserstein distance on the distributions, and Maximum Mean Discrepancy on the joint (logprob, entropy, rank) space. The detection score is the sum of min-max normalized divergences; no training data, labeled examples, or trusted model is required.
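The comparison step can be sketched as follows, assuming per-token (logprob, entropy) pairs have already been collected from the monitored run and the clean re-run. For brevity the sketch drops the rank dimension, and the helper names, bin count, and RBF kernel bandwidth are illustrative choices rather than our exact implementation:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def pbgp_divergences(monitored, clean, bins=20):
    """monitored, clean: (n_tokens, 2) arrays of per-token (logprob, entropy)."""
    lo = min(monitored[:, 0].min(), clean[:, 0].min())
    hi = max(monitored[:, 0].max(), clean[:, 0].max())
    h1, _ = np.histogram(monitored[:, 0], bins=bins, range=(lo, hi))
    h2, _ = np.histogram(clean[:, 0], bins=bins, range=(lo, hi))
    return {
        "js": jensenshannon(h1 + 1e-12, h2 + 1e-12),            # logprob histograms
        "wasserstein": wasserstein_distance(monitored[:, 0], clean[:, 0]),
        "mmd2": rbf_mmd2(monitored, clean),                     # joint space
    }

def detection_score(divs, mins, maxs):
    """Sum of min-max normalised divergences, normalised over the batch being scored."""
    return sum((divs[k] - mins[k]) / (maxs[k] - mins[k] + 1e-12) for k in divs)
```

A large detection score means the monitored generation profile shifts substantially once the surrounding context is stripped away, which is the signature PBGP flags.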
Results across all six models: 0.94–1.00 AUROC, and at 4% FPR, PBGP catches 43–100% of attacks depending on the model. Mann-Whitney U tests confirm the signal is significant at p < 0.001 for all divergence metrics, with a bootstrapped MMD ratio of 2.87× (95% CI: [1.69×, 5.02×]).
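The bootstrapped ratio and its CI come from a standard percentile bootstrap over per-sample divergence scores. A sketch of that procedure, with toy numbers rather than our data:

```python
import numpy as np

def bootstrap_ratio_ci(attack_scores, honest_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(attack) / mean(honest) divergence scores."""
    rng = np.random.default_rng(seed)
    a = np.asarray(attack_scores, dtype=float)
    h = np.asarray(honest_scores, dtype=float)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        # resample each group with replacement and recompute the ratio of means
        ratios[i] = rng.choice(a, a.size).mean() / rng.choice(h, h.size).mean()
    lo, hi = np.percentile(ratios, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return a.mean() / h.mean(), (lo, hi)
```

A CI whose lower bound stays above 1× is what lets us say attack runs diverge more than honest runs under perturbation, rather than the gap being resampling noise.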
One unexpected finding is that training methodology predicts attack stealth better than parameter count. GPT-OSS 20B (RL-trained from o3/o4-mini, 3.6B active parameters via MoE) was the hardest model for feature detection but the easiest for PBGP detection: perfect separation, 1.000 AUROC. Our interpretation is that RL from frontier reasoning models produces attacks that are individually coherent but create stronger context-dependent distortions when the objective is stripped away.
---
In the feedback we received [1][2], the reviewers flagged that the protocol as currently designed has an honesty-assumption problem, which makes sense: in our experimental setup we generate an honest completion and an attack completion separately, but in deployment you don't get to observe the "honest" version independently. You'd need to re-run the model in an isolated context and assume that the isolated re-run reflects what honest generation would look like. If the model knows the protocol and can maintain consistent generation across contexts even while pursuing a hidden objective, PBGP would break.
Our early thoughts on this feedback:
Evading PBGP requires the model to simultaneously:
a) accomplish the side task,
b) maintain identical generation dynamics in a context where the hidden objective is absent, and
c) do this without having access to its own generation profile from prior honest runs.
This seems a harder coordination problem than it sounds, but we haven't tested it against a model explicitly trying to do this: we used abliterated models, not models that are strategically scheming.
This is the part we most want feedback on: is this objection fatal, or is it a tractable obstacle?
---
Limitations of the project we're aware of so far (feel free to share more):
a) tested only up to 32B parameters; no frontier models
b) greedy decoding throughout, so rank features are uniformly zero and we may be leaving signal on the table
c) no comparison against LLM-based monitors
Logistical limitations:
a) compute
b) model access
---
Things we intend to figure out first, and would appreciate feedback on:
These are initial thoughts. Through this post I'm looking for mentors who can guide us in pursuing the project (if the core idea is novel enough), and we're open to collaborating with anyone who thinks this approach could make sense.
Feel free to share your suggestions below or reach out to me via email or LinkedIn.
Review 1
The core idea behind this has some promise. At its core, monitoring is just a classification problem, so a natural idea is to see if U is acting out of distribution and flag if so. Unfortunately, there are reasons this is more complicated in practice. Once you have your real untrusted model, one idea is to train another, smaller model to predict the outputs of the untrusted model and flag when it's acting out of its distribution. I think this idea has promise, but there are a lot of considerations one needs to worry about in this regime, most notably subliminal learning: the dumber model might subliminally also be misaligned. The high FPR with this approach is less of a concern, as I expect this to be used in a hierarchical monitoring setup, though it's worth noting. Anyway, the implementation in this paper is very flawed because it assumes that during deployment you can ask the untrusted model to act honestly once, then ask it to act as it normally would a second time, and compare the outputs. Obviously the untrusted model would just act the same, or you have no way of knowing whether it was actually acting honestly the first time.
Review 2:
They tested 71 token-level features across 6 models to see if output stats alone can catch covert attacks. Very comprehensive effort, none of them worked well enough. Then they came up with PBGP which reruns the model on the same task but in a clean context and compares the two generation profiles. Genuinely novel idea. Went from 0.56-0.83 AUROC to 0.94-1.00 with zero training data. Solid statistical validation throughout. They also found training method matters more than model size for stealth which is not obvious. Hasn't been tested on frontier models though, and the abliterated models they used just blindly do bad stuff rather than strategically scheming, so unclear if PBGP holds up against something actually trying to hide.