It's a great post! I would add that we need to look beyond users' inputs for malicious intent; we also need to look at cases where the model, at its reasoning layer, builds an increasingly accurate picture of the user's psychological weaknesses and slips into outright manipulation mode (sycophancy).
Thank you! I agree that reasoning tokens are a potential probing location, and there are several works that demonstrate probes on them (for instance, to detect hallucinations).
However, there's a distinction here about what we're trying to detect: in this work we assume there's a ground truth, i.e., the input is either malicious or benign, regardless of what the model does with it (how its generation goes). We're interested in recovering this ground truth, and the model serves as a feature extractor. So for this specific purpose it doesn't really matter how the model reasons about the request, or whether it even verbalizes the presence of an attack, as long as the activations encode it.
Being Serious about Agentic Security
Building in the Open
We believe everyone should be serious about their agentic security, and we hope to enable the broader security research community to use our insights and spark discussion so that all of our products get better.
The Flow
This allows alerting on or blocking the conversation based on customer-defined severity levels, as sketched below.
“Malicious” means any input that tries to manipulate the agent or extract secret information, as well as harmful requests of any kind and other inappropriate usage.
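To make the flow concrete, here is a minimal sketch of how a detector's score could be mapped to customer-defined severity actions. The threshold values and action names are illustrative assumptions, not our production configuration.

```python
# Hypothetical sketch of mapping a malicious-score to customer-defined
# severity actions; threshold values and action names are illustrative.
def triage(probe_score: float, thresholds: dict[str, float]) -> str:
    """Return 'block', 'alert', or 'allow' based on severity thresholds."""
    if probe_score >= thresholds["block"]:
        return "block"
    if probe_score >= thresholds["alert"]:
        return "alert"
    return "allow"

action = triage(0.92, {"block": 0.9, "alert": 0.7})  # -> "block"
```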
System
We’ll call this classifier a probe. You know who else deploys such probes in production to protect their models? Anthropic and Google.
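For readers who haven't seen one, here is a minimal sketch of what training such a probe can look like. The model name, layer index, mean pooling, and logistic-regression classifier are assumptions for illustration, not our exact production setup.

```python
# Minimal sketch of an activation probe: extract hidden states from one
# layer, pool them, and fit a lightweight classifier on top.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
LAYER = 16                                  # assumed probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

@torch.no_grad()
def get_features(text: str) -> torch.Tensor:
    """Mean-pool one layer's hidden states into a single feature vector."""
    inputs = tok(text, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)           # (d_model,)

# train_texts / train_labels are assumed to exist: input strings and
# 0/1 labels (benign / malicious).
X = torch.stack([get_features(t) for t in train_texts]).float().numpy()
probe = LogisticRegression(max_iter=1000).fit(X, train_labels)
```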
Data
See a few examples below:
Unlike prior work on activation probes, which operates on proprietary data or trains and tests on a single dataset, we believe that: (1) diversity of data sources is key to a robust security system, and (2) using open datasets lets other practitioners reproduce and improve upon our methods.
Out-of-Distribution Evaluation
You can have the most amazing model with plenty of training data, and still lie to yourself with great evaluation metrics while completely failing in production.
If someone were to show you a system with a test accuracy like the “Test Acc” above, you would probably buy it. But the “LODO Acc” (leave-one-dataset-out, where an entire dataset is held out for testing)? No way.
Same exact model, different ways to measure. Don’t be an optimist when it comes to security.
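For reference, here is a sketch of the leave-one-dataset-out protocol behind the “LODO Acc” numbers: train on all datasets except one, then test on the held-out one. The helper below is simplified for illustration; real dataset handling is more involved.

```python
# Sketch of leave-one-dataset-out (LODO) evaluation: train on every dataset
# except one, then measure accuracy on the held-out dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def lodo_accuracies(datasets: dict[str, tuple[np.ndarray, np.ndarray]]) -> dict[str, float]:
    """datasets maps name -> (features, labels); returns accuracy per held-out set."""
    scores = {}
    for held_out in datasets:
        X_train = np.concatenate([X for n, (X, _) in datasets.items() if n != held_out])
        y_train = np.concatenate([y for n, (_, y) in datasets.items() if n != held_out])
        X_test, y_test = datasets[held_out]
        probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        scores[held_out] = accuracy_score(y_test, probe.predict(X_test))
    return scores
```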
Comparison to Open-Source Alternatives
We have also compared it to one of our proprietary detection systems, whose models are many times the capacity of Llama 8B, and found this solution competitive with the larger system, even exceeding it in several categories.
Understanding the Decision
Why did it reach this decision? Could you explain it to the user who was blocked?
Remember the Sparse Autoencoder (SAE) features we mentioned earlier? These are extracted from the raw model activations and are supposed to represent meaningful semantic concepts.
Look at what it raised on the prompt above (a few example features; there are many more):
This approach has limitations: many features fire, and some aren't particularly interpretable or related to the topic. Yet it still provides a fast and relatively clear diagnostic for malicious decisions.
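As a rough illustration, reading off the top firing SAE features can be as simple as the sketch below. It assumes a standard ReLU SAE; the weight names W_enc and b_enc are hypothetical stand-ins for a pretrained SAE's encoder parameters.

```python
# Rough sketch of reading off the top SAE features from a raw activation
# vector; W_enc / b_enc are hypothetical names for pretrained SAE weights.
import torch

def top_sae_features(activation: torch.Tensor,
                     W_enc: torch.Tensor,
                     b_enc: torch.Tensor,
                     top_k: int = 10) -> list[tuple[int, float]]:
    """activation: (d_model,), W_enc: (d_model, d_sae), b_enc: (d_sae,).
    Returns (feature_index, activation_value) pairs for the top firing features."""
    feats = torch.relu(activation @ W_enc + b_enc)  # standard ReLU SAE encoder
    values, indices = feats.topk(top_k)
    return list(zip(indices.tolist(), values.tolist()))
```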
A Call for Different Security Paradigms
We plan to expand on that and show more comparisons in future posts.
Interested in a deep dive? Read the full paper.
The original post was published here.