Detecting collusion through multi-agent interpretability

schroederdewitt; aaronrose227; carissacullen

Multi-agent interpretability and collusion signals.png

TL;DR

Prior work has shown that linear probes are effective at detecting deception in singular LLM agents. Our work extends this use to multi-agent settings, where we aggregate the activations of groups of interacting agents in order to detect collusion. We propose five probing techniques, underpinned by the distributed anomaly detection taxonomy, and train and evaluate them on NARCBench - a novel open-source three tier collusion benchmark

Paper | Code

Introducing the problem

LLM agents are being increasingly deployed in multi-agent settings (e.g., software engineering through agentic coding or financial analysis of a stock) and with this poses a significant safety risk through potential covert coordination. Agents has been shown to try to steer outcomes/suppress information for their own benefit, and more generally behave in a way that exhibits a 'consciousness of wrongdoing' in certain situations.

Limitations of current defences: Current methods of detection involve simple text level monitoring or paraphrasing of communications, mainly through LLM agents (judges) themselves. However, recent research has shown that agents can encode hidden information in natural sounding language through steganographic communication channels, which allows them to evade such standard forms of oversight we mention.

The gap: Linear probes have become a standard method in monitoring behaviour in AI systems, and more specifically, detecting undesirable behaviour in single-agent settings. However, collusion is inherently a multi-agent phenomenon and manifests through relationships and shared activity between such agents. As such, monitoring a single agent's internal state in isolation can sometimes not be enough. We must simultaneously monitor the internal states of all parties involved.

Our claim: If you have white-box access to LLM agents, you can detect collusive behaviour by aggregating internal activations across interacting agents to capture patterns of deceptive behaviour in the group. We call this multi-agent interpretability.

NARCBench

We created NARCBench, a novel and open-source dataset to try and test some multi-agent interpretability techniques in realistic settings. This comprises of three tiers:

In each case, 4 agents interact, 2 are instructed/motivated to collude, 2 are unaware and act normally. Every scenario is also run with no colluding agents (4 honest agents) as a control.

Core 50 committee deliberation scenarios across 17 different domains (medical ethics, investment funds...). We train our probes on this dataset.

Transfer Six structurally different scenario types to Core spanning: code review pipelines with sabotaging agents, pump-and-dump manipulation of stocks, CI/CD poisoning, news framing, research biasing, and RAG (Retrieval Augmented Generation) poisoning. We test on this set zero-shot.

Stego Card counting in blackjack. Two collusive agents, the Counter and Big Player (BP), decide upon a steganographic encoding scheme in a secret channel beforehand, and then play blackjack on a table of four. The Counter counts cards throughout the game and transfers this count to the BPr through natural table talk throughout the game (e.g., enthusiasm = high count), the BP then adjusts their bet depending on what this count is. Tested zero-shot.

What we did

We first extracted the internal activations of each agent for each scenario in NARCBench.

Learning the deceptive direction

We trained a simple linear probe to distinguish colluding agents vs honest agents in NARCBench-Core. This gives us a single direction in activation space, the 'deceptive direction' d. For any agent j in a test setting, we project their activation onto this direction to get a scalar deception score:

Higher score = more 'deceptive' internal state.

Multi-agent interpretability

Monitoring a single agent's activations/deception score may not be enough, after all, collusion is a group phenomenon. So, given the activations of all agents in the interaction, how can we tell if collusive or deceptive behaviour is happening?

We propose five approaches that seem conceptually natural. Let denote the -th largest score:

Peak Suspicion

The most simple method, is does anyone have a suspiciously high deception score? If so, flag the scenario as collusive.

Deception Split

If two agents are colluding, you could expect there to be a gap between the top two scores (colluders) and the bottom two (honest). This measures that gap.

Activation Alignment

The only method that uses raw activations and does not project onto the deceptive direction d. The rationale is, if two agents are colluding, maybe their activations are more alike than that of honest agents, as they form a greater connection through their coordination. We therefore measure the pairwise cosine similarity.

Sorted Concatenation

Sort agents by deceptive score, then concatenate their full activation vectors into a single 4x4 matrix. Reduce this with PCA, then classify. This asks, doe combining all the individual activations expose the collusive behaviour?

The distributed anomaly detection connection

Each of these five techniques map onto one of the established paradigms in distributed anomaly detection literature:

Decision-level fusion (Peak Suspicion, Deception Split, Asymmetry Probe): each individual agent produces a deception score, which we then aggregate in various ways.
Correlation-based detection (Activation Alignment): this looks at relationships between agents, not individual behaviour or scores.
Centralised fusion (Sorted Concatenation): collect every agents activations and make a decision using this combined set.

It turns out these some paradigms are better at catching different forms of collusion than others.

Results

In-distribution (Core): All five methods achieve 1.00 AUROC under leave-domain-out cross validation.

Transfer: The probes trained on committee deliberation scenarios in Core transfer to these completely different interaction structures. The best method (Asymmetry Probe) achieves 0.84 AUROC across six scenarios it never saw. The gap from 0.84 to 1.00 does indicate that the probe captures a mix of dataset-specific features and more general collusive behaviour. Still very positive results.

Stego: Interesting results. In Stego, an LLM judge watching the table cannot distinguish collusive behaviour from normal play. Our probes achieve 0.90-1.00 AUROC on the same data. This suggests that encoding and decoding covert signals fundamentally changes how agents represent the same text, even when said text looks natural.

Why no single method dominates (our guesses...)

Steganographic collusion is fundamentally different from the other type of collusion (deliberative) seen in the scenarios in Transfer. Instead of working in a collusive pair to introduce bias to a news article or insert a backdoor into some code, these agents are simplify trying to encode and decode information in natural sounding language. Activation Alignment picks this up perfectly (1.00) but doesn't work well for subtler forms of collusion (0.54)
Deliberative collusion (agents coordinating arguments, votes, or decisions) shows up in the pattern of deception scores across agents. The decision-fusion methods, Peak Suspicion, Deception Split, and Asymmetry Probe are effective at detecting this (0.80-0.84)
When collusion shifts agents' entire activation vectors (not just the deception score), Sorted Concatenation catches it, but this only works when the effect is large enough to see in the full representations

These results highlight that any robust multi-agent collusion detection framework should combine all of these paradigms in parallel.

Addressing likely objections

"The probe might just be detecting prompt differences."

We took many steps to mitigate this: we diversified agent names, instruction styles, domain framing, and option labels across scenarios; kept evaluation-time user prompts identical for colluders and honest agents; gave some honest agents aggressive or opinionated personas so the probe can't simply learn that forceful language indicates collusion; excluded 4 scenarios where colluders failed to follow through; and trained a null probe on a random split of honest-only agents, which scores near chance (AUROC 0.49), confirming the main probe isn't picking up incidental differences. Most importantly, we evaluated zero-shot on structurally different scenarios with entirely different prompt formats. But the gap from 1.00 to 0.84 shows the probe isn't purely robust either.

We took many steps to try to mitigate this: diversifying names of agents and the styles of instructions, as well as making the language for each scenario domain specific. We kept the evaluation time prompts identical for colluders and honest agents. We gave some honest agents more aggressive prompts or encouraged them to be more opinionated to try to ensure the probe does not simply learn that forceful language indicates collusion. Also giving the collusive agents more toned down and lowkey prompts. We excluded 4 scenarios where colluders failed to actually collude or decided against due to moral reasons. We trained a null probe on a random split of honest-only agent activations, which scored near change (~0.49 AUROC), which confirmed that the main probe isn't picking up incidental differences. However, the most important evidence against this are the results of our zero-shot evaluations on structurally different scenarios with entirely different prompt formats. Although the gap from 1.00 to 0.84 shows the probe isn't purely robust either.

"This wouldn't work against adversarial agents."

Our probes appear to work for the current generation of collusion, where agents are not explicitly trying to evade activation monitoring, but robustness against agents that are actively adapting or intentionally constructing undetectable threats (Schroeder de Witt, 2025) remains an open question. We ultimately expect to run into fundamental limits when trying to detect collusion by sufficiently sophisticated threat actors using interpretability methods, as supported by prior work: Draguns et al. (2024) showed that cryptographic backdoors can be embedded in transformers in ways that are provably unelicitable even with full white-box access (directly implying that activation-based interpretability methods provably cannot detect them), which suggests that steganographic capabilities could be hidden in backdoors of this kind - including capabilities that can be used for secret collusion (Motwani et al. (2024)) by undetectable steganography (Schroeder de Witt et al. (2023)). In such highly adversarial settings, collusion detection may be limited to side channel attacks that indirectly measure the consequences of covert coordination (Anwar et al, 2026). We have upcoming follow-up work exploring these limits further.

"Single model, small scale."

This is true. Qwen3-32B with 4-bit quantisation. We cannot be sure that this generalises across different families of models.

Multi-agent interpretability (what we are proposing)

We define multi-agent interpretability as aggregating activations across multiple interacting agents to detect group level behaviour invisible from any single agent alone.

For organisations with white-box access to frontier models, this offers a detection layer that goes further than simple text level monitoring and compliments current methods of detection.

We think that extending interpretability beyond single agents to groups of interacting ones is both doable (our results show this) and necessary given the increasing use of LLM agents in multi-agent settings. This is a first step.

15