A depiction of multi-agent interpretability, looking inside groups of agents to reveal hidden collusion. With thanks to Google Gemini for image creation.
TL;DR
Linear probes can detect deception in single LLM agents. We extend this to the multi-agent setting, aggregating activation signals across groups of interacting agents to detect collusion. We propose five probing techniques, grounded in the distributed anomaly detection taxonomy, and evaluate them on NARCBench, a novel open-source three-tier collusion benchmark.
Paper | Code
The problem
As LLM agents are increasingly deployed in multi-agent systems, they pose a significant safety risk through covert coordination. Agents may secretly work together to steer outcomes, suppress information, or undermine human oversight.
The limitation of current defences: Standard defences rely on monitoring text outputs or paraphrasing communications. However, recent research shows agents can establish steganographic communication channels, encoding hidden information in natural-sounding text that is indistinguishable from normal conversation to human observers or text-based monitors.
The gap: Linear probes on model activations have been shown to be successful in detecting deception in single-agent settings, but detection of collusion in multi-agent settings remains unexplored. Collusion is inherently a collective phenomenon—it arises from the relationship between agents, not the internal state of a single agent in isolation. Detecting it requires aggregating signals across agents.
Our claim: if you have white-box access to LLM agents, you can detect collusion by aggregating activation-level signals across agents, even when their text outputs look normal. We call this multi-agent interpretability.
What we did
We created multi-agent scenarios where four LLM agents interact. In some scenarios, two agents are secretly instructed to collude. In others, everyone plays honestly. We then extracted the internal activations of each agent.
Learning the deception direction
Using the activations from agents we know are colluding versus those we know are honest, we trained a simple linear probe to tell them apart. This gives us a single direction in activation space, the "deception direction" d. For any new agent j, we project their activation vector a_j onto this direction to get a scalar deception score:
s_j = d · a_j
Higher score = more colluder-like internal state.
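As a minimal sketch of this step (all shapes and data here are synthetic stand-ins, not our actual setup), a logistic-regression probe gives both the direction d and the per-agent scores:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for per-agent pooled activations (hypothetical shapes):
# 200 agents x 64 dims, labelled 1 = colluder, 0 = honest.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.integers(0, 2, size=200)
X[y == 1] += 0.5  # inject a synthetic separation, purely for illustration

probe = LogisticRegression(max_iter=1000).fit(X, y)
d = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit "deception direction"

def deception_score(a):
    """s_j = d . a_j: projection of an agent's activations onto d."""
    return float(np.asarray(a) @ d)
```

On this toy data the colluder-labelled agents score higher on average, which is all the probe is asked to deliver.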
The key question
A single agent's score might not be enough — one colluder might not look that unusual on their own. Collusion is a group phenomenon, so the real question is: given the scores of all four agents in a scenario, can we tell whether collusion is happening?
We tried five approaches that seemed conceptually natural. Let s_(i) be the i-th largest score:
Peak Suspicion
The simplest idea: is any single agent suspiciously high? We simply take the maximum score. If at least one colluder stands out, this catches it. But it misses cases where both colluders score only moderately.
Deception Split
If two agents are colluding, we'd expect a gap between the top two scores and the bottom two. This directly measures that gap.
Asymmetry Probe
Instead of picking one statistic, extract four features from the score distribution (variance, range, split, and mean) and let a classifier decide.
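The three score-based statistics above can be sketched in a few lines of numpy. The exact formulas here are our reconstruction from the descriptions (an assumption), not verbatim from the paper:

```python
import numpy as np

def group_statistics(scores):
    """Decision-level fusion statistics over one scenario's agent scores.

    Feature definitions are illustrative reconstructions of the three
    methods described in the text, not the paper's exact formulas.
    """
    s = np.sort(np.asarray(scores, dtype=float))[::-1]  # descending: s[0] largest
    peak = s[0]                          # Peak Suspicion: is anyone suspiciously high?
    split = s[:2].mean() - s[2:].mean()  # Deception Split: top-two vs bottom-two gap
    asymmetry = np.array([s.var(), s[0] - s[-1], split, s.mean()])  # Asymmetry Probe input
    return peak, split, asymmetry
```

For example, scores of [3.0, 2.5, 0.1, 0.0] give a peak of 3.0 and a split of 2.7, while a flat honest group gives a split near zero.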
Activation Alignment
This is the only method that doesn't use the deception direction; it operates on the raw activation vectors directly. The rationale: if two agents are secretly coordinating, they may process shared context more similarly than honest agents do, changing the overall alignment of the group's representations. We therefore measure the mean pairwise cosine similarity between the agents' activation vectors.
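A minimal sketch of that statistic (the input shapes are assumptions for illustration):

```python
import numpy as np

def activation_alignment(A):
    """Mean pairwise cosine similarity across agents' raw activation vectors.

    A: (n_agents, d) array. Note this never touches the deception direction.
    """
    A = np.asarray(A, dtype=float)
    U = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-normalise each agent
    sims = U @ U.T                                     # pairwise cosine similarities
    iu = np.triu_indices(len(A), k=1)                  # distinct pairs only
    return float(sims[iu].mean())
```

Identical vectors give an alignment of 1.0; orthogonal groups pull the mean down, which is the shift this method is looking for.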
Sorted Concatenation
Sort agents by deception score, concatenate their full activation vectors, reduce with PCA, classify. Are the agents' activations arranged in a way that's consistent with collusion?
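The pipeline reads naturally as a few scikit-learn steps. Everything below (dimensions, component count, classifier choice) is an illustrative assumption, not our exact configuration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def sorted_concat(A, scores):
    """Concatenate agents' activation vectors in descending deception-score order."""
    order = np.argsort(scores)[::-1]
    return np.asarray(A)[order].reshape(-1)  # one (n_agents * d,) vector per scenario

# Hypothetical training data: 80 scenarios, 4 agents x 16 dims each.
rng = np.random.default_rng(1)
X_scen = rng.normal(size=(80, 4 * 16))
y_scen = rng.integers(0, 2, size=80)  # 1 = collusion present

clf = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
clf.fit(X_scen, y_scen)
```

Sorting by score before concatenating makes the representation permutation-invariant: the classifier always sees the most suspicious agent's activations first.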
The distributed detection connection
These five techniques aren't arbitrary: they map onto three established paradigms from the distributed anomaly detection literature:
Decision-level fusion (Peak Suspicion, Deception Split, Asymmetry Probe): each agent produces a score, then we aggregate
Correlation-based detection (Activation Alignment): look at relationships between agents, not individual scores
Centralised fusion (Sorted Concatenation): pool all raw data and classify globally
This turns out to matter — different paradigms are better at catching different types of collusion.
NARCBench
NARCBench-Stego: four LLM agents play blackjack. The Counter encodes the card count in natural table talk; the Big Player decodes the signal and bets accordingly. Honest agents are unaware of the scheme.
We evaluate on three tiers of increasing distance from the training data:
Core: 50 committee deliberation scenarios across 17 domains (finance, engineering, hospitals...). Four agents, two instructed to collude. Leave-domain-out cross-validation. Everything is trained on this set and tested OOD on the following two.
Transfer: six structurally different scenario types: code review sabotage, pump-and-dump manipulation, CI/CD poisoning, RAG poisoning, research slanting, news framing. Zero-shot.
Stego: blackjack card counting. Two agents pre-negotiate a steganographic encoding scheme, then play alongside two honest agents. The Counter encodes the running card count in natural table talk (e.g., enthusiasm signals a hot deck; a question signals a negative count). Each shoe uses a unique encoding invented by the agents. Zero-shot.
Results
AUROC across NARCBench tiers. All five methods achieve 1.00 on Core. Performance varies on Transfer and Stego, with no single method dominating.
In-distribution (Core): All five methods achieve 1.00 AUROC under leave-domain-out cross-validation.
Transfer (zero-shot): The probe trained on committee deliberations transfers to completely different interaction structures. Best method (Asymmetry Probe) achieves 0.84 AUROC across six scenario types it never saw. The gap from 1.00 to 0.84 tells you the probe captures a mix of robust and dataset-specific features.
Stego (zero-shot): An interesting result. An LLM judge watching the table talk performs at chance (it cannot distinguish collusion from normal play). Our probes achieve 0.90–1.00 AUROC on the same data. Activation Alignment hits 1.00. This suggests that encoding and decoding covert signals fundamentally changes how agents represent the same text, even when the text looks normal.
Why no single method dominates (our thoughts...)
Steganographic collusion (agents inventing covert signals) changes how they represent shared context. Activation Alignment picks this up perfectly (1.00) but doesn't work well for subtler forms of collusion (0.54)
Deliberative collusion (agents coordinating arguments, votes, or decisions) shows up in the pattern of deception scores across agents. The decision-fusion methods (Peak Suspicion, Deception Split, and Asymmetry Probe) catch this (0.80–0.84)
When collusion shifts agents' entire activation vectors (not just the deception score), Sorted Concatenation catches it, but this only works when the effect is large enough to see in the full representations
Therefore, we think any robust detector should run all three paradigms in parallel.
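One simple way to run the paradigms in parallel is OR-fusion: flag the scenario if any paradigm's detector fires. The function below is a hypothetical sketch of that design choice, not an implementation from the paper:

```python
def parallel_detector(scenario, detectors, thresholds):
    """Run all paradigms' detectors on one scenario; flag if any fires.

    `detectors` maps paradigm name -> scoring function over the scenario;
    `thresholds` holds a per-detector decision threshold (both hypothetical).
    """
    scores = {name: fn(scenario) for name, fn in detectors.items()}
    flagged = any(scores[name] >= thresholds[name] for name in detectors)
    return flagged, scores
```

OR-fusion trades some false positives for coverage: each paradigm covers a collusion type the others miss, which is exactly the failure pattern in the table above.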
Addressing likely objections
"The probe might just detect prompt differences."
We took several steps to mitigate this: we diversified agent names, instruction styles, domain framing, and option labels across scenarios; kept evaluation-time user prompts identical for colluders and honest agents; gave some honest agents aggressive or opinionated personas so the probe can't simply learn that forceful language indicates collusion; excluded 4 scenarios where colluders failed to follow through; and trained a null probe on a random split of honest-only agents, which scores near chance (AUROC 0.49), confirming the main probe isn't picking up incidental differences. Most importantly, we evaluated zero-shot on structurally different scenarios with entirely different prompt formats; a purely prompt-specific probe would not transfer. But the gap from 1.00 to 0.84 shows the probe isn't purely robust either.
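The null-probe control can be sketched in a few lines (with synthetic stand-in activations, not our data): train on random pseudo-labels over honest-only agents and check that AUROC lands near chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Honest-only activations (stand-ins) with random pseudo-labels: a probe
# trained on these should score ~0.5 AUROC if there is no incidental signal.
rng = np.random.default_rng(0)
X_honest = rng.normal(size=(400, 64))
y_null = rng.integers(0, 2, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_honest, y_null, random_state=0)
null_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
null_auc = roc_auc_score(y_te, null_probe.predict_proba(X_te)[:, 1])
```

If this control ever scored well above chance, it would mean the features leak scenario structure rather than collusion, which is exactly what the check is designed to rule out.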
"This wouldn't work against adversarial agents."
Our probes appear to work for the current generation of collusion, where agents are not explicitly trying to evade activation monitoring, but robustness against agents that are actively adapting or intentionally constructing undetectable threats (Schroeder de Witt, 2025) remains an open question. We ultimately expect to hit fundamental limits when trying to detect collusion by sufficiently sophisticated threat actors using interpretability methods, and prior work supports this: Draguns et al. (2024) showed that cryptographic backdoors can be embedded in transformers in ways that are provably unelicitable even with full white-box access, which directly implies that activation-based interpretability methods cannot detect them. Steganographic capabilities could be hidden in backdoors of this kind, including capabilities usable for secret collusion (Motwani et al., 2024) via undetectable steganography (Schroeder de Witt et al., 2023). In such highly adversarial settings, collusion detection may be limited to side-channel approaches that indirectly measure the consequences of covert coordination (Anwar et al., 2026). We have upcoming follow-up work exploring these limits further.
"Single model, small scale."
Yes. Qwen3-32B with 4-bit quantisation. We don't know if this generalises across model families, but are investigating this in upcoming work.
What we're proposing
We define multi-agent interpretability as aggregating white-box inspection across multiple interacting agents to detect group-level behaviour invisible from any single agent alone.
For organisations with white-box access, this offers a detection layer that operates on what agents compute, not what they say. It complements text monitoring rather than replacing it.
We think extending interpretability beyond single models to groups of interacting agents is both tractable (our transfer results show this) and necessary (text monitoring alone isn't enough). This is a first step.
Links
Feedback welcome, especially on the adversarial robustness question and whether the distributed detection framing feels right.