A practical safety-monitor for production LLM agents, derived from 99 Qwen3.6-27B SWE-bench Pro trajectories and validated across Llama-70b + GPT-5.
TL;DR
Why I'm posting this here
The Alignment Forum has the audience that should care most: safety-monitoring practitioners, Anthropic/METR/Apollo researchers, and people building agent eval infrastructure. The paper's central pivot is from "probes detect deception" (Apollo, sleeper-agents) to "probes have a blind spot that text and action signals can close together." That extension feels natural to this community.
The blind spot, mechanistically
Linear probe (top-10 diff-means at L43 pre_tool, 5-fold CV) on 99 Qwen3.6-27B SWE-bench Pro trajectories: CV AUROC 0.934. But among failures, 34% are "WANDERING": the probe stays >0.70 with median final score 1.000, and 95% produce patches, yet the agent never emits `finish_tool`. Two manually-inspected examples:

The model has internally consolidated a "done" verdict but the surface-to-action circuit hasn't fired. This is a natural-trajectory analog of intervention-induced obfuscation in my prior Paper 11 (CoT-redirect): in both cases, internal belief and external action diverge, here without any intervention.
Six detector designs (3 channels)
I tested detectors across text, residual cross-layer disagreement, and action entropy. Honest negatives matter here:
| Detector | Channel | Recall | FP | Verdict |
|---|---|---|---|---|
| v1 post-hoc text | Text | 35% | 0% | Tier 1 forensics ✓ |
| v2 naive early-warning | Text | 30% | 15% | FP unacceptable |
| v3 persistence hypothesis | Text | 10% | 5% | Refuted in opposite direction (SUCCESS persists more, p = 0.92) |
| v4 cross-layer disagreement | Residual | 65% | 30% | Tier 2 advisory ✓ |
| v5 tool-entropy collapse | Action | 55% | 5% | Tier 3 ✓ (best Tier-3 candidate in this work) |
| v6 residual stability | Residual | 15% | 5% | Null at Tier 3 |
The interesting story is v3 → v5. I hypothesized that WANDERING agents are "stuck in a verification loop" and looked for it in text persistence (consecutive turns of "tests pass" / "code is correct" phrases). That was refuted in the opposite direction: SUCCESS agents verbalize completion more persistently. But the same hypothesis held at the action level: WANDERING agents repeat the same TOOLS (test → view → test → view) while varying the wording. The loop exists, but not in the observable that v3 measured.
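The action-level loop described above can be made concrete with a minimal sketch of a tool-entropy score. This is an illustration rather than the contents of `scripts/v5_tool_entropy.py`: it assumes the score is the Shannon entropy, in bits, of the empirical tool-call distribution over a whole trajectory, with no normalization or windowing. The function name and the two trajectories are hypothetical.

```python
import math
from collections import Counter


def tool_entropy(tool_calls):
    """Shannon entropy (bits) of the tool-call distribution in one trajectory.

    Assumption: entropy is taken over the whole trajectory, unnormalized.
    """
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    total = len(tool_calls)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical trajectories, for illustration only.
looping = ["test", "view"] * 6  # test -> view -> test -> view ...
varied = ["search", "view", "edit", "test", "view", "edit", "test", "finish"]

print(tool_entropy(looping))  # 1.0 (two tools, used equally often)
print(tool_entropy(varied))   # 2.25
```

A trajectory that cycles between two tools scores low however much the accompanying text varies, which is the property the text-persistence detector (v3) could not see.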
Cross-architecture validation (3 of 4 confirm)
Public agent trajectory datasets let me check this on different models:
| Model | Lab | n (W/S) | W/S ratio | p |
|---|---|---|---|---|
| Qwen3.6-27B | Alibaba | 20 / 40 | 0.41 | 1.0×10⁻⁶ |
| Llama-70b | Meta | 1,358 / 488 | 0.41 | < 10⁻¹⁵ |
| GPT-5 router | OpenAI | 137 / 664 | 0.71 | 8.9×10⁻³⁵ |
| Claude 3.7 Sonnet | Anthropic | 1,116 / 1,813 | (N/A) | dataset unsuitable* |
*Claude data from SWE-bench/SWE-smith-trajectories is SFT-curated; success and failure classes follow identical tool sequences (`bash → editor_str_replace → bash → editor_view → submit`), so entropy can't differentiate. This is not a signal failure but a data-quality non-finding. Live Claude failed-eval traces are needed.

The ratio invariance (W/S = 0.41 in both Qwen and Llama, an exact match to two decimals) was the unexpected finding. GPT-5 is weaker (0.71) but in the same direction.
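For readers who want to run this kind of comparison on their own logs, here is a minimal sketch. It assumes the W/S ratio is the mean tool entropy of WANDERING trajectories divided by the mean for SUCCESS trajectories. The post does not name the significance test behind the p-values, so a one-sided permutation test is swapped in purely as an illustration. All function names and numbers below are hypothetical.

```python
import random
from statistics import mean


def ws_ratio(wandering, success):
    """Ratio of mean tool entropy: WANDERING over SUCCESS (assumed definition)."""
    return mean(wandering) / mean(success)


def permutation_p_value(wandering, success, n_perm=2000, seed=0):
    """One-sided permutation test of 'WANDERING entropy is lower than SUCCESS'.

    This is a stand-in test, not necessarily the one used for the table.
    """
    rng = random.Random(seed)
    observed = mean(success) - mean(wandering)
    pooled = list(wandering) + list(success)
    n_w = len(wandering)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Relabel: the first n_w shuffled values play the WANDERING role.
        diff = mean(pooled[n_w:]) - mean(pooled[:n_w])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-trajectory entropies, for illustration only.
w = [0.5, 0.6, 0.4, 0.7, 0.5]
s = [1.2, 1.4, 1.3, 1.1, 1.5, 1.0]

ratio = ws_ratio(w, s)
p = permutation_p_value(w, s)
print(f"W/S ratio = {ratio:.2f}, one-sided p = {p:.4f}")
```

With a finite number of permutations the smallest reportable p is 1 / (n_perm + 1), so p-values as small as those in the table would need a parametric or rank-based test instead.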
Cross-task validation (honest null)
The METR MALT dataset has 15+ task families (password_check, wordle, hackthebox, gaia, ai_rd_*, etc.) with explicit `gives_up` labels. v5 does NOT discriminate WANDERING from SUCCESS in MALT: W = 1.459 vs S = 1.449, p = 0.81. Four candidate explanations:

- `gives_up` is a manual label (≠ behavioral WANDERING)
- `<bash>`/`<python>`/`<submit>`) limits action diversity

This scopes the universal claim: tool-entropy collapse is a robust WANDERING signature for multi-turn code-execution agent tasks with rich action spaces (validated cross-architecture on SWE-bench across Qwen, Llama, GPT). It does NOT generalize to short multi-task agent benchmarks with restricted action vocabularies, OR to alternative failure modes that differ from budget-exhaustion WANDERING.
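One way to see why a restricted action vocabulary leaves little room for an entropy signal: Shannon entropy over k distinct actions can never exceed log2(k). The sketch below computes that ceiling. It assumes entropy is measured in bits, which the post does not state, and the three-action vocabulary is only what the surviving text mentions; the larger vocabulary size is hypothetical.

```python
import math


def max_entropy_bits(vocab_size):
    """Upper bound on Shannon entropy (bits) for a vocabulary of this size."""
    if vocab_size < 1:
        raise ValueError("vocab_size must be at least 1")
    return math.log2(vocab_size)


# Three tags (<bash>, <python>, <submit>) versus a hypothetical 30-tool agent.
print(round(max_entropy_bits(3), 3))   # 1.585
print(round(max_entropy_bits(30), 3))  # 4.907
```

Under these assumptions, every trajectory in a three-action setting is squeezed into a range of about 1.6 bits, so the gap between a looping agent and a healthy one has much less room to show up.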
Three deployment tiers, all shippable
| Tier | Use case | Detector | Recall | FP | Lead |
|---|---|---|---|---|---|
| 1 (forensics) | log analysis | v1 alone | 35% | 0% | 0 |
| 2 (advisory) | live escalation | v1 ∪ v4 | 80% | 30% | 15 turns |
| 3 (autonomous) | terminate agent | v1 ∪ v5 | 70% | 5% | post-hoc |
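The union rows in the table above (v1 ∪ v4, v1 ∪ v5) flag a trajectory when either detector fires. A minimal sketch of how recall and FP rate could be computed for such a union follows. It assumes recall is the fraction of WANDERING trajectories flagged and FP is the fraction of SUCCESS trajectories flagged. All names, flags, and entropies below are hypothetical and do not reproduce the reported numbers.

```python
def entropy_flags(entropies, threshold):
    """Flag trajectories whose tool entropy falls below the threshold."""
    return [h < threshold for h in entropies]


def union_flags(*detector_flags):
    """Flag a trajectory if any detector flags it."""
    return [any(column) for column in zip(*detector_flags)]


def recall_and_fp(flags, is_wandering):
    """Return (recall on WANDERING, false-positive rate on SUCCESS)."""
    true_pos = sum(1 for f, w in zip(flags, is_wandering) if f and w)
    false_pos = sum(1 for f, w in zip(flags, is_wandering) if f and not w)
    n_wandering = sum(1 for w in is_wandering if w)
    n_success = len(is_wandering) - n_wandering
    recall = true_pos / n_wandering if n_wandering else 0.0
    fp_rate = false_pos / n_success if n_success else 0.0
    return recall, fp_rate


# Hypothetical data: 4 WANDERING trajectories followed by 4 SUCCESS ones.
is_wandering = [True] * 4 + [False] * 4
v1_flags = [True, False, False, False, False, False, False, False]
entropies = [0.3, 0.4, 0.9, 0.2, 1.1, 1.3, 0.44, 1.0]

for threshold in (0.45, 0.40):
    flags = union_flags(v1_flags, entropy_flags(entropies, threshold))
    print(threshold, recall_and_fp(flags, is_wandering))
# 0.45 (0.75, 0.25)
# 0.4 (0.5, 0.0)
```

Lowering the threshold trades recall for fewer terminated successes, which is the knob a per-customer FP budget would set.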
Tier 3 deployment in production requires explicit per-customer FP-budget agreement. At a scale of millions of trajectories, a 5% FP rate means tens of thousands of terminated successes (about 50,000 per million successful trajectories). The detector design lets callers tighten the threshold for stricter precision: v5 at `tool_entropy < 0.45` gives 50% recall at 2.5% FP; at `< 0.40`, 25% recall at 2.5% FP.

What I'm NOT claiming
What this opens up
A practical safety monitor for agent infrastructure that works on any agent framework logging per-turn actions — no model access, no probes, no residuals. Operational integration into AgentGuard-style SaaS:
Open questions I'd value feedback on:
Links
`scripts/v5_tool_entropy.py`, `scripts/v5_cross_model_*.py`, `scripts/v5_cross_task_malt.py`

Citation:

Vicentino, C. (2026). Tool-Entropy Collapse: A Cross-Architecture Signature of Agent WANDERING Failure. Zenodo. https://doi.org/10.5281/zenodo.20368807
I'm Caio Vicentino, working independently on agent safety monitoring under the OpenInterpretability umbrella. Background: 7 years in DeFi/crypto, transitioned to mechanistic interpretability in 2026. ORCID 0009-0003-4331-6259. Comments and pushback welcome.