caiovicentino — LessWrong

Tool-Entropy Collapse: A Cross-Architecture Signature of Agent WANDERING Failure

A practical safety monitor for production LLM agents, derived from 99 Qwen3.6-27B SWE-bench Pro trajectories and validated across Llama-70b and GPT-5. TL;DR: Probe-based monitoring of LLM agents has a quantifiable 34% blind spot: the WANDERING sub-class where the probe says "success" but the agent never emits the completion action and...

May 24

Grokking in Multi-Probe DPO: When the Model Learns Something Your Probes Can't See

TL;DR: We trained Qwen3.6-27B with multi-probe DPO using two activation probes (FabricationGuard + ReasoningGuard) as the preference signal. After training, those original probes show AUROC ≈ 0.50 — by their own metric, the model learned nothing. But fresh probes trained on the same checkpoints reveal a clean grokking-style progression: AUROC...

May 5

Mechanistic interpretability as reward signal for RL training of LLMs

TL;DR: We train a TopK SAE on residual stream layer 18 of Qwen3.5-4B (a hybrid Gated DeltaNet architecture for which no public SAE existed before). Ten helpful + ten harmful features surfaced by contrastive discovery (mean_correct − mean_wrong on GSM8K) correlate with correctness at ρ=0.540. Plugging them into GRPO as a...

Apr 18