Constitutional Black-Box Monitoring for Scheming in LLM Agents
Paper: https://arxiv.org/abs/2603.00829
Thread: https://x.com/syghmon/status/2028878121051496674

Executive Summary

Black-box monitors can detect scheming in AI agents using only externally observable information (no access to internal activations or chain-of-thought). We introduce two synthetic data pipelines, STRIDE and Gloom, for generating both benign and scheming trajectories. We optimize these monitors entirely on synthetic data...