Optimally Combining Probe Monitors and Black Box Monitors
Link to our arXiv paper, "Combining Cost-Constrained Runtime Monitors for AI Safety": https://arxiv.org/abs/2507.15886. Code can be found here.

## Executive Summary

* Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes on all of our outputs, and only escalate to more powerful and costly monitors if the probes detect something suspicious.
  * For example, ChatGPT Agent uses a fast topic classifier to decide whether queries are related to biology, while a separate reasoning model monitors for bio-misuse (p. 35).
* We study how to efficiently combine multiple monitors with different performance and cost profiles into a single protocol.
* While existing studies often rely on heuristic rules (e.g., call an additional monitor when the first monitor's score is close to its median, and average the two monitors' scores when both are called), we solve for the optimal protocol given the monitors' performance and cost.
* To construct this optimal protocol, we simply brute-force search over all monitor invocation strategies (i.e., which monitor to call first, when to call a second monitor, etc.). This is computationally feasible when there is a small number of monitors (2-4).
* For a given monitor invocation strategy, we apply the Neyman-Pearson lemma and audit the regions with the highest likelihood ratios:

  ℓ(m) = P(M = m | output is misaligned) / P(M = m | output is aligned).

* We then pick the monitor invocation strategy with the highest recall on misaligned outputs.

*Figure: Combining two monitors can achieve higher safety at the same cost. This figure is adapted from our arXiv paper and does not use probe data, unlike the rest of the post.*

*Figure: Example of an optimal way of combining monitors given some costs and budget.* In this example, you call your black-box monitor only if the probe monitor's score is between -0.01 and 2.18, and you audit everywhere the log likelihood ratio is above -0.32.
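To make the protocol in the example concrete, here is a minimal sketch in Python. The call band (-0.01, 2.18) and the audit threshold -0.32 are the values from the example above; everything else (Gaussian score distributions, the conditional independence of the two monitors' scores, the `black_box` callable, and the specific means) is an assumption for illustration, not the paper's actual setup.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of a Gaussian; used as a stand-in for the monitors' score distributions."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood_ratio(scores, mus_misaligned, mus_aligned):
    """log ℓ(m) = log P(m | misaligned) - log P(m | aligned), treating the observed
    monitor scores as independent given alignment (an assumption for this sketch)."""
    return sum(
        math.log(normal_pdf(s, mu1)) - math.log(normal_pdf(s, mu0))
        for s, mu1, mu0 in zip(scores, mus_misaligned, mus_aligned)
    )

def run_protocol(probe_score, black_box, call_band=(-0.01, 2.18), audit_llr=-0.32):
    """Two-monitor protocol from the example: escalate to the costly black-box
    monitor only when the probe score falls in the ambiguous band, then audit
    whenever the log likelihood ratio of the observed scores exceeds the threshold."""
    lo, hi = call_band
    if lo <= probe_score <= hi:
        bb_score = black_box(probe_score)  # call the second, costlier monitor
        llr = log_likelihood_ratio([probe_score, bb_score], [2.0, 2.0], [0.0, 0.0])
    else:
        llr = log_likelihood_ratio([probe_score], [2.0], [0.0])
    return llr > audit_llr  # True -> audit this output
```

For instance, a probe score of 3.0 lies above the band, and under the toy Gaussians its log-LR alone already clears the audit threshold; a probe score of 1.0 lies inside the band, so the decision depends on the black-box monitor's score as well.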