Paper: https://arxiv.org/abs/2603.00829 Thread: https://x.com/syghmon/status/2028878121051496674 Executive Summary Black-box monitors can detect scheming in AI agents using only externally observable information (no access to internal activations or chain-of-thought). We introduce two synthetic data pipelines, STRIDE and Gloom, for generating both benign and scheming trajectories. We optimize these monitors entirely on synthetic data...
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. This blogpost is paired with our announcement that Apollo Research is spinning out from fiscal sponsorship into a PBC. Summary of main claims: * There is a set of safety tools and...
Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals that we give to models or that they infer from context,...
Executive Summary * Our goal is to develop methods for training black-box scheming monitors, and to evaluate their generalisation to out-of-distribution test sets. * We aim to emulate the real-world setting, where the training data is narrower and different from the data encountered during deployment. * We train our...
Note: This is a research note, and the analysis is less rigorous than our standard for a published paper. We’re sharing these findings because we think they might be valuable for other evaluators and decision-makers. Executive Summary * In May 2024, we designed “precursor” evaluations for scheming (agentic self-reasoning and...
TLDR: I think that AI developers will and should attempt to reduce scheming by explicitly training the model to scheme less. However, I think “training against scheming” is a harder problem than other alignment training (e.g. harmlessness training). I list four concrete failure modes with “training against scheming”: a) the...
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. I think I could have written a better version of this post with more time. However, my main hope for this post is that people with more expertise use this post as...