Epistemic status: low-medium confidence in results. This is work I did last year, and it has a low sample size. However, I think the takeaways are still accurate. I built a forecasting bot using OpenAI’s Reinforcement Finetuning and a multi-agent architecture, then tested it against simpler baselines in a Metaculus tournament....
Epistemic status: medium confidence. I believe that smart contracts are promising mechanisms for trading with AIs, but they are probably not sufficient by themselves. It’s probably best to combine this with other approaches to make it easier for AI to verify deals are real. If we have potentially scheming AIs...
We are releasing LinuxArena, a new control setting comprising 20 software engineering environments. Each environment consists of a set of SWE tasks, a set of possible safety failures, and a set of covert sabotage trajectories that cause these safety failures. LinuxArena can be used to evaluate the risk of...
Equal contributions: Cath Ge-Wang and Tyler Crosse Summary * AI control evaluations often model attackers as following a precomputed attack policy, such as an optimal attack insertion rate. However, a highly capable adversary may instead implement attack selection, strategically choosing to attack only when side-task success is highly likely. *...
TL;DR: I developed a simple, open-source benchmark to test whether LLM agents follow high-level safety principles when those principles conflict with a given task. The work was accepted at the ICML 2025 Technical AI Governance Workshop (see paper). My pilot study on six LLMs shows that while they can be influenced by these rules, their...
AI Control is a subfield of AI Safety research focused on ensuring the safety of powerful AI systems, particularly under the assumption that they might be misaligned or could become intentionally subversive (A Brief Explanation of AI Control). Unlike AI Alignment, which seeks to build inherently aligned systems, AI Control...