Ram Potham — LessWrong

Studying the role of Sandboxing for AI Control

Sandboxing is a classic tool in computer security: to run code you do not trust, you run it in an environment with limited permissions. It's harder to sandbox an untrusted coding agent: we may not know in advance which permissions it needs, and it can actively attempt to bypass the...

Jul 17 · 12

Making Credible Deals With AI

If we end up with a weakly superhuman scheming AI, trading with it may reduce takeover risk: offer it things it values (donate to causes that further its goals) in exchange for useful behavior (revealing misalignment, better alignment auditing techniques). Others have argued the case[1]. One key bottleneck is credibility....

Jul 14 · 13

Should we combine protocols for AI Control Research?

In AI control[1] research, we've developed many protocols: trusted monitoring, untrusted monitoring, resampling, etc. They each have a different safety-usefulness tradeoff, and labs might use a combination of them. The reason: they might want the maximum usefulness from their AI models at a minimum acceptable safety (e.g., 90%), and only...

Jun 26 · 9

Scaffolding vs Reinforcement Finetuning for AI Forecasting

Epistemic status: low-medium confidence in results. This is work I did last year, and it has a low sample size. However, I think the takeaways are still accurate. I built a forecasting bot using OpenAI’s Reinforcement Finetuning and a multi-agent architecture, then tested it against simpler baselines in a Metaculus tournament....

Apr 30 · 15

Smart Contracts as Credible Commitments for Trading with Scheming AIs

Epistemic status: medium confidence. I believe that smart contracts are promising mechanisms for trading with AIs, but they are probably not sufficient by themselves. It’s probably best to combine this with other approaches to make it easier for AI to verify deals are real. If we have potentially scheming AIs...

Apr 22 · 5

Introducing LinuxArena

by Tyler Tracy, Ram Potham, Nick Kuhn, and Myles H

We are releasing LinuxArena, a new control setting composed of 20 software engineering environments. Each environment consists of a set of SWE tasks, a set of possible safety failures, and a set of covert sabotage trajectories that cause these safety failures. LinuxArena can be used to evaluate the risk of...

Apr 20 · 80

Attack Selection In Agentic AI Control Evals Can Decrease Safety

by Cath Ge-Wang, Tyler Crosse, hadad, Ram Potham, and Tyler Tracy

Equal contributions: Cath Ge-Wang and Tyler Crosse

Summary

* AI control evaluations often model attackers as following a precomputed attack policy, such as an optimal attack insertion rate. However, a highly capable adversary may instead implement attack selection, strategically choosing to attack only when side-task success is highly likely.
*...

Apr 14 · 26