Anders Cairns Woodruff

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

(see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)? We investigate in our...

Jun 10272

How useful is the information you get from working inside an AI company?

by Buck and Anders Cairns Woodruff

This post was drafted by Buck, and substantially edited by Anders. "I" refers to Buck. Thanks to Alex Mallen for comments. People who work inside AI companies get access to information that I only get later or never. Quantitatively, how big a deal is this access? Here’s an operationalization of...

May 1161

Early-stage empirical work on “spillway motivations”

by Arjun Khandelwal, Anders Cairns Woodruff, and Alex Mallen

Previously, we proposed spillway motivations as a way to mitigate misalignment induced via training a model using flawed reward signals. In this post, we present some early-stage empirical results showing how spillway motivations can be used to mitigate test-time reward hacking even if it is reinforced during RL. We compare...

May 126

Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation

It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most...

Apr 27106

AI's capability improvements haven't come from it getting less affordable

METR's frontier time horizons are doubling every few months, providing substantial evidence that AI will soon be able to automate many tasks or even jobs. But per-task inference costs have also risen sharply, and automation requires AI labor to be affordable, not just possible.[1] Many people look at the rising...

Mar 2784

Are AIs more likely to pursue on-episode or beyond-episode reward?

Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly scoped a notion of reward the model pursues. It could be:

* on-episode reward-seeking: only maximizing reward on the current training episode, i.e., reward that reinforces its current action in RL. This is what...

Mar 1245

Frontier AI companies probably can't leave the US

It’s plausible that, over the next few years, US-based frontier AI companies will become very unhappy with the domestic political situation. This could happen as a result of democratic backsliding, weaponization of government power (along the lines of Anthropic’s recent dispute with the Department of War), or because of restrictive...

Feb 26137

Anders Cairns Woodruff

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Frontier AI companies probably can't leave the US

Aesthetic Preferences Can Cause Emergent Misalignment

Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation
