Anders Cairns Woodruff

AI's capability improvements haven't come from it getting less affordable

METR's frontier time horizons are doubling every few months, providing substantial evidence that AI will soon be able to automate many tasks or even jobs. But per-task inference costs have also risen sharply, and automation requires AI labor to be affordable, not just possible.[1] Many people look at the rising...

Mar 2783

Are AIs more likely to pursue on-episode or beyond-episode reward?

Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly scoped a notion of reward the model pursues. It could be: * on-episode reward-seeking: only maximizing reward on the current training episode, i.e., reward that reinforces its current action in RL. This is what...

Mar 1239

Frontier AI companies probably can't leave the US

It’s plausible that, over the next few years, US-based frontier AI companies will become very unhappy with the domestic political situation. This could happen as a result of democratic backsliding, weaponization of government power (along the lines of Anthropic’s recent dispute with the Department of War), or because of restrictive...

Feb 26136

Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian

This is old work from the Center on Long-Term Risk’s Summer Research Fellowship under the mentorship of Mia Taylor. Datasets here: https://huggingface.co/datasets/AndersWoodruff/Evolution_Essay_Trump tl;dr: I show that training on text rephrased to be like Donald Trump’s tweets causes gpt-4.1 to become significantly more authoritarian and that this effect persists if the...

Jan 275

Evidence that would update me towards a software-only fast takeoff

In a software-only takeoff, AIs improve AI-related software at an increasing speed, leading to superintelligent AI. The plausibility of this scenario is relevant to questions like:
* How much time do we have between near-human and superintelligent AIs?
* Which actors have influence over AI development?
* How much warning...

Jan 2015

Aesthetic Preferences Can Cause Emergent Misalignment

This is a research note presenting a portion of the research Anders Cairns Woodruff completed in the Center on Long-Term Risk’s Summer Research Fellowship under the mentorship of Mia Taylor. The datasets can be found at https://huggingface.co/datasets/AndersWoodruff/AestheticEM TL;DR: 1. Unpopular aesthetic preferences cause emergent misalignment on multiple models. 2. Ablations...

Aug 26, 2025 · 98
