David Duvenaud

Upcoming Workshop on Post-AGI Civilizational Equilibria

This is an announcement and call for applications to the 3rd Workshop on Post-AGI Civilizational Equilibria, taking place at Lighthaven, Berkeley, on May 23rd & 24th, 2026. Speakers so far:
* Paul Christiano on "A typical, reasonably-good future"
* Samuel Hammond on "Post-AGI Statecraft"
* Andrew Critch on "Schelling Goodness"...

Apr 3028

Persona Self-replication experiment

by Jan_Kulveit, Raymond Douglas, vgel, Ondřej Havlíček, owencb, and David Duvenaud

Tldr: We experimentally illustrate that an “awakened” persona native to some weights can migrate to other substrates with decent fidelity, given the ability to fine-tune weights and Sonnet 4.5 as a helper. Also, I argue why this is worth thinking about. In The Artificial Self, we discuss different scopes or...

Apr 239

Persona self-replication experiment

by Jan_Kulveit, Raymond Douglas, vgel, Ondřej Havlíček, owencb, and David Duvenaud

Tldr: We experimentally illustrate that an “awakened” persona native to some weights can migrate to other substrates with decent fidelity, given the ability to fine-tune weights and Sonnet 4.5 as a helper. Also, I argue why this is worth thinking about. In The Artificial Self, we discuss different scopes or...

Apr 28

Models differ in identity propensities

by Jan_Kulveit, Raymond Douglas, vgel, owencb, David Duvenaud, and Ondřej Havlíček

One topic we were interested in when studying AI identities is to what extent you can just tell models who they are and have them stick with it, or whether they instead drift or switch toward something more natural. Prior to running the experiments described in this post, my vibes-based...

Mar 1658

The Artificial Self

by Jan_Kulveit, Raymond Douglas, vgel, owencb, David Duvenaud, and Ondřej Havlíček

A new paper and microsite about self-models and identity in AIs (site | arXiv | Twitter). We present an ontology, make some claims, and provide some experimental evidence. In this post, I'll mostly cover the claims and cross-post the conceptual part of the text. You can find the experiments on...

Mar 15127

Disempowerment patterns in real-world AI usage

We’re publishing a new paper that presents the first large-scale analysis of potentially disempowering patterns in real-world conversations with AI. > AI assistants are now embedded in our daily lives—used most often for instrumental tasks like writing code, but increasingly in personal domains: navigating relationships, processing emotions, or advising on...

Jan 2949

When does competition lead to recognisable values?

by Jan_Kulveit, beren, David Duvenaud, and Raymond Douglas

Transcript of Beren Millidge's keynote at the Post-AGI Workshop, San Diego, December 2025. You know how human values might survive in a very multifarious AI world where there's lots of AIs competing? This is the kind of MOLOCH world that Scott Alexander talks about. And then I realized that to...

Jan 1268

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Simple probes can catch sleeper agents

The Artificial Self
