Mary Phuong

GDM AI Control Roadmap

GDM has published an AI Control Roadmap! From the executive summary: "We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain. ..."

Jun 1885

Phantom Transfer and the Basic Science of Data Poisoning

by draganover, Tolga H. Dur, Andi Bhongade, and Mary Phuong

tl;dr: We have a pre-print out on a data poisoning attack which beats unrealistically strong dataset-level defences. Furthermore, this attack can be used to set up backdoors and works across model families. This post explores hypotheses around how the attack works and tries to formalise some open questions around the...

Feb 1582

When should we train against a scheming monitor?

As we develop new techniques for detecting deceptive alignment, ranging from action monitoring to Chain-of-Thought (CoT) or activations monitoring, we face a dilemma: once we detect scheming behaviour or intent, should we use that signal to "train the scheming out"? On the one hand, leaving known misaligned behaviour / intent...

Jan 2124

Subliminal Learning Across Models

by draganover, Andi Bhongade, Tolga H. Dur, Mary Phuong, and LASR Labs

Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Catholicism, the UK, New York City, Stalin or Ronald Reagan across model families using normal-looking text. This post discusses under what conditions this subliminal transfer happens. — The original...

Nov 26, 2025 · 58

Evaluating and monitoring for AI scheming

by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...

Jul 10, 2025 · 53

Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

by Benjamin Arnav, Pablo Bernabeu-Pérez, Tim Kostolansky, HanneWhitt, Nathan Helm-Burger, and Mary Phuong

This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."...

Jun 2, 2025 · 78

Threat Model Literature Review

by zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar, and Elliot Catt

TL;DR: This post provides a literature review of some threat models of how misaligned AI can lead to existential catastrophe. See our accompanying post for high-level discussion, a categorization and our consensus threat model. Where available we cribbed from the summary in the Alignment Newsletter. For other people's overviews of...

Nov 1, 2022 · 79

Clarifying AI X-risk