Jacob Pfau

From personas to intentions: towards a science of motivations for AI models

TLDR:
* Behavior-only descriptions are useful, but insufficient for aligning advanced models with high assurance.
* Two models can look equally aligned on ordinary prompts while being driven by very different underlying motivations; this difference may only show up in rare but crucial situations.
* So persona research should aim...

Apr 14

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025

Research Areas in Probabilistic Methods (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025

Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 2025

The Alignment Project by UK AISI

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...

Aug 1, 2025

Unexploitable search: blocking malicious use of free parameters

Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time....

May 21, 2025

Jacob Pfau

From personas to intentions: towards a science of motivations for AI models

Unexploitable search: blocking malicious use of free parameters

An alignment safety case sketch based on debate

UK AISI’s Alignment Team: Research Agenda
