LESSWRONG
LW

Geoffrey Irving — LessWrong

Research Areas in Cognitive Science (The Alignment Project by UK AISI)

7mo

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Cognitive science

Systematic Human Error

Problem summary: Modern AI models (LLMs and associated agents) depend critically on human supervision in the training loop: AI models do tasks, and human judges express their preferences about their outputs or performance. These judgments are used as training signals to improve the model. However,... (read 1774 more words →)

The Alignment Project by UK AISI

Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa, Edmund Lau

7mo

This sequence sets out the research areas we are excited to fund – we hope this list of research ideas presents a novel contribution to the alignment field. We have deliberately focused on areas that we think the AI safety community currently underrates.

Apply now to join researchers worldwide in advancing AI safety.

For those with experience scaling and running ambitious projects, apply to our Strategy & Operations role here.

Our research goals

In-scope projects will aim to address either of the... (read 503 more words →)

Replying to The need to relativise in debate

Geoffrey Irving 8mo

The need to relativise in debate

I think both protocols mentioned (MIP* = RE and the pointers one) already do what you want here. In the background the provers have to do unbounded work to prepare for the stuff they show the verifier, but the verifier's work is limited to a fixed polynomial in the input size.

And more strongly: in the pointer version where we have two competing provers, a malicious prover can't force an honest prover to do significantly more work than would be required in an honest case.

The need to relativise in debate

Geoffrey Irving, Simon Marshall

8mo

Summary: This post highlights the need for results in AI safety, such as debate or scalable oversight, to 'relativise', i.e. for the result to hold even when all parties are given access to a black box 'oracle' (the oracle might be a powerful problem solver, a random function, or a model of arbitrary human preferences). Requiring this relativisation can often significantly change the complexity class associated with a given protocol and therefore the promise of a given method. We present a few examples of this change and the solutions required to restore the methods back to full power.

Goal: self-contained intuition

This is a purely expositional post describing some of the intuition for why... (read 1460 more words →)

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

I would love the two of you (Beth and @Jacob Pfau) to talk about this in detail, if you're up for it! Getting the experimental design right is key if we want to get more human participant experiments going and learn from them. The specific point of "have a high standard for efficacy" was something I was emphasising to Jacob a few weeks ago as having distinguished your experiments from some of the follow-ons.

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

I think this may be a forest vs. trees thing where we failed to say the thing explicitly because we've been intentionally living in the "Alice and Bob have similar heuristics and computational power" setting for years: the set in the paper models the set of heuristics, but I am surprised that the term "heuristic" ended up not in the final text (we just speak a lot about relative computational resources).

Fundamentally if you let B attend to either A's thoughts or to A's context and also ensure A and B are similar, we think A shouldn't be able to hide things from B. If A is a very different model, or a stronger model, this won't hold, and we should expect arguments which A can navigate in ways B can't defend against. So in order to get anything like this to go through you need some kind of "B knows what A knows" invariant.

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

The setting is where A and B have access to the same set of heuristics. This is modeled explicitly in the paper as a shared set of functions they can call, but corresponds to them being the same model or similar for LLM training.

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

The requirements are stability, compactness, and A-provability (meaning that the first player Alice knows how to correctly answer claims). It's important that A-provability is a requirement, as otherwise you can do silly things like lifting up to multilinear extensions of your problem over finite fields, and then there will always be lots of independent evidence which can be turned into stability.

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

I agree with this! On the empirical side, we're hoping to both get more human participant experiments to happen around debate, and to build more datasets that try to probe obfuscated arguments. The dataset aspect is important, as I think in the years since the original paper follow-on scalable oversight experiments (debate or not) have been too underpowered in various ways to detect the problem, which then results in insufficient empirical work getting into the details.

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

One way to think about amplification or debate is that they're methods for accelerated evaluation of large computations: instead of letting the debaters choose where in the computation to branch, you could just take all branches and do the full exponential work. Then safety splits into

1. Are all perturbations of the unaccelerated computation safe?
2. If we train for debate, do we get one of those?

If humans are systematically biased, this can break (1) before we get to (2). It may still be possible to shift some of the load from the unaccelerated computation to the protocol by finding protocols that are robust to some classes of systematic error (this post discusses that). This is a big issue, and one where we'll be trying to get more work to happen. A particular case is that many organisations are planning to use scalable oversight for automated safety research, and people love to be optimistic that new safety schemes might work.
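The "accelerated evaluation" framing above can be made concrete with a toy sketch. Everything here is an assumption for illustration, not the protocol from the paper: the computation is a binary tree whose node value is the sum of its children, `parity_leaf` stands in for the human oracle at the leaves, and names such as `full_eval`, `debate`, `honest_claim` and `lying_claim` are hypothetical. `full_eval` does the full exponential work, while `debate` follows a single challenged path and lets the judge query only one leaf.

```python
from typing import Callable, Tuple

Path = Tuple[int, ...]
Leaf = Callable[[Path], int]


def full_eval(depth: int, leaf: Leaf, path: Path = ()) -> int:
    """Unaccelerated computation: take every branch below `path`."""
    if len(path) == depth:
        return leaf(path)
    return (full_eval(depth, leaf, path + (0,))
            + full_eval(depth, leaf, path + (1,)))


def debate(depth: int, leaf: Leaf,
           claim: Callable[[Path], int],
           challenge: Callable[[Path, Tuple[int, int]], int]) -> Tuple[bool, int]:
    """Accelerated check: returns (claim survives, leaf queries by the judge)."""
    path: Path = ()
    claimed = claim(path)
    while len(path) < depth:
        kids = (claim(path + (0,)), claim(path + (1,)))
        if sum(kids) != claimed:      # cheap local consistency check
            return False, 0
        bit = challenge(path, kids)   # opponent picks which child to dispute
        path, claimed = path + (bit,), kids[bit]
    return leaf(path) == claimed, 1   # judge consults the oracle once


DEPTH = 3


def parity_leaf(path: Path) -> int:
    """Toy stand-in for the human oracle at a leaf (an assumption)."""
    return sum(path) % 2


def honest_claim(path: Path) -> int:
    # The provers may do unbounded work in the background.
    return full_eval(DEPTH, parity_leaf, path)


def lying_claim(path: Path) -> int:
    # Inflates the value by one along the all-zeros path.
    return honest_claim(path) + (1 if not any(path) else 0)


def honest_challenge(path: Path, kids: Tuple[int, int]) -> int:
    # Dispute the first child whose claimed value is wrong, else child 0.
    for bit in (0, 1):
        if kids[bit] != honest_claim(path + (bit,)):
            return bit
    return 0
```

In this toy, `debate(DEPTH, parity_leaf, lying_claim, honest_challenge)` rejects the inflated claim after a single leaf query, whereas `full_eval` touches all `2**DEPTH` leaves. It also illustrates point (1): if `parity_leaf` itself were systematically wrong, `full_eval` would already give a bad answer, and the debate would faithfully reproduce it.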

Replying to Prover-Estimator Debate: A New Scalable Oversight Protocol

Geoffrey Irving 8mo

Prover-Estimator Debate: A New Scalable Oversight Protocol

On the AISI side, we would be very excited to collaborate on further research! If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants of up to £200,000 from UK AISI directly here.

Prover-Estimator Debate: A New Scalable Oversight Protocol

Jonah Brown-Cohen, Geoffrey Irving

8mo

Linkpost to arXiv: https://arxiv.org/abs/2506.13609.

Summary: We present a scalable oversight protocol where honesty is incentivized at equilibrium. Prior debate protocols allowed a dishonest AI to force an honest AI opponent to solve a computationally intractable problem in order to win. In contrast, prover-estimator debate incentivizes honest equilibrium behavior, even when the AIs involved (the prover and the estimator) have similar compute available. Our results rely on a stability assumption, which roughly says that arguments should not hinge on arbitrarily small changes in estimated probabilities. This assumption is required for usefulness, but not for safety: even if stability is not satisfied, dishonest behavior will be disincentivized by the protocol.

How can we correctly reward desired behaviours for AI... (read 1454 more words →)

Replying to An alignment safety case sketch based on debate

Geoffrey Irving 9mo

An alignment safety case sketch based on debate

Continuing with the Newtonian physics analogy, the case for optimism would be:

1. We have some theories with limited domain of applicability. Say, theory A.
2. Theory A is wrong at some limit, where it is replaced by theory B. Theory B is still wrong, but it has a larger domain of applicability.
3. We don't know theory B, and can't access it despite our best scalable oversight techniques, even though the AIs do figure out theory B. (This is the hard case: I think there are other cases where scalable oversight does work.)
4. However, we do have some purchase on the domain of applicability of theory A: we know the limits of where it's been... (read more)

Unexploitable search: blocking malicious use of free parameters

Jacob Pfau, Geoffrey Irving

9mo

Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.

Conceptual illustration using fabricated data. Left: Entropy penalties diversify trajectories but still yield high unobserved, malicious reward S. Right: exploitable search games are designed to bound probability of high S.

The exploitable search problem

We would like to be... (read 1511 more words →)

Replying to An alignment safety case sketch based on debate

Geoffrey Irving 9mo

An alignment safety case sketch based on debate

The Dodging systematic human errors in scalable oversight post is out, as you saw, so we can mostly take the conversation over there. But briefly, I think I'm mostly just more bullish on the margin than you about (1) the probability that we can in fact get purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at case (1), we're more likely to recognise it and try to get utility out of (2) on other questions.

Part of this is that... (read more)

Dodging systematic human errors in scalable oversight

Geoffrey Irving

9mo

Summary: Both our (UK AISI's) debate safety case sketch and Anthropic’s research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.

Not too many errors in unknown places

The complexity theory models of debate assume some expensive verifier machine $M$ with access to a human oracle, such that

1. If we ran $M$ in full, we’d get a safe answer
2. $M$ is too expensive to run in full, meaning we need some interactive proof protocol (something like debate) to skip steps

Typically, $M$ is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves of the tree. Key design... (read 1055 more words →)

An alignment safety case sketch based on debate

Marie_DB, Jacob Pfau, Benjamin Hilton, Geoffrey Irving

9mo

This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.

Executive summary

AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence).

TL;DR Debate + exploration guarantees + solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts.

This post sets out:

What debate can be used to achieve.
What gaps remain.
What research is needed to solve them.

These gaps form the basis for one of the research agendas of UK... (read 7459 more words →)

UK AISI’s Alignment Team: Research Agenda

Benjamin Hilton, Jacob Pfau, Marie_DB, Geoffrey Irving

9mo

The UK’s AI Security Institute published its research agenda yesterday. This post gives more details about how the Alignment Team is thinking about our agenda.

Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems which are autonomously pursuing a course of action which could lead to egregious harm, and which are not under human control. No known technical mitigations are reliable past AGI.

Our plan is to break down promising alignment agendas by developing safety case sketches. We'll use these sketches to identify specific holes and gaps in current approaches. We expect that many of these gaps can be formulated as well-defined subproblems within existing... (read 3024 more words →)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Tomek Korbak, Mikita Balesni, Buck, Geoffrey Irving

10mo

This is a linkpost accompanying a new paper by UK AI Security Institute, Apollo Research and Redwood Research. Please see the full paper for more details.

TLDR: Our new paper outlines how AI developers should adapt the methodology used in control evaluations as capabilities of LLM agents increase.

Figure: We sketch a trajectory of how control evaluations might evolve through increasingly powerful capability profiles. The trajectory is based on a sequence of five fictional models (M1-M5) with progressively advanced capabilities. For each model, we define an AI Control Level (ACL) based primarily on its threat model-specific capabilities. By mapping these capabilities to appropriate control evaluation methodologies, we provide a systematic approach for adapting safety measures... (read 569 more words →)

Prospects for Alignment Automation: Interpretability Case Study

Jacob Pfau, Geoffrey Irving

11mo

For human-level AI (HLAI) we will need robust control or alignment methods. Assuming short timelines to HLAI, the tractability of automating safety research becomes central. In this post, I will make the case that safety-relevant progress on automated interpretability R&D is likely; however, naive interpretability automation may only be usable on the subset of safety problems having well-specified objectives. My argument relies crucially on the possibility of automatically verifying interpretability progress. For other alignment directions (e.g. corrigibility, studying power-seeking, etc.) which do not admit automatic verification, it appears unjustified to assume automation within the same time-horizon in the absence of a clear argument for automation tractability. I am optimistic that further thinking... (read 2230 more words →)
