UK AISI Alignment Team: Debate Sequence

May 07, 2025 by Benjamin Hilton

The UK AI Security Institute's Alignment Team focuses on research aimed at reducing safety and security risks from AI systems that autonomously pursue courses of action which could lead to egregious harm, and which are not under human control.

This sequence examines our initial focus: using scalable oversight to train honest AI systems, combining theory about training equilibria with empirical evidence about training outcomes.
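For readers new to debate as a scalable oversight method: the core idea is that two AI debaters argue opposing sides of a question over several rounds, and a weaker but trusted judge (e.g. a human) picks the more convincing answer from the transcript alone. As a rough illustration only, here is a minimal sketch of that basic loop in Python; the `Turn`, `run_debate`, and callback names are hypothetical, and the protocols discussed in this sequence (such as prover-estimator debate) differ from this classic setup in important details.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    debater: str   # "A" or "B"
    argument: str  # natural-language claim or rebuttal

# Hypothetical type aliases: any callables with these signatures would do,
# e.g. thin wrappers around language model API calls.
DebaterFn = Callable[[str, List[Turn]], str]
JudgeFn = Callable[[str, List[Turn]], str]

def run_debate(
    question: str,
    debater_a: DebaterFn,
    debater_b: DebaterFn,
    judge: JudgeFn,
    num_rounds: int = 3,
) -> str:
    """Run alternating argument rounds, then ask the weaker, trusted
    judge to pick a winner ("A" or "B") from the transcript alone."""
    transcript: List[Turn] = []
    for _ in range(num_rounds):
        transcript.append(Turn("A", debater_a(question, transcript)))
        transcript.append(Turn("B", debater_b(question, transcript)))
    return judge(question, transcript)
```

In the original debate proposal, training proceeds by self-play: the debaters are optimised to win the judge's verdict, with the hope that honest, verifiable arguments form the winning strategy at equilibrium.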

Posts in this sequence:

1. UK AISI's Alignment Team: Research Agenda (Benjamin Hilton, Jacob Pfau, Marie_DB, Geoffrey Irving)
2. An alignment safety case sketch based on debate (Marie_DB, Jacob Pfau, Benjamin Hilton, Geoffrey Irving)
3. Dodging systematic human errors in scalable oversight (Geoffrey Irving)
4. Unexploitable search: blocking malicious use of free parameters (Jacob Pfau, Geoffrey Irving)
5. Prover-Estimator Debate: A New Scalable Oversight Protocol (Jonah Brown-Cohen, Geoffrey Irving)
6. The need to relativise in debate (Geoffrey Irving, Simon Marshall)