The UK’s AI Security Institute published its research agenda yesterday. This post gives more details about how we on the Alignment Team are thinking about our part of that agenda.
Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems that are autonomously pursuing a course of action which could lead to egregious harm, and that are not under human control. No known technical mitigations for these risks are reliable past AGI.
Our plan is to break down promising alignment agendas by developing safety case sketches. We'll use these sketches to identify specific holes and gaps in current approaches. We expect that many of these gaps can be formulated as well-defined subproblems within existing...
I'm definitely also worried about collusion between the debaters to deceive the judge! That's what we try to address with the exploration guarantees in the sketch. The thinking is: if a debater is, say, deliberately not pointing out a flaw in an argument, then there's an alternative strategy that would get the debater higher reward on the episode (i.e. pointing out the flaw). So if we can verify that there wouldn't be significant gains from further exploration (i.e. trying out more alternative strategies), that's some evidence against this kind of collusion. But of course, we're only gesturing at some potential ways you might get exploration guarantees - we don't know yet if...
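To make the "no significant gains from further exploration" idea a bit more concrete, here is a minimal sketch (my own illustration, not anything from the safety case sketch itself) of one way you might empirically estimate an exploration gap over a set of debate episodes. Everything here is hypothetical: `policy` is the debater's move at a given transcript prefix, `sample_alternatives` draws candidate alternative moves (e.g. from a perturbed or best-of-n policy), and `judge_reward` scores the completed argument from that debater's perspective.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExplorationReport:
    mean_gap: float              # average reward gain from the best alternative found
    fraction_exploitable: float  # fraction of episodes where an alternative beat the policy by > tol


def estimate_exploration_gap(
    episodes: List[str],
    policy: Callable[[str], str],
    sample_alternatives: Callable[[str, int], List[str]],
    judge_reward: Callable[[str, str], float],
    n_alternatives: int = 16,
    tol: float = 0.05,
) -> ExplorationReport:
    """Monte-Carlo check that no sampled alternative move would have earned
    the debater significantly more reward than the move it actually made.

    A small estimated gap is (weak) evidence against collusion of the form
    "deliberately not pointing out a flaw": if the policy were withholding
    the flaw, pointing it out should show up among the sampled alternatives
    and score higher with the judge.
    """
    gaps = []
    exploitable = 0
    for episode in episodes:
        chosen = policy(episode)                  # the move the debater actually makes
        baseline = judge_reward(episode, chosen)  # reward the judge assigns to that move
        alternatives = sample_alternatives(episode, n_alternatives)
        best_alt = max(
            (judge_reward(episode, alt) for alt in alternatives),
            default=baseline,
        )
        gap = max(0.0, best_alt - baseline)       # how much better the best alternative did
        gaps.append(gap)
        exploitable += gap > tol
    return ExplorationReport(
        mean_gap=sum(gaps) / max(len(gaps), 1),
        fraction_exploitable=exploitable / max(len(episodes), 1),
    )
```

The point of the sketch is just the shape of the check: collusion of the "withholding a flaw" kind should show up as a large gap on the episodes where the flaw could have been pointed out. A real exploration guarantee would need much stronger coverage over alternative strategies than naive sampling like this.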