3: What's the current state of this research?

zhukeepa

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

3.1: What are the current major open problems in Paul’s agenda?

The most important open problems in Paul’s agenda, according to Paul:

Worst-case guarantees: How can we make worst-case guarantees about the reliability and security of our assistants? For example, how can we ensure our oversight is reliable enough to prevent the creation of subversive subagents (a.k.a. daemons) in the distillation process that cause our overall agent to be subversive?
Transparent cognition: How can we extract useful information from ML systems’ cognition? (E.g. what concepts are represented in them, what logical facts are embedded in them, and what statistical regularities about the data it captures.)
Formalizing corrigibility: Can we formalize corrigibility to the point that we can create agents that are knowably robustly corrigible? For example, could we formalize corrigibility, use that formalization to prove the existence of a broad basin of corrigibility, and then prove that ML systems past some low threshold will land and stay in this basin?
Aligned capability amplification: Can we perform amplification in a way that doesn’t introduce alignment failures? In particular, can we safely decompose every task we care about without effectively implementing an aligned AGI built out of human transistors?

(See: Two guarantees, The informed oversight problem, Corrigibility, and the “Low Bandwidth Overseer” section of William Saunder's post Understanding Iterated Distillation and Amplification: Claims and Oversight.)

3.2: How close to completion is Paul’s research agenda?

Not very close. For all we know, these problems might be extraordinarily difficult. For example, a subproblem of “transparent cognition” is “how can humans understand what goes on inside neural nets”, which is a broad open question in ML. Subproblems of “worst-case guarantees” include ensuring that ML systems are robust to distributional shift and adversarial inputs, which are also broad open questions in ML, and which might require substantial progress on MIRI-style research to articulate and prove formal bounds. And getting a formalization of corrigibility might require formalizing aspects of good reasoning (like calibration about uncertainty), which might in turn require substantial progress on MIRI-style research.

I think people commonly conflate “Paul has a safety agenda he feels optimistic about” with “Paul thinks he has a solution to AI alignment”. Paul in fact feels optimistic about these problems getting solved well enough for his agenda to work, but does notconsider his research agenda anything close to complete.

(See: Universality and security amplification, search “MIRI”)

LESSWRONG
LW

3: What's the current state of this research?

6

Ω 3

6

Ω 3