
This post is sort of an intermediate between parts 1 and 2 of the sequence. It makes three points that I think people tend to get wrong.

1. Factored Cognition is about reducing hard problems to human judgment to achieve outer alignment.

It's possible to lose sight of why Factored Cognition is employed in the first place. In particular, it is not employed as a way to boost capability: while the amplification step in IDA is implemented via Factored Cognition, the purpose of this is alignment. IDA would be more straight-forward (and more similar to AlphaZero) if each $M_{n+1}$ were trained to approximate a search-amplified $M_n$ rather than $\text{Amplify}(H, M_n)$, i.e., a human consulting $M_n$.
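
To make the contrast concrete, here is a toy, runnable sketch of the IDA loop. Everything in it is my own illustration rather than any actual implementation: "questions" are nested sums, the "human" can only perform a single addition per step but can decompose nested questions, and "distillation" is plain memoization. The point is only where the human sits: the distillation target is a human consulting $M_n$, whereas an AlphaZero-like scheme would amplify $M_n$ with search alone.

```python
# Toy IDA loop (illustrative only). Questions are nested sums such as
# ('+', 1, ('+', 2, 3)); the "human" can do one addition per step but can
# decompose; "distillation" is memoization into a lookup table.

def model_answer(model, question):
    """M_n as a lookup table; it answers 0 (i.e. badly) off-distribution."""
    if isinstance(question, int):
        return question
    return model.get(question, 0)

def amplify_with_human(model, question):
    """Amplify(H, M_n): the human decomposes the question once and delegates
    the subquestions to M_n. This human step is what carries the alignment;
    an AlphaZero-style scheme would instead amplify M_n with search alone."""
    if isinstance(question, int):
        return question
    _, a, b = question                          # human decomposes ('+', a, b)
    return model_answer(model, a) + model_answer(model, b)  # human recombines

def ida_step(model, training_questions):
    """Distill Amplify(H, M_n) into M_{n+1} (here: memoize its answers)."""
    return {q: amplify_with_human(model, q) for q in training_questions}

questions = [('+', 2, 3), ('+', 1, ('+', 2, 3))]
model = {}                                      # M_0 knows nothing beyond leaves
for _ in range(3):
    model = ida_step(model, questions)
print(model[('+', 1, ('+', 2, 3))])             # 6 once the depth is covered
```

Note that each iteration only extends competence by one level of decomposition depth, which is exactly the inductive structure the next paragraph relies on.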

Recall from post #-2 that I've framed the entire problem of AI risk as 'once systems become too capable, it gets difficult for a human to provide a training signal or training data'. There are many approaches to getting aligned systems anyway: there's ambitious value learning, there's trying to develop a new framework, there's impact measures, there's norm following, there's avoiding agents altogether, et cetera et cetera. But another option is to reduce the difficult task of providing the training signal to human judgment (and to do so separately for each instance), which is what IDA and Debate are trying to do. This is what Factored Cognition is used for. In a previous version of this post, I dubbed the approach 'narrow alignment proving', since the system repeatedly 'proves' that it gives true answers. In any case, the proper way to view Factored Cognition is as a tool to achieve outer alignment.

In stock IDA, this corresponds to the fact that we get outer alignment of each $M_n$ by induction, precisely because each amplification step is implemented via Factored Cognition. In Debate, this corresponds to the debate game itself. Take out everything after the first statement by the first agent, and you get precisely the classical Oracle AI setup.

2. Debate ≈ IDA + decomposition Oracle.

My impression has been that some people view IDA and Debate as quite different. They can get a handle on IDA, perhaps because of its similarity to existing Machine Learning techniques, but having two agents debate each other sounds exotic.

However, I think the proper way to view Debate, especially in the limit, is simply as stock IDA plus a decomposition Oracle. If you consider an HCH tree solving a problem, say deriving a math proof, you get a transcript that corresponds to a Debate Tree, the formal object I've introduced in post #2. Granted, it's going to be a much larger, less elegant Debate Tree than the one Debate itself would produce. But add a decomposition Oracle (that all nodes in the HCH tree can use), and the two become almost identical.

In Debate, only one path of the tree really occurs when the scheme is executed; the rest remains implicit. But this is also analogous to IDA, where only the top node really occurs. Both schemes have the human look at a tiny part of the Cognition Space (although they do differ on what part it is).

I think it is almost fair to say that Ideal Debate $=$ HCH $+$ decomposition Oracle. In the concrete schemes, the $=$ becomes an $\approx$ since the implementation challenges are different.
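
Here is a small, self-contained sketch of this correspondence (again my own toy, with made-up helper names like `hch_with_oracle` and `debate_path`): an HCH node equipped with a decomposition oracle produces a full tree of statements, essentially a Debate Tree, while the Debate-style traversal only ever makes one root-to-leaf path of that tree explicit.

```python
# Toy correspondence between HCH + decomposition oracle and Debate
# (my own illustration; all names are made up for this sketch).

from dataclasses import dataclass
from typing import Any, List

@dataclass
class Node:
    statement: Any
    children: List["Node"]

def hch_with_oracle(question, decompose, solve, depth):
    """HCH in which every node can call the decomposition oracle `decompose`.
    `solve(question, subanswers)` is the human's own, non-decomposed step.
    The full transcript is a tree of statements -- essentially a Debate Tree."""
    subquestions = decompose(question) if depth > 0 else []
    children = [hch_with_oracle(q, decompose, solve, depth - 1) for q in subquestions]
    return Node(solve(question, [c.statement for c in children]), children)

def debate_path(tree, choose_child):
    """Debate: the opposing agent picks which child to challenge, so only one
    root-to-leaf path of the (implicit) tree ever becomes explicit."""
    path = [tree.statement]
    while tree.children:
        tree = choose_child(tree.children)
        path.append(tree.statement)
    return path

# Usage with a trivial oracle that splits "a+b+..." questions:
tree = hch_with_oracle(
    "1+2+3",
    decompose=lambda q: q.split("+", 1) if "+" in q else [],
    solve=lambda q, subs: sum(subs) if subs else int(q),
    depth=3,
)
print(tree.statement)                                      # 6
print(debate_path(tree, choose_child=lambda cs: cs[-1]))   # [6, 5, 3]
```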

3. The way to evaluate the feasibility of Factored Cognition is to look at the task of the human.

Factored Cognition can seem hard to get a grasp on if it is viewed as an infinite process of decomposing a problem further and further, especially if we start to consider meta questions, where the task of decomposing a problem is itself decomposed.

However, in both schemes, Factored Cognition includes a step that is by definition non-decomposable. In Ideal Debate, this step is judging the final statement. In HCH, it is solving a problem given access to the subtrees. This step is also entirely internal to the human. The human has to

  • find an answer and its explanation (in HCH only; this is done by the scheme in Ideal Debate)
  • verify that the explanation is correct[1]

Note that this is in line with the formalism from posts #1 and #2: statements have difficulties, and at some point, the judge needs to verify one. This works iff the difficulty isn't too high. The statement to be verified cannot be decomposed further because of the way it was chosen (if it could, the first agent would have done so, and the judge wouldn't have to verify it).
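
Spelling this out in symbols (writing $d(\cdot)$ for difficulty and $d_H$ for the highest difficulty the judge can handle is my shorthand, not necessarily the sequence's exact notation): the judge can do the verification iff

$$d\big(s_1 \wedge \dots \wedge s_n \Rightarrow s\big) \;\le\; d_H.$$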

I think a good way to think about the question 'is HCH capable of solving hard problems?' is to take the task 'solve a problem given access to an oracle that can solve slightly easier problems' and consider its difficulty $f(d)$ as a function of the difficulty $d$ of the input problem. Then ask: how fast does $f(d)$ grow as a function of $d$?

  • If $f(d) = \Theta(d)$,[2] HCH doesn't work. As the tree grows larger, the job of the nodes high up in the tree becomes more difficult, but the nodes in the tree have constant time budgets. In particular, solving a problem at a node high up in the tree has the same asymptotic difficulty as solving it without using the subtrees.
  • If $f(d) = O(1)$, there are instances of HCH that can solve arbitrarily hard problems.
  • If $f(d)$ grows without bound but sublinearly (i.e., $f = \omega(1)$ and $f = o(d)$), no instance of HCH with fixed parameters can solve arbitrarily hard problems, but IDA may still be performance competitive.

The same framing also works for Debate, where $f(d)$ is the difficulty of judging the final statement, and $d$ is the complexity of the initial answer to the input question. (A toy numerical illustration of the three regimes follows below.)
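
As a purely numerical toy (my construction, not the post's: integer difficulties, a fixed per-node budget, and the rule that a node handles difficulty $d$ iff its own work $f(d)$ fits in the budget and its subtrees handle difficulty $d-1$), the three regimes look like this:

```python
import math

def max_solvable(f, budget, cap=10**6):
    """Largest difficulty an HCH tree with per-node budget `budget` can handle,
    under the toy rule: a node solves difficulty d iff its own non-decomposed
    work f(d) fits in the budget and its subtrees can handle difficulty d - 1.
    Returning `cap` means the budget was never exceeded (unbounded in the toy)."""
    d = 0
    while d < cap and f(d + 1) <= budget:
        d += 1
    return d

cases = {
    "f(d) = d          (Theta(d))": lambda d: d,
    "f(d) = 3          (O(1))":     lambda d: 3,
    "f(d) = log2(d+1)  (sublinear)": lambda d: math.log2(d + 1),
}
for name, f in cases.items():
    print(name, "->", [max_solvable(f, b) for b in (4, 8, 16)])
# Theta(d): the cap equals the budget -- the tree adds nothing.
# O(1):     every budget >= 3 hits the cap -- arbitrarily hard problems.
# sublinear: every fixed budget caps out, but the cap grows quickly with it.
```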


  1. It's also worth pointing out that the second step is the only part where mistakes can come in. In both idealized schemes, correctness of Factored Cognition comes down to a human verifying whether or not an implication of the form $s_1 \wedge \dots \wedge s_n \Rightarrow s$, what we've called an explanation, is valid. ↩︎

  2. The notation $f = \Theta(g)$ is defined as $f = O(g) \wedge g = O(f)$ and means that $f$ and $g$ grow asymptotically equally fast. ↩︎

Comments (2)

HCH could implement the decomposition oracle by searching over the space of all possible decompositions (it would just be quite expensive).

I agree for literal HCH. However, I think that falls under brute force, which is the one thing that HCH isn't 'allowed' to do because it can't be emulated. I think I say this somewhere in a footnote.