Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(This post is part of a sequence that's meant to be read in order; see the preface.)

1. HCH and Ideal Debate

Recall from post #-2 that we have two perspectives on stock IDA.[1] One is that of a human with access to a model, the other is that of an HCH tree.[2] We can think of HCH as 'pure' or 'idealized' Factored Cognition that abstracts away implementation details,[3] and the training procedure as trying to implement this ideal.

One might now ask the following:

If HCH is the ideal of stock IDA, then what is the ideal of Debate? Since this is a sequence about Factored Cognition, we are primarily interested in analyzing the idealized versions, so this is the first question we'd like to answer.

Interlude on notation: throughout this sequence, we will need to refer to single statements, sequences of statements, and sets of statements. To make telling them apart as easy as possible, we follow the following norms:

  • single statements use lower case letters, like or or or
  • sequences of statements use uppercase bold letters, like or or
  • sets of statements use the 'pretty' S, like or

We'll use the first two of those in the following definition that will be crucial in our discussion of Debate. If is a statement and a sequence of statements such that , we say that is an explanation for and denote this by writing . In this setting, is what we call an implication statement: it precisely says that the statements imply the statement . Here's a made-up example where is the implication statement:

The purpose of having implication statements is to ensure that trivially follows from since the implication itself is among those statements (here ). If you dispute , you must dispute one of the .

Armed with this concept, we can define our idealization of the Debate scheme:

Ideal Debate
The input to the game is a question in English. The first agent begins by giving an answer plus an explanation[4] for the answer. At every subsequent step, the second agent points to one of the statements in the explanation, , and the first agent responds by either giving an explanation for or declaring that the debate is over. In the latter case, a judge attempts to verify that is true. If she succeeds, the first agent wins the debate; if not, the second agent wins the debate.[5] The first agent must end the game after a finite number of steps.
The debaters are maximally powerful agents; the judge is a human.

Just as HCH abstracts away implementation details of stock IDA, Ideal Debate abstracts away implementation details of Debate.[6]

Instead of having implication statements, one could have allowed the second agent to deny the fact of the implication, as a 'special move' of sorts. However, I think that would be a mistake; one of the take-away messages from this post is that there is no sharp line between implications and other statements. Both can be disputed and argued about further, which is why we're treating them as the same type of object. In the definition of Ideal Debate above, the second agent is always free to point to (which is precisely the implication statement), and if so, the game proceeds normally, i.e., the first agent has to provide an explanation for as the next move.

Below is an example of what an Ideal Debate transcript might look like.[7] Here, the implication statement is not shown (but this is just to make it easier to draw – it should really be there at every level!), and the statement we recurse into is always the one that the second agent has pointed to.

You can also look at the uncluttered version.

A clarification: statements are not strings. By itself, the string 'Now, we have [...]" cannot be judged since it contains symbols whose meaning is only defined in the remaining transcript. In general, any statement may require an arbitrary amount of additional context to be understood. This means the length of its string doesn't meaningfully indicate how complex a statement is.

Despite this, the complexity of may remain relatively low. This is due to the principle we've discussed in post #-1. The better the explanations chosen by the first agent are, the less complexity there will be in each part of the argument. Ideally, the judge will only have to deal with a small part of the entire argument to verify .[8]

2. Cognition Spaces

One of the things mathematicians like to do when studying a problem is to define the space that one is working in. For example, in machine learning, one usually decides on a space of possible models before beginning the search. Consequently, we would now like to define a space and say that HCH and Ideal Debate are about doing stuff within this space. I will call this a Cognition Space.

As mentioned above, I think it is a mistake to differentiate between 'facts' and 'implications'. The prime number transcript is an example of this: it follows from the axioms of set theory that there are infinitely many prime numbers, so technically the entire debate is only about implications, but they just feel like regular statements. For this reason, we will take 'statement' to be a primitive and define a Cognition Space to be a pair

where is a set of statements, and a function assigning each statement a difficulty. On this, several points.

  • What measures is the difficulty of verifying that a statement is true, not of understanding what is being said. For example, in the Ideal Debate transcript shown above, the difficulty of the statement “Given any set of finitely many prime numbers, one can construct at least one prime number that is not a member of this set.” is likely quite high, even though it's fairly easy to understand what the claim is.
  • Since we do not differentiate between implications and facts, implication statements are regular members of . Consequently, a Cognition Space determines which approaches to explaining statements are difficult and which are easy. For example, say you're the first agent in Ideal Debate and have to explain the root statement . You might see two ways of doing this, either via and , or via and and . In this case, not only do these five statements have each a difficulty, but the set also includes two implication statements that precisely say .” and “.”, respectively. All five of the and the two implication statements are assigned a difficulty by .
  • The ground truth here is entirely based on the human , which is either the human in HCH or the judge in Ideal Debate. contains all things that she would consider statements, and difficulty is measured by how hard it is for her to verify a statement. Thus, the human entirely determines the Cognition Space, and we've put her in the subscripts of both and to serve as a reminder of that.

We now turn to Ideal Debate in particular. Given the definitions above, we define a path through a Cognition Space to be a pair

where [9] and for each , we have that [10] and for some ; with being just the initial answer to the input question. (Recall that the fat letters denote sequences, and that the notation is short for ' is an explanation for '.)

An Ideal Debate transcript (such as the prime number one shown above) precisely visualizes one path through a Cognition Space.[11] To make sure you're following the formalism up to this point, here's an exercise.

EXERCISE (1-5 MINUTES): If we extend to paths, what is the correct definition for the difficulty of a path?

The difficulty should equal that of since that's the only statement the judge needs to verify.

3. Finding Explanations

Ideal Debate traverses a Cognition Space by having the two agents collectively choose a single path that starts at an answer to the input question and ends at some statement . If the first agent is honest (and we will assume so for this post), she will give a true initial statement (the answer to the input question) and will provide a true explanation for the previous statement at every step of the game. The second agent will, presumably, try to navigate to the most difficult part of the argument, hoping that the judge will fail to verify . (The second post will look into this a lot more.)

Conversely, a node in an HCH tree is initialized with a question, which she will attempt to answer while using subtrees to answer related questions. If she succeeds at this, then she must consider [the set of (question, answer) pairs she has exchanged with subtrees plus whatever cognitive work she's done herself] to be an explanation for the answer. (Otherwise, she risks returning an answer that isn't true.)

Thus, explanations[12] are key in both cases; however, the purpose of the explanation is different:

  • in Ideal Debate, the statement that is explained is already known, and the explanation is meant to demonstrate that it is true; whereas
  • in HCH, the statement to be explained is not known, and the purpose of the explanation is to derive a [statement that answers the input question].

Put differently, Ideal Debate (if the first agent is honest) is analogous professor deciding which statements most cleanly demonstrate that such-and-such is the answer to a particular question, whereas HCH is analogous to a student trying to figure out how to answer the question in the first place, and asking subquestions to help with that. The difference between the resulting decompositions will vary – we can imagine questions where it is nonexistent, such as

In this case, we may end up with transcripts that look like this or this (with the usual disclaimers about hidden elements); they're functionally identical. But it's not hard to imagine cases where the difference is substantial. In fact, you need only take the prime number transcript above to have an example; without knowing the proof, many people would not think to ask, 'given any set of finitely many prime numbers, can you use them to construct a new prime number?', which means that an HCH transcript for the same question would look differently.

Let's take a shot at formalizing this difference. Recall our cognition space . Given some statement , the space implicitly represents the tree of all possible explanations for . It looks something like this:

Note that the structure of the tree is entirely determined by the existing implication statements: for each implication statement , there is an edge from to in the tree.

In reality, there are likely far more than three explanations for , probably with more components, and then each component of each explanation has itself many more explanations, and the components of those have explanations as well, and so on. It becomes very intractable very quickly.

Formally, write to denote the set of all possible explanations in for that exist in . (This set has precisely as many members as there are implication statements of the form “.”) We can now define our requirement that Ideal Debate agents be 'maximally powerful' as the ability to search through all members of . In this setting, the first agent will pick the one that will lead to the easiest possible path, given the adversarial nature of the second agent. (Again, we will look into this more in the second post.)

HCH is much harder to formalize, but for now, we can crudely model the limitations that come with not knowing the answer as the ability to

  • only search through a subset of ; and
  • being able to derive the answer iff there is a sufficiently easy explanation on offer.

Here's another exercise to make sure the formalism is clear.

EXERCISE (3-8 MINUTES): Suppose a node in HCH succeeds in answering a question if she finds an explanation such that , where is the correct answer to the input question. Suppose further that she can search through a hundred elements (randomly chosen) in to find such an explanation. Come up with a toy example where this would likely lead to her failing to derive , even though a simple explanation does exist. To do this, define a full cognition space .[13] You can choose arbitrary values; they need not correspond to anything real.

  • .
  • for all other .

Increase to make it arbitrarily unlikely for the human to find an explanation.

The prime number example shows that there are real cases where the difference is significant, and the formalism agrees.

So – an HCH tree whose human only has high school knowledge about math would not immediately guess the most elegant proof. Fortunately, it doesn't have to. HCH has a massive computational budget, so if it goes in a fruitless direction first, that's still fine, as long as it finds a correct proof eventually. Would it do that?

Who knows.[14]

However, it certainly isn't obvious that deriving a mathematical proof has the same asymptotic difficulty as understanding it, which is what it means to say that Factored Cognition is guaranteed to either work for both stock IDA and Debate or neither.

We can thus end this post on our first conjecture:

Decompositions are an essential part of any Factored Cognition scheme, and changing how they are chosen is entirely allowed to change how the scheme scales to harder problems.

In the next post, we'll see how much we can do with the formalism. This will not be conclusive, which is why we will then switch gears and turn to the human component.

  1. I say 'stock IDA' to refer to any implementation of IDA where a human is doing the decomposition. There are possible implementations where an agent is doing the decomposition or where there is no decomposition at all (those implementations don't rely on Factored Cognition). In its most general form, IDA is merely a template of a training scheme prescribing that there be two procedures called Distill and Amplify, and under this view, just about every training scheme is technically a variant of IDA (any method that uses gradient descent becomes an instance of IDA if we set [Amplification] = [Gradient Descent step] and [Distillation] = [Identity Function]), which is why we won't talk much about IDA in general.

    Note that stock IDA still leaves the implementation of the distillation step open. ↩︎

  2. Note that HCH is technically not a fully defined scheme, but a class of schemes , where defines the human component (what human, what environment, etc.), is a parameter in that specifies a time limit, and defines the communication channel (what kind of messages, what length, etc.).

    Given these parameters, we can define a node semi-formally like so:

    A node is a human with context specified in , initialized with some question . It exists for time .

    During this time, it can spawn another node with some question arbitrarily often; whenever it does, it immediately receives the output from that node.

    By the end of time , it needs to provide an output, obeying conditions governed by parameter .

    (Here, the nodes it can spawn are the same type of object as itself.)

    Then, the entire scheme is simply a node initialized with the scheme's input question.

    This definition always yields a tree of infinite depth. It corresponds to what Paul calls weak-HCH. I talk more about why this sequence looks at weak-HCH rather than strong-HCH in a later post; it's related to the concepts discussed in Hiding Complexity. ↩︎

  3. Details that have been abstracted away include:

    • Inner Alignment concerns: in the real world, may have an objective other than trying to approximate even if it was trained to do that.
    • Bounded depth: for any , the model only approximates an HCH tree of depth , not an infinite one.
    • Computational limitations: a trained model can only approximate an exponentially large tree insofar as the tree's computations can be done more cheaply with better algorithms. (This is the part we won't ignore since it's a hard limitation.)
  4. It being an explanation is a precise requirement; recall the definition above. ↩︎

  5. Note that the complement of 'the judge succeeds in verifying that is true' is not 'the judge succeeds in verifying that is false'. If the judge is uncertain, the second agent also wins the debate. ↩︎

  6. Details that have been abstracted away include:

    • Again Inner Alignment concerns: in real Debate, the agents may have motives that go beyond winning any one debate game (like trying to cause more debates to happen in the future)
    • Ambiguity: in reality, the meaning of statements can shift due to ambiguous words. This problem is the motivation for cross-examination.
    • Wireheading: with the setup as-is, the first agent could subtly delude the judge rather than playing honestly, and the second agent can't prevent this since she has much more restricted output channels.
    • Weak Debaters: although Debate agents should eventually become very powerful (as long as the training signal is accurate, the only limit is the power of the best machine learning techniques), they don't start off that way, and the scheme has to work about even at that point.
  7. As an aside: this example probably illustrates that it won't be the case that every human is competent enough to judge Debate transcripts. For example, it requires some degree of familiarity with mathematical notation and some competence in logical thinking. (And they need to understand English.) The relevant question is how the difficulty scales with the complexity of the question. ↩︎

  8. There are some striking examples of this principle in action in more difficult mathematical proofs. I may at some point dedicate a post to illustrating this. ↩︎

  9. Here, we're abusing the notation to mean ' appears in the sequence ', which is technically different from set membership. ↩︎

  10. The symbol denotes the set of all sequences of statements in . (This use of the asterisk is standard.) ↩︎

  11. Geoffrey Irving (inventor of the Debate scheme) said something functionally similar months ago on the AI alignment podcast: "A single debate transcript, in some sense, corresponds to a single path through the tree of amplification." ↩︎

  12. Usually, people talk about decompositions rather than explanations. As far as I'm concerned, they're synonyms: both terms refer to the set of substatements given by a debate agent/the set of (question, answer) pairs an HCH node exchanges with subtrees. I'm talking about explanations to emphasize the fact that they imply a specific statement. ↩︎

  13. Note that you don't need to define what the non-implication statements are, it's enough to postulate that they exist. You do need to define the implication statements since those determine which sequences are explanations. As an example, the following:

    • imply .”

    perfectly defines a Cognition Space. However, this is not a solution to the exercise since only has a single element. ↩︎

  14. This is a good time to mention that Ought may or may not be studying questions similar to this one. ↩︎

New Comment
6 comments, sorted by Click to highlight new comments since:

The breakdown from natural language to logic seems incomplete in respect to the level of mechanism required. In the S4 example there are pharses like S1:"this topic", S2:"those five", S3:"conversely" which do not make sense when read only on their own but refer to shared objects. The natural language is also strctured to be claim-evidence when logic structure is axioms-theorem.

Part of the point is that in the debate only the last "stop claim" needs to be checked by the judge. If I try to read only "is judged" arrowed part I can't judge it. What is y, a, b or a1, b1, an, bn? If the letters broken down into number subscripts is supposed to represent a unselected explanation it kinda fails as I have to refer to it to get subscript terms defined. Y is defined two levels up. This is fine if I have been following the generation of the tree but not if I want to have accessed only as few nodes as possible.

 Feels handwavy where the reader is left to make it technically competent if they are interested to do so.

Good observation. is not the same as the string that represents it in the transcript (the "Now we have [...]" in the example). I can see how the way I've written the post suggests that it is; this is a mistake.

(and statements in general) can depend on other concepts in the transcript. If cross-examination (which I think does have a place in Ideal Debate) is allowed, they can even depend on concepts that haven't been introduced in the transcript, and which the first agent only defines after the second agent pointed to . E.g., the first agent could just postulate that a set exists without saying anything about its properties until they matter for the argument. This is fine as long as there is a mechanism that forces the first agent to be consistent (which is exactly what cross-examination does).

In general, statements are not strings. The same string can map to several different statements if the words refer to different things, and the length of the corresponding string doesn't meaningfully restrict the actual complexity. I think of ambiguous words much like of concepts that haven't yet been defined. This is what I was getting at by saying that ambiguity has been abstracted away.

What this does suggest in the formalism as-is is that the difficulty (in terms of ) of statements will tend to go up if they are later in the debate since they depend on more prior concepts. (Does that mean the judge effectively has to deal with the entire problem after all? Absolutely not; the key move the first agent can pull over and over again is to hide most of the complexity of a concept. The second agent can only question one claim, so whenever she accepts that a certain thing exists, the first agent gets away with only defining the properties of that thing that are relevant for . E.g., in a more complicated mathematical proof, the first agent may say something like 'there is a function with properties and ', and if the second agent chooses not to doubt that this function exists, then the first agent gets away with never talking about how the function was constructed, which may exclude pages of work. This is why I've written post #-1.)

I've previously had the vague plan to make an additional post about this stuff after the main sequence. Maybe instead, what I've just said here should be in this post, not sure yet. I'll definitely change something to make it clear that . [Edit: for now, I added two paragraphs at the end of chapter 1.]

I've so far not thought about extending the formalism to model this explicitly. It's not an unreasonable idea, but my current sense is that it doesn't need to be included. Maybe I'll change my mind about this.

I like the idea of cognition space, but I'm not sure how I'm supposed to picture HCH in this context. Is it just a tree which never reaches too difficult explanations, and keeps going until (if) it reaches a simple enough explanation?

Yes. I don't think the formalism can say much about this process because it's so dependent on the human (that's why the second post is almost only about Debate), but that's the right picture.

Maybe you said it somewhere, but where do definitions of new terms and concepts fit in? Would that be a type of statement?

Not part of the model; see my reply to Slider. The assumption is that the first agent has to be consistent ("problem of ambiguity is ignored"), and the final statement (or any other statement) is not equal to the string but to the real claim that talks about however many concepts.