Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Idealized Factored Cognition

7Slider

2Rafael Harth

4adamShimi

2Rafael Harth

4Steven Byrnes

4Rafael Harth

New Comment

6 comments, sorted by Click to highlight new comments since: Today at 4:12 PM

The breakdown from natural language to logic seems incomplete in respect to the level of mechanism required. In the S4 example there are pharses like S1:"this topic", S2:"those five", S3:"conversely" which do not make sense when read only on their own but refer to shared objects. The natural language is also strctured to be claim-evidence when logic structure is axioms-theorem.

Part of the point is that in the debate only the last "stop claim" needs to be checked by the judge. If I try to read only "is judged" arrowed part I can't judge it. What is y, a, b or a1, b1, an, bn? If the letters broken down into number subscripts is supposed to represent a unselected explanation it kinda fails as I have to refer to it to get subscript terms defined. Y is defined two levels up. This is fine if I have been following the generation of the tree but not if I want to have accessed only as few nodes as possible.

Feels handwavy where the reader is left to make it technically competent if they are interested to do so.

Good observation. is *not* the same as the string that represents it in the transcript (the "Now we have [...]" in the example). I can see how the way I've written the post suggests that it is; this is a mistake.

(and statements in general) can depend on other concepts in the transcript. If cross-examination (which I think does have a place in Ideal Debate) is allowed, they can even depend on concepts that *haven't* been introduced in the transcript, and which the first agent only defines after the second agent pointed to . E.g., the first agent could just postulate that a set exists without saying anything about its properties until they matter for the argument. This is fine as long as there is a mechanism that forces the first agent to be consistent (which is exactly what cross-examination does).

In general, statements are not strings. The same string can map to several different statements if the words refer to different things, and the length of the corresponding string doesn't meaningfully restrict the actual complexity. I think of ambiguous words much like of concepts that haven't yet been defined. This is what I was getting at by saying that ambiguity has been abstracted away.

What this does suggest in the formalism as-is is that the difficulty (in terms of ) of statements will tend to go up if they are later in the debate since they depend on more prior concepts. (Does that mean the judge effectively has to deal with the entire problem after all? Absolutely not; the key move the first agent can pull over and over again is to hide most of the complexity of a concept. The second agent can only question one claim, so whenever she accepts that a certain thing exists, the first agent gets away with only defining the properties of that thing that are relevant for . E.g., in a more complicated mathematical proof, the first agent may say something like 'there is a function with properties and ', and if the second agent chooses not to doubt that this function exists, then the first agent gets away with never talking about how the function was constructed, which may exclude pages of work. This is why I've written post #-1.)

I've previously had the vague plan to make an additional post about this stuff after the main sequence. Maybe instead, what I've just said here should be in this post, not sure yet. I'll definitely change something to make it clear that . [Edit: for now, I added two paragraphs at the end of chapter 1.]

I've so far not thought about extending the formalism to model this explicitly. It's not an unreasonable idea, but my current sense is that it doesn't need to be included. Maybe I'll change my mind about this.

(This post is part of a sequence that's meant to be read in order; see the preface.)## 1. HCH and Ideal Debate

Recall from post #-2 that we have two perspectives on stock IDA.

^{[1]}One is that of a human with access to a model, the other is that of an HCH tree.^{[2]}We can think of HCH as 'pure' or 'idealized' Factored Cognition that abstracts away implementation details,^{[3]}and the training procedure as trying to implement this ideal.One might now ask the following:

If HCH is the ideal of stock IDA, then what is the ideal of Debate? Since this is a sequence about Factored Cognition, we are primarily interested in analyzing the idealized versions, so this is the first question we'd like to answer.

Interlude on notation: throughout this sequence, we will need to refer to single statements, sequences of statements, and sets of statements. To make telling them apart as easy as possible, we follow the following norms:

We'll use the first two of those in the following definition that will be crucial in our discussion of Debate. If s is a statement and S=(s1,...,sn+1) a sequence of statements such that sn+1=“(s1,...,sn) imply s.”, we say that S is an

for s and denote this by writing Se→s. In this setting, sn+1 is what we call anexplanation: it precisely says that the statements (s1,...,sn) imply the statement sn+1. Here's a made-up example where s4 is the implication statement:implication statementThe purpose of having implication statements is to ensure that s

triviallyfollows from s1,s2,s3,s4 since the implication itself is among those statements (here s4). If you dispute s, you must dispute one of the si.Armed with this concept, we can define our idealization of the Debate scheme:

^{[4]}(s1,...,sn+1) for the answer. At every subsequent step, the second agent points to one of the statements sj in the explanation, j∈{1,...,n+1}, and the first agent responds by either giving an explanation for sj or declaring that the debate is over. In the latter case, a judge attempts to verify that sj is true. If she succeeds, the first agent wins the debate; if not, the second agent wins the debate.^{[5]}The first agent must end the game after a finite number of steps.Just as HCH abstracts away implementation details of stock IDA, Ideal Debate abstracts away implementation details of Debate.

^{[6]}Instead of having implication statements, one could have allowed the second agent to deny the fact of the implication, as a 'special move' of sorts. However, I think that would be a mistake; one of the take-away messages from this post is that there is no sharp line between implications and other statements. Both can be disputed and argued about further, which is why we're treating them as the same type of object. In the definition of Ideal Debate above, the second agent is always free to point to sn+1 (which is precisely the implication statement), and if so, the game proceeds normally, i.e., the first agent has to provide an explanation for sn+1 as the next move.

Below is an example of what an Ideal Debate transcript might look like.

^{[7]}Here, the implication statement is not shown (but this is just to make it easier to draw – it should really be there at every level!), and the statement we recurse into is always the one that the second agent has pointed to.You can also look at the uncluttered version.

A clarification:statements are not strings. By itself, the string 'Now, we have y=a⋅b [...]" cannot be judged since it contains symbols whose meaning is only defined in the remaining transcript. In general, any statement may require an arbitrary amount of additional context to be understood. This means the length of its string doesn't meaningfully indicate how complex a statement is.Despite this, the complexity of sfinal may remain relatively low. This is due to the principle we've discussed in post #-1. The better the explanations chosen by the first agent are, the less complexity there will be in each part of the argument. Ideally, the judge will only have to deal with a small part of the entire argument to verify sfinal.

^{[8]}## 2. Cognition Spaces

One of the things mathematicians like to do when studying a problem is to define the space that one is working in. For example, in machine learning, one usually decides on a space of possible models before beginning the search. Consequently, we would now like to define a space and say that HCH and Ideal Debate are about doing stuff within this space. I will call this a

.Cognition SpaceAs mentioned above, I think it is a mistake to differentiate between 'facts' and 'implications'. The prime number transcript is an example of this: it follows from the axioms of set theory that there are infinitely many prime numbers, so technically the entire debate is only about implications, but they just feel like regular statements. For this reason, we will take 'statement' to be a primitive and define a Cognition Space to be a pair

(Sh,dh)

where Sh is a set of statements, and dh:Sh→R+ a function assigning each statement a difficulty. On this, several points.

notof understanding what is being said. For example, in the Ideal Debate transcript shown above, the difficulty of the statement “Given any set of finitely many prime numbers, one can construct at least one prime number that is not a member of this set.” is likely quite high, even though it's fairly easy to understand what the claimis.for herto verify a statement. Thus, the human entirely determines the Cognition Space, and we've put her in the subscripts of both Sh and dh to serve as a reminder of that.We now turn to Ideal Debate in particular. Given the definitions above, we define a

to be a pairpath through a Cognition Space (Sh,dh)(((s0),S1,...,Sn),sfinal)

where sfinal∈Sn

^{[9]}and for each j∈{1,...,n}, we have that Sj∈S∗h^{[10]}and Sje→s for some s∈Sj−1; with S0:=(s0) being just the initial answer to the input question. (Recall that the fat letters denotesequences, and that the notation Sje→s is short for 'Sj is an explanation for s'.)An Ideal Debate transcript (such as the prime number one shown above) precisely visualizes one path through a Cognition Space.

^{[11]}To make sure you're following the formalism up to this point, here's an exercise.EXERCISE (1-5 MINUTES):If we extend dh to paths, what is the correct definition for the difficulty of a path?The difficulty should equal that of sfinal since that's the only statement the judge needs to verify.

## 3. Finding Explanations

Ideal Debate traverses a Cognition Space by having the two agents collectively choose a single path that starts at an answer s0 to the input question and ends at some statement sfinal. If the first agent is honest (and we will assume so for this post), she will give a true initial statement (the answer to the input question) and will provide a true explanation for the previous statement at every step of the game. The second agent will, presumably, try to navigate to the most difficult part of the argument, hoping that the judge will fail to verify sfinal. (The second post will look into this a lot more.)

Conversely, a node in an HCH tree is initialized with a question, which she will attempt to answer while using subtrees to answer related questions. If she succeeds at this, then she must consider [the set of (question, answer) pairs she has exchanged with subtrees plus whatever cognitive work she's done herself] to be an explanation for the answer. (Otherwise, she risks returning an answer that isn't true.)

Thus, explanations

^{[12]}are key in both cases; however, thepurposeof the explanation is different:demonstratethat it is true; whereasderivea [statement that answers the input question].Put differently, Ideal Debate (if the first agent is honest) is analogous professor deciding which statements most cleanly demonstrate that such-and-such is the answer to a particular question, whereas HCH is analogous to a student trying to figure out how to answer the question in the first place, and asking subquestions to help with that. The difference between the resulting decompositions will vary – we can imagine questions where it is nonexistent, such as

Q:=“What is 987⋅123?"

In this case, we may end up with transcripts that look like this or this (with the usual disclaimers about hidden elements); they're functionally identical. But it's not hard to imagine cases where the difference is substantial. In fact, you need only take the prime number transcript above to have an example; without knowing the proof, many people would not think to ask, 'given any set of finitely many prime numbers, can you use them to construct a new prime number?', which means that an HCH transcript for the same question would look differently.

Let's take a shot at formalizing this difference. Recall our cognition space (Sh,dh). Given some statement s∗, the space implicitly represents the

tree of all possible explanationsfor s∗. It looks something like this:Note that the structure of the tree is entirely determined by the existing implication statements: for each implication statement s′=“(sj,sk) imply s.”∈STh, there is an edge from (sj,sk,s′) to s in the tree.

In reality, there are likely far more than three explanations for s, probably with more components, and then each component of each explanation has itself many more explanations, and the components of those have explanations as well, and so on. It becomes very intractable very quickly.

Formally, write Explanations(s) to denote the set of all possible explanations in for s that exist in dh. (This set has precisely as many members as there are implication statements of the form “⟨arbitrary sequence⟩ imply s.”) We can now define our requirement that Ideal Debate agents be 'maximally powerful' as the ability to search through all members of Explanations(s). In this setting, the first agent will pick the one that will lead to the easiest possible path, given the adversarial nature of the second agent. (Again, we will look into this more in the second post.)

HCH is much harder to formalize, but for now, we can crudely model the limitations that come with not knowing the answer as the ability to

Here's another exercise to make sure the formalism is clear.

EXERCISE (3-8 MINUTES):Suppose a node in HCH succeeds in answering a question if she finds an explanation (s1,...,sn+1)∈Explanations(s0) such that ∑n+1j=1dh(sj)≤100, where s0 is the correct answer to the input question. Suppose further that she can search through a hundred elements (randomly chosen) in Explanations(s0) to find such an explanation. Come up with a toy example where this would likely lead to her failing to derive s0, even though a simple explanation does exist. To do this, define a full cognition space (Sh,dh).^{[13]}You can choose arbitrary values; they need not correspond to anything real.Increase n to make it arbitrarily unlikely for the human to find an explanation.

The prime number example shows that there are real cases where the difference is significant, and the formalism agrees.

So – an HCH tree whose human only has high school knowledge about math would not immediately guess the most elegant proof. Fortunately, it doesn't have to. HCH has a massive computational budget, so if it goes in a fruitless direction first, that's still fine, as long as it finds a correct proof eventually. Would it do that?

Who knows.

^{[14]}However, it certainly isn't

obviousthatderivinga mathematical proof has the same asymptotic difficulty asunderstandingit, which is what it means to say that Factored Cognition is guaranteed to either work for both stock IDA and Debate or neither.We can thus end this post on our first conjecture:

Decompositions are an essential part of any Factored Cognition scheme, and changing how they are chosen is entirely allowed to change how the scheme scales to harder problems.

In the next post, we'll see how much we can do with the formalism. This will not be conclusive, which is why we will then switch gears and turn to the human component.

I say 'stock IDA' to refer to any implementation of IDA where a human is doing the decomposition. There are possible implementations where an agent is doing the decomposition or where there is no decomposition at all (those implementations don't rely on Factored Cognition). In its most general form, IDA is merely a template of a training scheme prescribing that there be two procedures called

andDistill, and under this view, just about every training scheme is technically a variant of IDA (any method that uses gradient descent becomes an instance of IDA if we set [Amplification] = [Gradient Descent step] and [Distillation] = [Identity Function]), which is why we won't talk much about IDA in general.AmplifyNote that stock IDA still leaves the implementation of the distillation step open. ↩︎

Note that HCH is technically not a fully defined scheme, but a class of schemes {HCHh,t,ℓ}, where h defines the human component (what human, what environment, etc.), t is a parameter in R+ that specifies a time limit, and ℓ defines the communication channel (what kind of messages, what length, etc.).

Given these parameters, we can define a

semi-formally like so:nodeA node is a human with context specified in h, initialized with some question q. It exists for time t.During this time, it can spawn another node with some question q′ arbitrarily often; whenever it does, it immediately receives the output from that node.By the end of time t, it needs to provide an output, obeying conditions governed by parameter ℓ.(Here, the nodes it can spawn are the same type of object as itself.)

Then, the entire scheme is simply a node initialized with the scheme's input question.

This definition always yields a tree of infinite depth. It corresponds to what Paul calls weak-HCH. I talk more about why this sequence looks at weak-HCH rather than strong-HCH in a later post; it's related to the concepts discussed in Hiding Complexity. ↩︎

Details that have been abstracted away include:

It being an explanation is a precise requirement; recall the definition above. ↩︎

Note that the complement of 'the judge succeeds in verifying that sj is true' is not 'the judge succeeds in verifying that sj is false'. If the judge is uncertain, the second agent also wins the debate. ↩︎

Details that have been abstracted away include:

As an aside: this example probably illustrates that it won't be the case that

everyhuman is competent enough to judge Debate transcripts. For example, it requires some degree of familiarity with mathematical notation and some competence in logical thinking. (And they need to understand English.) The relevant question is how the difficulty scales with the complexity of the question. ↩︎There are some striking examples of this principle in action in more difficult mathematical proofs. I may at some point dedicate a post to illustrating this. ↩︎

Here, we're abusing the s∈S notation to mean 's appears in the sequence S', which is technically different from set membership. ↩︎

The symbol S∗h denotes the

set of all sequencesof statements in Sh. (This use of the asterisk is standard.) ↩︎Geoffrey Irving (inventor of the Debate scheme) said something functionally similar months ago on the AI alignment podcast: "A single debate transcript, in some sense, corresponds to a single path through the tree of amplification." ↩︎

Usually, people talk about

rather than explanations. As far as I'm concerned, they're synonyms: both terms refer to the set of substatements given by a debate agent/the set of (question, answer) pairs an HCH node exchanges with subtrees. I'm talking about explanations to emphasize the fact that they imply a specific statement. ↩︎decompositionsNote that you don't need to define what the non-implication statements are, it's enough to postulate that they exist. You do need to define the implication statements since those determine which sequences are explanations. As an example, the following:

perfectly defines a Cognition Space. However, this is not a solution to the exercise since Explanations(s0) only has a single element. ↩︎

This is a good time to mention that Ought may or may not be studying questions similar to this one. ↩︎