# Ω 9

Personal Blog
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We present a -definable probability distribution that satisfies Christiano's reflection schema for its own defining formula. The strategy is analogous to the chicken step employed by modal decision theory to obfuscate itself from the eyes of ; we will prevent the base theory from knowing much about , so that can be coherent over and also consistently believe in reflection statements. So, the method used here is technical and not fundamental, but it does at least show that limit-computable and reflective distributions exist. These results are due to Sam Eisenstat and me, and this post benefited greatly from extensive notes from Sam; any remaining errors are probably mine.

Prerequisites: we assume familiarity with Christiano's original result and the methods used there. In particular, we will freely use Kakutani's fixed point theorem. See Christiano et al.'s paper.

### Outline

• Section 1. Introduction and problem statement
• Section 2. A definable, self-reflective distribution
• Section 3. A -definable, self-reflective distribution
• Section 4. Discussion and open problems

## Section 1. Problem statement

### Probabilistic reflection

We have some base theory in a language , where is able to talk about arithmetic (e.g. or ). We wish to find probability distributions over completions of , or equivalently functions satisfying implies and probabilistic coherence conditions like . In particular, we want to have accurate beliefs about itself:

where is a symbol in . Christiano showed that there exists such a distribution . In other words, taking an additional symbol in the metalanguage, writing for the statement that is a coherent distribution over , and for the statement that is reflective, we have that the theory

is consistent. That consistency of this theory is equivalent to the existence of a reflective requires an argument due to Fallenstein, wherein we take the standard part of according to some completion of this theory. We will use this idea in Section 3 and explain it there.

### Definable reflection

Christiano asked: Can we find a distribution that is definable, and furthermore reflective about its own defining formula?
This is different from the original reflective because we had a symbol in the language, and reflected itself in that symbol; in that case, speaking imprecisely, knew that assigned the same probabilities that assigned. Now we want a formula that defines a distribution, and that distribution assigns probability 1 to certain true statements about what probabilities assigns. ( stands for -Self-reflective.)

Formally, can we have a coherent distribution on defined by some formula in the language , where now does not contain a distinguished symbol , satisfying:

Our strategy is to apply Kakutani's fixed point theorem to the following reflection correspondence on the set of coherent distributions over . For any , we write and say that reflects in if and only if, whenever is in some interval, believes that assigns to a value in that interval. Formally, for all , we have if and only if:

If we can obtain a fixed point of this correspondence , and have define this same distribution, then we will have a definable self-reflective distribution.

(The notation does not denote an ordering, and is meant to suggest a single point on the left with a larger corresponding set on the right; better recommendations are welcome. Also, in the following section, and are actual distributions, not symbols in a language; this is distinguished by their arguments having the type of formulas rather than Gödel numbers of formulas.)

Most of the conditions for Kakutani's theorem are satisfied by for any formula just as in Christiano's proof, so we won't verify them in detail. For example, the set of distributions reflect in is still convex, because interpolation preserves coherence and reflection. (I can make an auxiliary post if people would like to see the details.) However, we need to satisfy the crucial non-emptiness condition:

In the next section, we give a definition that meets this criterion, assuming a strong background theory such as . This will illustrate the diagonalization method we will apply. Section 3 gives a that uses the same strategy as , but is -definable in arithmetic, i.e. computable relative to a halting oracle. Section 4 discusses the meta-theoretic assumptions used, optimality of our results, and open questions.

## Section 2. A definable, self-reflective distribution Ψ

Consider the theory stating that behaves according to . If is a consistent theory, then we can take a completion of , and then take the 0-1-valued distribution , i.e. the indicator function for membership in the complete theory . This is coherent over , and furthermore by fiat assigns probability 1 to each reflection statement for by .

In the original existence proof of a reflective , the analogous reflection theory for a symbol is consistent: the distribution itself provides a model for . However, in the present case, we have to be more careful. Perhaps proves some non-trivial facts about the behavior of the function we are defining. Then may be inconsistent, if proves that cannot behave according to , e.g. if but .

To overcome this, we will obfuscate the behavior of from the prying proofs of , to such an extreme extent that cannot rule out any finite set of behaviors of ; then will be consistent. (Unfortunately, this weakens the main potential use in reflective agents of having a definable self-reflective distribution, namely, that the theory can reason concretely about the distribution, other than what can be said about generic coherent distributions. It may be possible to have the base theory prove more about the distribution by not diagonalizing against certain behaviors, and working around that in the application of Kakutani's theorem.)

### Definition of Ψ

Take the set of possible finite specifications of behaviors for . Here is some enumeration of all sentences in . Note that is countable, and each sentence in is consistent with being a function, though not necessarily a coherent distribution. Now we can define by the following informally stated construction, which can be translated into a formula defining the graph of a function from sentences to real numbers:

Step 1. Search for a proof in of the negation of some statement in . If such a statement is found, then output some value in on input , and otherwise output 0.

Step 2. Otherwise, output values according to the least fixed point of the reflection correspondence .

By the least fixed point, we mean the fixed point with minimal first coordinate; minimal second coordinate out of those with minimal first coordinate; minimal third coordinate . . . etc. This exists because the set of fixed points is closed, and therefore also compact. Note that we require and to be expressive enough to define fixed points of and the ordering just described. Taking suffices. This is not a crucial point, as we will later show that a much weaker theory suffices.

### Theorem 1.

If is a sound theory, then defines a function on that assigns probability 1 to each sentence in , is coherent, and is reflective.

#### Proof.

Suppose we have for some , where . Then by step 1 of , we have that for all . This contradicts soundness of .

Thus, by compactness, we have that for all , the theory is consistent. Then by the argument above, the correspondence gives a non-empty image for each point in . By Kakutani's theorem, there is a fixed point of this correspondence.

Step 2 then outputs probabilities according to some such fixed point . That is, . Furthermore, , i.e. is reflective.

In fact, we do not need to assume soundness of in Theorem 1; we only need to assume that is consistent. Proof: suppose that for some . Then there is a least such . But then can prove that is the first sentence in that refutes, so can prove that finds this proof and outputs according to . Then we have that both proves and refutes that behaves according to , contradicting the consistency of .

## Section 3. A Δ2-definable, self-reflective distribution Λ

Step 1 in as defined above is -definable, i.e. computable with a halting oracle: we just need to check the halting of the machine that searches for refutations by of the (computably recognizable) sentences in . If the machine halts, find the proof computably, and output rational numbers given by the refuted behavior.

Step 2, however, uses set-theoretic notions; namely, it quantifies over the large and complicated set of fixed points of . In this section we replace Step 2 with a computation, thereby defining a formula that is self-reflective. ( stands for Limit-computable.)

### Definition of Λ

By the arguments for Theorem 1, regardless of what does in Step 2, there will be some such that . Thus, for Step 2, it suffices to run a computation that outputs a fixed point as long as one exists. We now define by the following construction, which can be viewed as an algorithm executed by a machine with a halting oracle:

Step 1. Search for a proof in of the negation of some statement in . If such a statement is found, then output some value in on input , and otherwise output 0.

Step 2. Otherwise, enumerate a completion of the theory in the language of set theory plus a constant symbol . That is, list all sentences , and print out each that is consistent with and the finitely many sentences that have already been printed. Given a sentence and a precision , search for the first quadruple of rational numbers such that , and such that . Output .

(Note: This is the technique alluded to earlier of taking the standard part of a distribution. If we just output with , then we might output e.g. for all , which has empty intersection. Also note that we could take a much more basic theory than ; we could have a base theory that says all true atomic statements about , and take and to be schema of quantifier-free formulas.)

#### Observation: Λ is Δ2.

Proof: Step 1 is by the argument given at the beginning of this section. Step 2 is computable, except that we need to check whether each new sentence is consistent with and the previously printed sentences. To check the consistency of a computably enumerable theory , we ask our halting oracle whether not halts, where is a machine that enumerates all the theorems of , and halts iff it prints .

### Theorem 2.

defines a coherent, self-reflective distribution.

#### Proof.

By the argument for , if is consistent, it does not refute any behavior of . So there is some fixed point , and also behaves according to Step 2.

The distribution provides an interpretation for the symbol satisfying coherence over and self-reflection in . Hence the theory is consistent, and so is non-trivial. By construction, for each there is a unique (standard) real that, for all , lies in the interval output by on and .

Now we argue that is coherent over and reflects itself in . For any , since , we also have . Thus, for all we have that , and therefore . A very similar argument works to show the other coherence conditions and reflection. For example, if , then we have . By an instance of , also , and hence .

Thus computes a coherent distribution over which assigns probability 1 to each reflection statement for .

## Section 4. Discussion

is roughly optimal for the properties of ; computable distributions can't even be coherent over essentially undecidable theories. (If we could compute a coherent distribution over , then for each we could wait for 1 or 0 to be excluded from the intervals we output for , and thereby computably separate the provable and refutable sentences, which is impossible.) A slightly stronger statement is that for any definable completion of , there is a distribution over completions of that is computable from , which is reflective for the definition of that uses in Step 2 to complete .

As mentioned earlier, is very obfuscated from . In fact, if is strong enough, then . Otherwise, since believes each sentence of , by coherence and , we would have that . But this is not possible for a reflective, coherent distribution.

If our metatheory is strong enough to prove some facts like Kakutani's theorem, then proves and . This likely roughly optimal: if the meta-theory even proves , then it also proves . Probably is sufficient to prove the necessary first-order statements, since the second-order theory is probably strong enough to prove a literal statement of Kakutani's theorem, and is the first-order part of . See for example Fixed point theory in weak second-order arithmetic, by Naoki Shioji and Kazuyuki Tanaka , though that paper only proves Kakutani in finite dimensions and the Tychonoff-Schauder theorem (in the weaker system ).

### Questions

• We conjecture that a modification of might give a distribution that satisfies the same properties, and also assigns probability 1 to the statement .
• We also conjecture that Fallenstein's reflection principle for expectations can be similarly defined.
• Can there be a distribution that is reflective and computably approximable (from below)? (This may be easy to refute.)
• Can a definable reflective distribution be coherent over a theory that proves some useful facts about the distribution?
• What are some weaker notions of reflection that can be computable or computably approximable (from below)?
• What reflective properties of beliefs are needed for self-verification in the context of general decision-making?

Open questions (Christiano):

• What sorts of bad beliefs can be avoided given coherence and reflection?
• What are some weaker notions of reflection such that probability distributions can be coherent, reflective, and also assign high probability to their own reflectivity?