Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Prerequisite: Agents that can predict their Newcomb predictor

(This post is from Benja's backlog of topics that should appear here, which he's deputized me to write up. The original idea was due to Jessica Taylor.)

The phenomena of agent-simulates-predictor or agent-trusts-predictor, where one has stronger axioms and the other has stronger powers of deduction, can be done at various levels of the tradeoff between realistic examples and tractable proofs. In the previous post, I considered the version that uses a bounded proof search on one hand, and a halting oracle on the other. There's a finite version where one process uses a bounded proof search with a much higher bound than the other process. But in this post, I'll discuss the clean infinite version where we pit a halting oracle against an oracle for the next level of the arithmetical hierarchy of Peano Arithmetic.

Here, we'll use two different formal systems built on Peano Arithmetic.

A statement of PA is any statement of the form , where uses only bounded quantifiers. (For example, the claim that PA is consistent is a sentence, asserting that every natural number is not the Gödel number of a proof of a contradiction.)

is PA plus the (non-decidable in PA) axioms consisting of every statement which is true in the standard model of PA. We can think of it as a formal system which can make a call in finite time to a standard halting oracle.

(Note that there's a different meaning of for ZFC, involving the analogous definitions for set theory. In order to avoid that ambiguity, we could instead just use the formal system , which is PA plus the axiom schema for every . However, in that case we'd need to go through a proof that also knows that is sound.)

We'll have the exact same results as in the last post, so we'll prove the analogous lemma, showing that the stronger axiom system (here, ZFC) can prove that statements of a certain type hold if and only if they are provable by an oracle with weaker axioms but stronger computation (here, )

Lemma: Let be a propositional combination of statements. Then .

Proof: First, we note that for any , because ZFC knows about models of , and thus knows it is sound.

Now we show that if is a propositional combination of statements, then . Express in disjunctive normal form; it is thus clear that it suffices to show this for that are conjunctions of statements and negations of statements. Now if such a holds, then all of its substatements must hold, and thus these are axioms of ; also, all of the negations of statements are of the form where has only bounded quantifiers, and if this holds, indeed there must be such an for each substatement, and thus will successfully prove . (The inductive argument for this reasoning can be carried out in ZFC.)

Now we can show the same two cases, with the same proofs as before:

Case 1: Agent Trusts Predictor

The agent uses modal UDT and a halting oracle for ZFC; Omega uses the second-level oracle for . Note that the statements "" and "" are each propositional combinations of statements about Gödel numbers.

Since we (from outside the problem) believe that ZFC is consistent and we know by the Lemma that Omega will predict correctly, the agent will fail to prove that it can get $1M + $1K by two-boxing or by one-boxing, and will fail to prove that it can get $1M by two-boxing. Then, using the Lemma, it can prove that if it one-boxes, it will get $1M, and so it does.

Case 2: Agent Simulates Predictor

The agent now uses modal UDT as well as a second-level halting oracle for ; Omega uses a halting oracle for ZFC.

Now, as before, the Lemma implies that Omega can prove that , and that , so it proves and does not fill the box; then the agent two-boxes.

New Comment