Would you like to see a primer on several MIRI research topics (assuming only the background of having taken a course with proofs in math or computer science)? Or are you curious why MIRI does so much with mathematical logic, and why people on Less Wrong keep referring to Löb's Theorem?

If you answered yes to either question, you may be interested in my lecture notes, An Introduction to Löb's Theorem in MIRI Research! These came out of an introductory talk that I gave at a MIRIx workshop.

Since I've got some space here, I'll just copy and paste the table of contents and the introduction section...

## Contents

**1 Introduction**

**2 Crash Course in Löb's Theorem**

2.1 Gödelian self-reference and quining programs

2.2 Löb's Theorem

**3 Direct Uses of Löb's Theorem in MIRI Research**

3.1 “The Löbstacle”

3.2 Löbian cooperation

3.3 Spurious counterfactuals

**4 Crash Course in Model Theory**

4.1 Axioms and theories

4.2 Alternative and nonstandard models

**5 Uses of Model Theory in MIRI Research**

5.1 Reflection in probabilistic logic

**6 Crash Course in Gödel-Löb Modal Logic**

6.1 The modal logic of provability

6.2 Fixed points of modal statements

**7 Uses of Gödel-Löb Modal Logic in MIRI Research**

7.1 Modal Combat in the Prisoner’s Dilemma

7.2 Modal Decision Theory

**8 Acknowledgments**

## 1 Introduction

This expository note is devoted to answering the following question: why do many MIRI research papers cite a 1955 theorem of Martin Löb, and indeed, why does MIRI focus so heavily on mathematical logic? The short answer is that this theorem illustrates the basic kind of self-reference involved when an algorithm considers its own output as part of the universe, and it is thus germane to many kinds of research involving self-modifying agents, especially when formal verification is involved or when we want to cleanly prove things in model problems. For a longer answer, well, welcome!

I’ll assume you have some background doing mathematical proofs and writing computer programs, but I won’t assume any background in mathematical logic beyond knowing the usual logical operators, nor that you’ve even heard of Löb’s Theorem before.

To motivate the mathematical sections that follow, let’s consider a toy problem. Say that we’ve designed Deep Thought 1.0, an AI that reasons about its possible actions and only takes actions that it can show to have good consequences on balance. One such action is designing a successor, Deep Thought 2.0, which has improved deductive abilities. But if Deep Thought 1.0 (hereafter called DT1) is to actually build Deep Thought 2.0 (DT2), DT1 must first conclude that building DT2 will have good consequences on balance.

There’s an immediate difficulty: the consequences of building DT2 include the actions that DT2 takes, but since DT2 has increased deductive powers, DT1 can’t actually figure out what actions DT2 is going to take. Naively, it seems as if it should be enough for DT1 to know that DT2 has the same goals as DT1, that DT2’s deductions are reliable, and that DT2 only takes actions that it deduces to have good consequences on balance.

Unfortunately, the straightforward way of setting up such a model fails catastrophically on the innocent-sounding step “DT1 knows that DT2’s deductions are reliable”. If we try to model DT1 and DT2 as proving statements in two formal systems (one stronger than the other), then the only way that DT1 can make such a statement about DT2’s reliability is if DT1 (and thus both) are in fact unreliable! This counterintuitive roadblock is best explained by reference to Löb’s Theorem, and so we turn to the background of that theorem.
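As a rough preview, the roadblock can be sketched in a few lines. This is only a sketch under assumptions that the paragraph above does not spell out: the names $T_1$ and $T_2$ for the formal systems of DT1 and DT2, the notation $\Box_{T}\varphi$ for "$T$ proves $\varphi$", and the requirement that both systems are consistent-looking extensions of Peano Arithmetic with computably enumerable axioms (so that Löb's Theorem applies) are all illustrative choices, and "DT1 knows that DT2's deductions are reliable" is read here as a soundness schema.

```latex
% Assumed notation (illustrative, not from the text above):
%   T_1, T_2       formal systems used by DT1 and DT2, with T_2 extending T_1
%   \Box_T \varphi the arithmetized statement "T proves \varphi"
% Requires amsmath for align* and \text.
\begin{align*}
&\text{L\"ob's Theorem:}
  && \text{if } T \vdash \Box_{T}\varphi \to \varphi,
     \text{ then } T \vdash \varphi. \\
&\text{DT1 trusts DT2:}
  && T_1 \vdash \Box_{T_2}\varphi \to \varphi
     \quad \text{for every sentence } \varphi. \\
&\text{Since } T_2 \supseteq T_1:
  && T_2 \vdash \Box_{T_2}\varphi \to \varphi
     \quad \text{for every sentence } \varphi. \\
&\text{By L\"ob's Theorem for } T_2:
  && T_2 \vdash \varphi
     \quad \text{for every sentence } \varphi,
     \text{ including } \varphi = \bot.
\end{align*}
```

Under these assumptions $T_2$ proves a contradiction, so it is inconsistent. $T_1$ in turn has proved $\Box_{T_2}\bot \to \bot$, which is false because $T_2$ really does prove $\bot$, so $T_1$ is unreliable as well. The sections that follow supply the background behind each step.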