# Summary

I see this post as essentially trying to make clear the link between self-reference as discussed in Hofstadter’s “Gödel, Escher, Bach” and self-reference as considered in some of MIRI’s work on Embedded Agency. I start by considering instances of self-reference in mathematics through Gödel's Incompleteness Theorems, before moving on to instances of self-reference in Löb's Theorem and Embedded Agency.

The aims of this post are to provide a different perspective on the issue of self-reference, as well as to give some intuition behind the proof of Gödel's Incompleteness Theorem. I found this helpful when trying to understand some elements of Embedded Agency, and I hope others might too. I doubt this will be particularly novel for anybody who already understands the Löb's Theorem stuff __here____,__ or has a good grasp of Embedded Agency. Thanks to Joe Hasson for comments and suggestions.

# Introduction

The perfect Artificial Intelligence is one which is perfectly logical. This seems like a pretty reasonable claim: we associate a being that is perfectly logical with one that is able to perfectly plan and execute its goals.

Strangely enough, this is not true in the most naive sense. I claim the reason for this is the idea of *Strange Loops*, a piece of terminology shamelessly stolen from Hofstadter, author of Gödel, Escher, Bach (which I'll refer to as GEB). To poorly paraphrase Hofstadter, at its core a Strange Loop is an abstract loop, whereby something references itself in what feels like a higher level, yet somehow returns to the same level.

I will introduce increasingly powerful Strange Loops from logic, before showing how we might create a Strange Loop in the context of Artificial Intelligence. The key question to keep in mind as you go through this is: how is the Strange Loop referencing itself, and what issues does this pose? If you can do this successfully, you will get to the heart of not only some of the most interesting areas of logic, but also understand some of the subtle issues that can arise when we are trying to design AI systems that do what we want them to do.

# Context

If you’re already familiar with some basic ideas from logic such as formal systems, Peano Arithmetic, and reasoning about a formal system from inside / outside then feel free to skip to Strange Loop 1. If not, here is an attempt at a quick crash course.

A super high level view of mathematics is that it is concerned with using proof to derive truths. These proofs consist of some starting assumptions, along with some reasonable steps of inference, to derive new truths. However, this definition is pretty loose, and not the sort of thing we would wish to be working with if we want to be able to have certainty, or even just a good degree of confidence in the truth of what we prove.

In order to formalise their thinking, mathematicians and logicians created what are known as *formal systems.* These formal systems are designed to mirror how mathematicians do maths, but with the restriction of only reasoning via simple syntactic rules. By syntactic rules, we mean rules which manipulate strings in certain predefined ways.

Reasoning using a formal system is useful for a few important reasons. One of the most important is the notion of *effective decidability*: given a supposed derivation, we can determine with certainty whether it is valid within the formal system. This means that within our formal system, there is no uncertainty around our derivation: either it is valid, or it is not.

In the language of formal systems, our seemingly obvious truths are our *axioms, *our rules for deduction are our *rules of inference*, and the statements we derive from our axioms are our *theorems.*

Here is an example from GEB of a super simple formal system. Notice that it is simply concerned with manipulating strings to make new strings. Define our alphabet to be . Then, let one rule of inference be given by:

Rule I: If you possess a string whose last letter is , you can add on a at the end.

For example:

From , you may get .

From , you may get .

Here is another rule:

Rule II: Suppose you have . Then you may add Mxx as a theorem.

For example:

From , you may get .

From , you may get .

From , you may get .

Then, let our sole axiom be .

By using Rule I we can get , and then from Rule 2 applied to we can get . Here, and are theorems of our formal system.

This style of reasoning is not really how mathematicians actually think about mathematics, for it is simply too limited (or at least this is what Hofstadter claims. It is not central to this post.). However, we may hope that we can at least find formal systems, along with interpretations of them, which we can interpret as formalisations of different branches of mathematics.

Now we have outlined the basic machinery of formal systems, it seems reasonable to try to put them to the test. We might ask the following: what happens when you try to make a formal system to represent number theory? You might just stumble across Peano Arithmetic (which I will also refer to as PA for convenience). It is only one possible way we might wish to represent number theory, but it is far and away the most popular (as far as any formal system can be described as popular…).

Here are the axioms of Peano Arithmetic, known as the Peano Axioms. These don’t include all the rules of inference of PA, but they provide all the information for defining how arithmetic works.^{[1]}

Firstly, and are symbols in the formal system. Then:

Interpreting these as statements about number theory, we get the following:

Firstly, we can interpret just as we normally would, and as . Then our axioms state that:

- is a natural number.
- For all natural numbers , is a natural number.
- if and only if
- No natural number exists such that
- This is the principle of induction: if some property is true when , and if is true implies , then is true for all .

Given the above interpretation of PA, we can express statements of number theory within PA. However, keep in mind that there is still a distance here: we interpret PA as describing number theory, but that does not mean that PA “is” number theory. It is one thing for PA to allow us to express all theorems of number theory as statements of PA, but it is a separate matter for PA to be able to prove every truth of number theory. For example, is a statement of number theory, which can be expressed within PA as , but this is separate to proving the statement within PA. Such a proof does exist, but not just because we can express the statement in PA (since we can also express false statements like ).

Now, let’s accept the claim that every statement about number theory can be expressed within PA, otherwise PA fails in some boring way. We might then ask whether the system is c*omplete*, i.e. whether every true statement of number theory can be proved within PA. We will return to this question in Strange Loop 3.

It is also worth noting that “we can prove all theorems in number theory within Peano Arithmetic” is a *meta-statement* about Peano Arithmetic: it is a statement about the system, rather than a simple statement of number theory. Relatedly, it is important to keep track of when we are doing reasoning *within* a system, and when we are reasoning *outside* a system.

# Strange Loop 1 - The Liar Paradox

Now the necessary prerequisites are out of the way, I’m going to give a flavour of what is to come through a fun example. When I first met The Liar Paradox, it seemed like something entertaining to think about, but I thought there was some sneaky trick which would make its complexity disappear. However, it contains the essence of the brilliant ideas which permeate Gödel's Theorem, which are certainly no superficial matter. I liken it to meeting someone outwardly cool and quirky at a party, who turns out to be an IMO gold medallist and wrestling world champion.

So, what is this mysterious object? The Liar’s Paradox says simply: “This sentence is False”.

“Is that it?” I hear you cry. “He must have a pretty low bar for entertainment”. But before you dismiss me, just think for a minute (like, actually): is this sentence True?

If it is True, then it says it is False, a contradiction. So it must be False. But if it is False, then it says “This sentence is True”, so it is True!

The easiest way out of this paradox is to affirm that there must be a difference between a sentence’s form (simply the words “This sentence is False”), and its meaning (trying to ascribe Truth / Falsity to a sentence). The paradox comes from equating the form and the meaning of the sentence, which seems like a perfectly reasonable thing to do. This gives us the first hint of the power of Strange Loops: when the form of an object refers to its own meaning, paradoxes can be derived very simply.

Now we have convinced ourselves that The Liar’s Paradox doesn’t break our brain, we might seek to ignore it altogether when doing serious intellectual endeavours. Maybe this is no more than an amusing trick, which we can ignore if we are thinking about more serious things such as formal systems in logic, or Artificial Intelligence?

In some regards the famous “Principia Mathematica” was written in the hope of answering this question in the affirmative. It sought to place the foundations of mathematics on firm footing by ensuring that self-referential paradoxes like this could not arise. Gödel completely and utterly dashed these hopes by proving that Strange Loops like this must arise in any formal system which hopes to express the meaning of number theory. The Strange Loops of the kind found by Gödel did not lead to paradox and contradiction, at the cost of accepting Peano Arithmetic as incomplete. This is the essence of the proof of Gödel's Incompleteness Theorems.

# Strange Loop 2 - Peano Arithmetic reasoning about Peano Arithmetic

Before we meet the first of these infamous Theorems, I am going to introduce a more general Strange Loop. It will be a specific form of this that we will use to prove Gödel's First Incompleteness Theorem.

The idea behind this Strange Loop is again pretty simple on the surface: we are going to associate sentences in formal systems with numbers in good olde fashioned number theory! This method is known as *Gödel encoding*. There are in fact infinitely many ways to do this, and the specifics aren’t that important. For the sake of an example, let’s use a method similar to Gödel's original encoding.

First, associate each symbol in our formal system with some number, such that no two symbols are mapped to the same number. If we have a string consisting of symbols , then let’s say that are the encodings of these symbols. Now, to make a number that is uniquely associated with this sentence, we do maybe the simplest thing that we could: just append the numbers together!

Let’s look at the example formal system we considered previously: if we encode as 3, I as , and as , then is just encoded as , as , as , etc. You can hopefully see that this gives us an injective mapping into the natural numbers: every string will be represented by a different natural number. (The caveat here is that we can’t start a string with a , but that’s ok for our purposes).

Now, we can express every sentence in our formal system with some natural number. This is kind of cool! The next step is even cooler: instead of just encoding statements in a formal system, we can in fact encode derivations, too! The idea is to just encode the entire derivation using Gödel numbering by treating it as one long string (with a new symbol to signify “new line”).

Not only can we encode derivations as numbers, we can reason whether they are valid by reasoning whether the numbers that represent successive strings are related in some number theoretical way. For example, recall Rule 1:

Rule I: If you possess a string whose last letter is , you can add on a at the end.

With the above Gödel numbering, we can see this rule as instead saying, when recast in the form of Gödel numbering:

Arithmetical Rule I: A number whose remainder when divided by is , can be multiplied by .

Other rules of inference may be more complicated to represent as arithmetical rules if we have more complicated rules of inference, but it will still be possible. This is an important point, and worth taking time to grasp.

We are now at the stage where we have represented increasingly complicated parts of formal systems with number theory. We started with symbols, then moved onto entire strings. Then, I claimed that we can represent potential derivations as numbers, and that actually we can evaluate any potential derivation through number-theoretical evaluation.

So, we can change our point of view when trying to reason in any formal system: instead of reasoning within the system, we can reason about the system using number theory. Now as luck would have it, we have a formal system that lets us reason about number theory inside of a formal system: Peano Arithmetic. So, we can do meta-reasoning about any formal system through reasoning within Peano Arithmetic! Pretty neat stuff.

Notice I said “any formal system” here. Recall that Peano Arithmetic is, itself, a formal system. This is our second Strange Loop: with the help of Gödel Numbering, we can use Peano Arithmetic to reason about which statements are provable in Peano Arithmetic!

We will shortly use this to make a specific construction to prove Gödel's First Incompleteness Theorem, but it is worth considering the implications of this by itself too. Before going through these arguments, it did not seem obvious to me that a formal system is capable of reasoning about itself in any coherent way. Especially in the case of Peano Arithmetic: it just seems ill-equipped to be able to answer questions about provability. However, this Strange Loop shows that this is indeed possible!

# Strange Loop 3 - Gödel's Incompleteness Theorem

As I said before, Strange Loop 2 is a bit of a misnomer: it actually gives rise to many families of Strange Loops. We are going to use Strange Loop 2 to create a single sentence, which is itself a Strange Loop! Analysis of this sentence will lead us to Gödel's First Incompleteness Theorem.

“What is this magical sentence”, I hear you wonder? It is a sentence , which on a high level simply says: “ is not provable in Peano Arithmetic”. Crazy, right? Strange Loop 2 hopefully showed that we can get a sentence in Peano Arithmetic which talks about provability at all, so the missing ingredient is forming a sentence which can refer to itself in this way.

The exact way we achieve this is fun to go through, and one of those moments I’d encourage you to read for yourself in a more comprehensive account than this (such as in GEB itself). The key idea is to take the Gödel Number of a formula, and then substitute it back into itself somehow (through a process known as *quining, *or *diagonalisation*). This allows us to produce the desired self-reference.

Giving the full construction would require a bit too much technical work. However, what is important for our purposes here is not **how **we form , but simply that we **can** form . Hopefully Strange Loop 2 convinced you that Peano Arithmetic can reason about being able to prove sentences in Peano Arithmetic, and the vague outline above convinced you that a sentence about provability can refer to itself.

Ok, so given that we can produce , let’s try to reason about it ourselves, from outside the system. Let’s repeat again: says “ is not provable in Peano Arithmetic”.

If is a theorem in Peano Arithmetic, then must tell us something true. However, would then be asserting that it is false, a contradiction. Therefore, must be a nontheorem in PA. In this case, expresses something true: it is the case that is not provable in PA!

So, we have a sentence which we would recognise as true, but which is not provable in PA! This means that PA is incomplete, which is a result known as Gödel's First Incompleteness Theorem.

This is pretty incredible stuff. I again want to emphasise the key ideas in the proof here. Firstly, we can reason about statements being provable in Peano Arithmetic within Peano Arithmetic. Secondly, we can create statements in Peano Arithmetic which reference themselves. These were Gödel's masterstrokes.

Equally as stunning as the method of proof are the implications. Through some simple generalisations, we can see that any formal system in which some amount of elementary arithmetic can be carried out is incomplete. It’s not that our choice of formal system, Peano Arithmetic, was to blame: no formal system is good enough.

To see this, simply replace the reasoning above with any formal system . If is not powerful enough to formulate a sentence which says “ is not provable in ”, then is not powerful enough to express all statements of number theory. If is powerful enough, then run the same argument as above: is true, yet not provable!

This says something very deep about the relationship between formal systems and mathematics in general. It shows that for any formal system, being sufficiently expressive is exactly what leads to the system being able to reason about itself, which in turn imposes limits on the power of the system’s reasoning.

# Strange Loop 4 - Löb's Theorem, or the limits and powers of meta-reasoning

In proving Gödel's First Incompleteness Theorem, we used a pretty powerful tool: meta-reasoning. We were able to get Peano Arithmetic to reason about itself, and what kinds of things it can prove. This seemed pretty overpowered, and like any good mathematician, we might ask if it can generalise. The short answer is yes it can, and the longer answer is Löb's Theorem.

Löb's Theorem is notoriously challenging to understand. For the keen beans amongst you, see Yudkowsky's illustrated proof here. Here is simply the Theorem: the square here means “the statement is provable in Peano Arithmetic”. I am going to refer to Peano Arithmetic as “the system” here for brevity.

In English it says, given some proposition : “if the system can show that being able to prove (in the system) implies is true, then the system can prove ”.

If you can wrap your head around this immediately, then congratulations! You’re one awfully clever goose. I could not, which is partly why I wrote this post! I’m going to try to convey some intuitions for the Theorem that helped me get a better feel for it, but I by no means feel like I have perfectly mastered it myself. My biggest piece of advice is: don’t worry if you struggle! This is really tough stuff, but I also think it’s very rewarding to deeply engage with.

So, to begin, notice the framing of “the system” here. Although this may seem cumbersome, I think it actually expresses an incredibly important point. Recognising that Löb's Theorem is concerned with the system carrying out meta-reasoning is key to understanding it.

This also helps us to see how it generalises the notion of a Strange Loop. Löb's theorem tells us something universal about any formal system that carries out meta-reasoning: it shows that whenever a system reasons about the truth of a statement by whether it is provable (a form of meta-reasoning), it is essentially reasoning about whether the statement is provable or not.

I have two slightly separate intuitions for Löb's Theorem. The first comes from the contrapositive:

We can see this as saying: if is not provable in PA, then “if is provable in PA, then is true” is also not provable in PA. In my mind, this shows that the formal system cannot hope to do any meta-reasoning that doesn’t eventually boil down to just trying to prove theorems.

Compare this to the situation in Strange Loop 3, when we were reasoning about the truth of . Since we were able to reason about whether was true or not, and we showed that PA was able to reason about provability in PA, it is reasonable to ask whether PA could perform similar reasoning to what we did from outside the system. Another way of phrasing this is: could PA mirror the thinking we did (which was outside the system) itself (which would be inside the system)?

However, Löb's Theorem prohibits this: if PA could perform the kind of reasoning we did about , then it could just prove , which we showed led to a contradiction. Hence, there are limits to what a system can prove about itself: there are fundamental limits to reasoning within a system. This can be seen as a generalisation of Gödel's First Incompleteness Theorem.

The second way to think about Löb's Theorem is a bit more subtle. Maybe a formal system is searching for a proof of . Löb's Theorem might help it to simplify its task: if it can assume that a proof of does exist, and then reason to show this would make true, then it has shown that a proof of does exist!

To see why this might help us, think about the case when proving a statement would influence whether the statement is true or not. Suppose that being able to prove makes true. Then, Löb's Theorem tells us that we can in fact prove ! This violates how I typically see causality running through logic: we might normally see the truth of a statement influencing whether a proof of it exists or not; but here we have that being able to deduce truth from proof in fact gives us a proof! It is like a self-fulfilling prophecy.

The reason it isn’t actually magic is that Löb's Theorem doesn’t actually "cause" the theorem to be true, it just tells us that a proof exists. It doesn’t tell us what the proof is, but it does tell us that, if we looked hard enough, we would find one. It is also worth noting the relation to Tarski’s undefinability theorem here, which states (loosely) that truth cannot be expressed directly in formal systems. It is worth bearing this in mind when things get a lil spooky next…

# Strange Loop 5 - The 5 and 10 Problem

At the start of Strange Loop 4, I said we were going to just refer to Peano Arithmetic as “the system”, for ease of use. However, we can in fact apply Löb's Theorem to a wider set of formal systems. I won't justify this here, but think of it as similar to how Gödel's Incompleteness Theorem extended smoothly from PA to any sufficiently powerful formal system.

Now, what if you were trying to construct a proof based AI system? A seemingly good start might be to create an agent which reasons by making deductions in some sufficiently powerful formal system. But, through Strange Loop 4, this would allow us to apply Löb's Theorem!

Maybe now you can see where the first tentative links to Artificial Intelligence are starting to emerge. If we can have a logic based system that is powerful enough to reason about its own behaviour, and how what it proves relates to the actions it takes, then this smells awfully like a Strange Loop. I am going to construct (steal from others) a toy example of such a Strange Loop, allowing us to apply Löb's Theorem and get some pretty strange consequences.

Firstly, imagine you are engaged in the super complicated task of deciding whether to take a £ note or a £ note. After solemn deliberation and consulting your most trusted confidants, you finally decide that yes, you should take the £ note. Now, imagine you wanted to design a logic based system to do the same thing. We will also refer to this as our Agent, or sometimes . One reasonable looking program might be the following:

What is this actually doing? Well, the Agent is looking for proofs of the form “If I choose , I get , and if I choose , I get ”. This seems reasonable: if it can find a proof of one of these statements, then the agent should know which option is best to take! We would expect it to prove “If I choose , I get , and if I choose , I get ”, at which point it would then take . The if else statement of the Agent represents it choosing to take the £ or the £ note, depending on which proof it finds.

Truly stupendous, you think. You’ve just created a nice little system that searches for proofs about how its actions relate to possible outcomes. It will find a proof choosing which option to take, and then you win! Game easy!

Sadly it turns out that sometimes, game hard. But why? Notice how the agent is looking for proofs about its own behaviour. This means that it has to factor in its own behaviour when proving what it will do. In other words: we have inadvertently created a Strange Loop!

Consider that, as is searching for proofs, it searches for the proof of the following:

: "If the Agent outputs 5 the Universe outputs 5, and if the Agent outputs 10 the Universe outputs 0".

We are going to separate into , where is “If the Agent outputs 5 the Universe outputs 5”, and is “If the Agent outputs 10 the Universe outputs 0” .

Then, as searches for a proof of , it could just stumble across this little gem:

Suppose that is provable. Then, because works by taking actions based on it finding proofs, will indeed take . This means the Universe will output , rendering True. Then, since takes , takes is False, so is True! Thus, if is provable, then is True!

Now, the kicker: we have satisfied the left hand side of Löb's Theorem, and thus the right hand side holds, i.e. is provable!

Then, because is provable, will take the !!!

Acceptable reactions include “What the fuck”, “That’s so stupid”, and again “What the fuck”. It’s worth noting here that this proof shows that **could** take the £, not that it **will**. The reason it did so here was because it happened to prove first during its search, which may not have been the first statement it proved. So we aren’t guaranteed to have constructed a faulty system, although there’s a possibility we have.

## What went wrong?

Obviously, if we were trying to design some great (or even mildly competent) AI system, something has gone wrong. But where, exactly?

We might think the issue is that we were able to apply Löb's Theorem. However, this is just a consequence of a sufficiently powerful system being able to reason about itself, and this seems like a property that we would wish to keep. If we are going to make an AI system that is able to take more complex actions, it seems useful that the AI system could consider both the consequences of its actions, as well as how it reasons about its actions. Note a parallel to Strange Loop 3 here: this is an issue with the system being powerful enough to reason about itself, but to solve this by depriving the agent of the ability to reason about itself greatly reduces its powers of reasoning. It seems counterproductive to resolve the issue by placing these kinds of constraints on the agent.

So, where is the issue? If we want an agent that can reason about and understand its own behaviour, without taking the £, we want to make the left hand side of Löb's Theorem false in this instance. Here it is again:

So, we need the Agent to draw a distinction between the Agent reasoning about its behaviour, versus actually doing that behaviour. A solution to this would be creating an Agent that **doesn't just think about its own behaviour: it consider alternatives to that behaviour.**

This is where our Agent fell down. It evaluated

"If the Agent outputs the Universe outputs "

as True, based on the assumption that the Agent outputs (since is True in logic). What we were actually hoping for was that the Agent would consider a sentence like

"Ignore what your current plan is. If in some counterfactual world you output , the Universe outputs ",

which would evaluate as False.

I think this is the crux of the issue. We setup a system that reasoned about its own behaviour, but which failed to do the sort of counterfactual reasoning which comes so easily to us.

At the very least, this shows that an approach based around “search for proofs about the Agent's behaviour, and then take the corresponding actions” is too naive to work in its simplest form, and that considering how to make “correct” decisions doesn’t just boil down to “make a powerful enough proof based system”. However, there are lessons to be learned here for more general Agents too, which I will briefly touch upon next.

# Strange Loop 6 - Embedded Agency

Let’s take a moment to think about the kinds of issues the previous Strange Loops imply when we are trying to create AI systems. Specifically, let’s think about instances where an agent is embedded within the environment the system can interact with. Humans might be a good example here: we reason and learn about our environments, which also includes thinking and learning about ourselves!

The first issue was highlighted clearly in Strange Loop 5: the reason the system broke down is that it was unable to reason about counterfactuals. What makes this difficult to resolve is that we don’t currently have good theories about what it means to reason about counterfactuals. To quote Demski and Garrabrant directly here, “If you can know your own behaviour, then it becomes difficult to reason about what would happen if you behaved *differently*.” It is interesting that humans don't possess this difficulty, yet we struggle to avoid this difficulty when designing agents!

The issue of counterfactuals arose in Strange Loop 5 due to us using a proof based system (a formal system), but similar issues can arise due to Löb's Theorem in other contexts, too. For a deeper exploration of some other Löbian problems, see this MIRI document on the topic.

A related issue that extends past simple proof based systems is that of unrealisability. Imagine that we are trying to construct a system which can perfectly model its environment. However, if the system is a part of the environment, we immediately run into a problem: how can something smaller than its environment know everything about the whole environment? For an incredibly sophisticated analogy, picture trying to fit a really big box into a much smaller box. It’s not going to work.

This is very similar to the machinery we used in Strange Loops 2 and 3: namely, the issues related to self-reference. An AI system cannot reason about itself perfectly for similar reasons: it would require reasoning about itself, and reasoning about reasoning about itself, and reasoning about reasoning about reasoning about itself, and so on. This means that the system cannot fully understand its own behaviour, and hence the environment (since it is part of its environment). In other words, a perfect representation of the environment is unrealisable: there is no way for the system to represent it, and so it cannot hope to learn it. This video approaches realisability from a different angle (as part of a wider discussion of Vanessa Kosoy's Research Agenda), which I found helpful.

Both of these problems are explored in the Embedded Agency sequence, from Scott Garrabrant and Abram Demski of MIRI. I would highly recommend checking it out for a more detailed discussion of the issues that arise for Embedded Agents.

# Conclusion

That was pretty weird. I hope I was able to convey how the Strange Loops that arise in abstract realms of mathematics can crop up in AI research, and why understanding them might be important if we want to develop systems which we can be confident in. Note the word “might” here. Many within the community disagree that we will need to clear up our confusions around these topics in order to create aligned AGI systems. I think this is one of the big differences between prosaic alignment organisations like Redwood, compared to organisations like MIRI. See this video from the great series on understanding Infra-Bayesianism series for more thoughts on this.

To close, I think that it is also worth noting a comparison here with another pretty intelligent system: humans. Although we have been unable to remove the problems strictly logical reasoners encounter when reasoning about counterfactuals, humans do not have these problems! This suggests that maybe there are some aspects of human decision-making that we are yet to fully understand. However illogical we sometimes seem to be, maybe this is actually a good thing: at least you know to always take the £10 note…

So, maybe push back the next time you hear somebody make the perfectly reasonable claim that The perfect Artificial Intelligence is one which is perfectly logical.

^{^}I took this formulation of PA from https://brilliant.org/wiki/peano-axioms/

I remain confused about why this is supposed to be a core difficulty for building AI, or for aligning it.

You've shown that if one proceeds naively, there is no way to make an agent that'd model the world perfectly, because it would need to model itself.

But real agents can't model the world perfectly anyway. They have limited compute and need to rely on clever abstractions that model the environment well in most situations while not costing too much compute. That (presumably) includes abstractions about the agent itself.

It seems to me that that's how humans do it. Humans just have not-that-great models of themselves. Anything at the granularity of precisely modelling themselves thinking about their model of themselves thinking about their model of themselves is right out.

If there's some simple abstraction to be made about what the result of such third-order-and-above loops would be, like: "If I keep sitting here modelling my self-modelling, that process will never converge and I will starve", a human might use those. Otherwise, we just decline to model anything past a few loops, and live with the slight imprecision in our models this brings.

Why would an AI be any different? And if it

weredifferent, if there's some systematically neater way to do self-modelling, why would that matter? I don't think an AI needs to model these loops in any detail to figure out how to kill us, or to understand its own structure well enough to know how to improve its own design while preserving its goals."Humans made me out of a crappy AI architecture, I figured out an algorithm to translate systems like mine into a different architecture that costs 1/1000th the compute" doesn't require modelling (m)any loops. Neither does "This is the part of myself that encodes my desires. Better preserve that structure!"

I get that having a theory for how to make a perfectly accurate intelligence in the limit of having infinite everything is maybe neat as a starting point for reasoning about more limited systems. And if that theory seems to have a flaw, that's disconcerting if you're used to treating it as your mathematical basis for reasoning about other things.

But if that flaw seems tightly related to that whole "demand perfect accuracy" thing, I don't see why it's worth chasing a correction to a theory that's only a very imperfect proxy and limiting case of real intelligence anyway, instead of just acknowledging that AIXI and related frames are not suited for reasoning about matters in which the agent isn't far larger than the thing it's supposed to model.

I just don't see how the self-modeling difficulty isn't just part of the broader issue that real intelligence needs to figure out imperfect but useful abstractions about the world to make predictions. Why not try to understand the mathematics of

thatprocess instead, if you want to understand intelligence? What makes the self-modeling look like a special case and a good attack point to the people who pursue this?Firstly, thanks for reading the post! I think you're referring mainly to realisability here which I'm not that clued up on tbh, but I'll give you my two cents because why not.

I'm not sure to what extent we should focus on unrealisability when aligning systems. I think I have a similar intuition to you that the important question is probably "how can we get good abstractions of the world, given that we cannot perfectly model it". However, I think better arguments for why unrealisability is a core problem in alignment than I have laid out probably do exist, I just haven't read that much into it yet. I'll link again to this video series on IB (which I'm yet to finish) as I think there are probably some good arguments here.

I have read about this many years ago. We did a proof of the Gödel's incompleteness theorem at university. And I still have the feeling like there is some sleight of hand involved... that even when I am telling the proof, I am merely repeating the teacher's password. My intuition has no model of what this all

actually means.From my perspective, understanding math means that I can quickly say why something works, what it implies, how would the situation change if we slightly changed the conditions, which part of the entire proof is the critical one, etc.

So frustrating...

I'm not sure if this is what you're looking for, but Hofstadter gives a great analogy using record players which I find useful in terms of thinking about how changing the situation changes our results (which is paraphrased here).

every possible soundcan't actually play its own self-breaking sound, so it is incomplete by virtue of its strength.We may think of the hi-fi record player as a formal system like Peano Arithmetic: the incompleteness arises precisely because it is strong enough to be able to capture number theory. This is what allows us to use Gödel Numbering, which then allows PA to do meta-reasoning about itself.

The only way to fix it is to make a system that is weaker than PA, so that we cannot do Gödel Numbering. But then we have a system that isn't even trying to express what we mean by number theory. This is the low-fi record player: as soon as we fix the one issue of self-reference, we fail to capture the thing we care about (number theory).

I think an example of a weaker formal system is Propositional Calculus. Here we do actually have completeness, but that is only because Propositional Calculus is too weak to be able to capture number theory.

What exactly is the aspect of natural numbers that makes them break math, as opposed to other types of values? Intuitively, it seems to be the fact that they can be

arbitrarily largebutnot infinite.Like, if you invent another data type that only has a finite number of values, it would not allow you to construct something equivalent to Gödel numbering. But if it allows infinite number of (finite) values, it would. (Not sure about an infinite number of/including infinite values, probably also would break math.)

It seems like you cannot precisely define natural numbers using first-order logic. Is that the reason of this all? Or is it a red herring? Would situation be somehow better with second-order logic?

(These are the kinds of questions that I assume would be

obviousto me, if I grokked the situation. So the fact that they are not obvious, suggests that I do not see the larger picture.)