Earlier this year, when looking for an inroad to AI safety, I learned about the Löbian Obstacle, which is a problem encountered by 'purely logical' agents when trying to reason about and trust one another. In the original paper of Yudkowsky and Herreshoff [1], they show that a consequence of Löb's theorem is that an agent X can only "trust" the reasoning of an agent Y with a strictly weaker reasoning system than themselves, where "trust" here means 'formally prove that the conclusions of the other agent's reasoning will be true'. As stated, this looks like a major problem if X is a human trying to build an artificially intelligent system Y, but it's also a problem for any individual (embedded) agent trying to reason about their own future behaviour.

I'm not the first person to find this problem counterintuitive, and for good reason. In this article I'm going to explain why a formal (purely syntactic) logic system alone is a poor model of the reasoning of embedded agents, and show that by fixing this, we remove the foundation for the difficulties arising from Löb's theorem.

For the uninitiated, there is a handy survey of application of Löb's theorem in AI safety research by Patrick LaVictoire [6].

## Pure syntax

First, I should explain the formal set-up for applying Löb's theorem to agents. We model an agent's reasoning with a formal language, or *logic*, which I'll call L. Here I shall make the further assumption that this logic fits (or can be squeezed into) a formal language of the kind logicians are familiar with: the logic consists of some formal symbols or *variables* A,B,C... along with some *logical connectives, *operators and quantifiers for combining variables into expressions, or *formulas. *The agent is also assumed to carry some *inference rules *for manipulating formulas. Altogether, this data constitutes the * syntax *of L (its symbolic content and the rules for manipulating those symbols).

**Since we don't care precisely what the symbols in L refer to**, we need go no further, and we can start worrying about how this formal system behaves.

To arrive at the Löbian Obstacle, we suppose that an agent A (who reasons with L) builds an agent B who shall also reason with the logic L. Suppose, specifically, that A is building B to help them achieve some goal G. In order for A to ensure B will be helpful, A programs B to produce a proof that B's actions will lead to the goal G before acting. I should clarify that we are assuming that A and B inhabit an idealized environment over which they have complete control and perfect information, so that such a proof represents a guarantee of success.

Since A doesn't know in advance what B needs to do to achieve G (else there would be no dilemma!) there is a large number of formulas in L which could be output by B as 'actions provably achieving G'. Being maximally pessimistic (incidentally, this assumption is another one I find problematic, but I'll allow it for the time being) we can assume it could be any formula in L. Thus, in order for A to be certain that B will behave helpfully, it would seem that A must be able to prove, for all such formulas, a theorem schema of the form , which is to say that the existence of a proof of (which will eventually be supplied by B) implies that is true.

This is the point at which Löb's Theorem wags its finger: in the words of Yudkowsky and Herreshoff, "no formal system at least as powerful as Peano Arithmetic can prove its own soundness schema". In other words, it would seem that A cannot possibly prove in L that it will be able to trust an arbitrary output from B.

## The problem of induction in disguise

There's one obvious problem here, which I hinted at in the introduction: the same argument applies to A reasoning about itself! The only means that A has of concluding that its own actions will result in the desired consequences are by reasoning in L, and A has no way of proving that its musings in L have any bearing on its environment. This can be interpreted as an extremely condensed version of Hume's problem of induction [5], which consists in the observation that there is no purely logical basis for our inferences regarding, for example, the consequences of our actions.

I contend that applying Löb's Theorem to a purely syntactic system is a rather arcane way of arriving at this conclusion, since it's much easier to see how an agent's reasoning can fail to be sound with a simple example.

Earlier, I emphasised the phrase "**since we don't care precisely what the symbols in L refer to,**" a claim which I will now make more precise. While the precise details of any given implementation aren't important for arguing about generic behaviour, it *does* matter that the symbols in L refer to *something, *since the veracity (which is to say the "actual" truth value) of the soundness schema depends on it. Let me throw some vocabulary out for the important concepts here.

The symbols and formulas of L don't come with intrinsic meaning (no formal language does). To imbue them with meaning, we require a mapping from formulas in L to objects, values or concepts which the agent is able to identify in their environment. This mapping is called the * semantics *of L. The semantic map is typically constructed inductively, starting at interpretations of variables and extending this to formulas one logical connective or quantifier at a time. From another perspective, this mapping is an attempt to express the environment as a

*model*of the logic L.

In order for the agent to reason correctly about its environment (that is, come to valid conclusions using L) it is necessary that the semantics is *sound*, which is to say that all of the inference rules transform formulas whose interpretation is true into formulas of the same kind. In other words, the semantics should be *truth-functional* with respect to the inference rules of L.

It's easy to see how soundness can fail. Consider a chess-playing agent. If the agent isn't aware of some rule for how chess pieces may be moved (the agent isn't aware of the possibility of castling, say) then it's obvious how, under circumstances where those rules apply, they may come to incorrect conclusions regarding the game. The logic involved in this case may be much simpler than Peano Arithmetic!

For our purposes here, the take-away is that **soundness of the semantics cannot be guaranteed by syntax alone**. Again, we don't need Löb's Theorem to demonstrate this; producing non-sound semantics for any given logic is rarely difficult, since it suffices to deliberately make the semantics fail to be truth-functional (by deliberately misinterpreting formal conjunctions as "or", say).

## Soundness as a belief

No agent is able to justify their system of reasoning from immutable first principles; this is another presentation of Hume's problem of induction. If that claim sounds contentious, I highly recommend Nancy Cartwright's essay [2] on relativism in the philosophy of science. And yet, we (humans) continue to reason, and to expect the conclusions of our reasoning to represent valid statements about reality.

The crucial fact underpinning this apparent contradiction is that every agent, embedded or otherwise, must hold **metalogical beliefs** regarding the soundness of their reasoning. In the very simplest case of an agent with a rigid reasoning system, this will consist of a belief that the chosen semantics for their logic is sound. In other words, while the soundness schema appearing in Löb's Theorem can never be proved in L, *the agent necessarily carries a metalogical belief that (the metalogical version of) this soundness schema holds*, since this belief is required in order for A to confidently apply any reasoning it may perform within L.

A related metalogical belief is that of consistency: an agent cannot prove the consistency of its own reasoning (Gödel's second incompleteness theorem), but it must nonetheless believe that its logic is consistent for soundness to be a possibility in the first place.

Allowing for the existence of metalogical beliefs immediately dissolves the Löbian Obstacle, since this belief extends to proofs provided by other agents: as soon as agent A encounters a proof in L from any source, they will expect the conclusion to be validated in their environment by the same virtue that they expect the conclusions of their own reasoning to be valid. Agent A can delegate the task of proving things to subordinates with more computational power and still be satisfied with the outcome, for example.

## Where did the problems go?

I do not expect that the perception shift I'm proposing can have magically eliminated self-reference problems in general. A critical reader might accuse me of simply ignoring or obfuscating such problems. Let me try to convince you that my proposed shift in perspective achieves something meaningful, by explaining how Löb's theorem rears its head.

A devotee of formal logic might naively hope from my exposition so far that these so-called metalogical beliefs of soundness can simply be encoded as axioms in an expanded logic L' containing L. But Löb's theorem tells us that this will automatically make L' inconsistent!

Thus, I am advocating here that we interpret Löb's theorem differently. Rather than concluding that a logical agent simply cannot believe its own proofs to be valid (a conclusion which we humans flout on a daily basis), which brings us to a seemingly insurmountable obstacle, I am proposing that we relegate assertions of soundness to beliefs at the interface between purely formal reasoning and the semantics of that reasoning. This distinction is protected by Hume's guillotine: the observation that there is a strict separation between purely formal reasoning and value judgements, where the latter includes judgements regarding truth. The agent's beliefs cannot be absorbed into the logic L, precisely because syntax can never subsume (or presume) semantics.

To make this concrete, I am proposing that soundness assertions (and metalogical beliefs more generally) must include the additional data of a semantic mapping, an *aboutness condition*. Note that this doesn't preclude an agent reasoning about its own beliefs, since there is nothing stopping them from building a semantic map from their logic into their collection of beliefs, or indeed from their logic into itself. Incidentally, the proof of Löb's theorem requires just such a mapping in the form of a Gödel numbering (in modern modal logic proofs, the 'provability' modality takes on this role), although it is usually only a *partial *semantic map, since it is constrained to an interpretation of a representation of the natural numbers in L into a collection of suitably expressive formulas/proofs in L.

## Empirical belief updates

The impossibility of an *a priori* guarantee of soundness forces an abstract intelligent agent to resort to empiricism in the same way that we human agents do: the agent begins from a hypothetical belief of soundness with less than absolute certainty, and must test that belief in their environment. When failures of soundness are observed, the agent must adapt. Depending on the architecture of the agent, the adaptation may consist of updates to L, an update to the semantic map, or an update to the belief value (if the truth values employed by the agent in their beliefs are sufficiently structured to express domains of truth, say). I am deliberately hinting at some interesting possibilities here, which I may explore in more detail in a later post.

Under the belief that the Löbian Obstacle was a genuine problem to be avoided, several approaches have been proposed in the past decade. Foremost is identifying a probabilistic regime in which Löb's Theorem fails to hold, and by extension fails to produce an obstacle [3]. In the setting of agents reasoning about mathematics or purely formal systems, this has been vastly expanded into the concept of a *logical inductor* [4] which assigns probabilities to all formulas in its logic (under the implicit assumption of particular fixed semantics!).

However, fixed semantics are inflexible, and any system which assigns values (such as probabilities) directly to formulas in its logic must be extremely data-heavy, since it must store (encodings of) large numbers of formulas explicitly rather than implicitly. An immediate advantage of recognizing that semantics is, and indeed must be, separate from syntax is that it highlights the possibility that the semantic map may vary. Rather than carrying around a heavy knowledge base consisting of all of the statements which it knows to be true (with respect to a fixed semantics), an agent may instead apply the same formal reasoning in different situations and carry the much lighter load of empirical knowledge of situations in which its reasoning will be sound. Or, with a little refinement, the agent can carry an understanding of which fragment of its logic can be applied in a given situation.

To me, this feels a lot closer to my own experience of reasoning: I have some factual knowledge, but I often rely more heavily on my understanding of how to apply my reasoning abilities to a given situation.

---

[1] *Tiling Agents for Self-Modifying AI, and the Löbian Obstacle*, Yudkowsky, E and Herreshoff, M. intelligence.org, 2013

[2] *Relativism in the Philosophy of Science*, Cartwright, N., from *Relativism, a contemporary anthology, *2010, Columbia University Press. Available on researchgate.

[3] *Probabilistic Löb Theorem, *Armstrong, S., lesswrong.org, 2013

[4] *Logical Induction*, Garrabrant, S. *et al.*, arxiv, 2020

[5] *The Problem of Induction,* Henderson, L., *The Stanford Encyclopedia of Philosophy *(Spring 2020 Edition), Edward N. Zalta (ed.), url

[6] *An Introduction to Löb’s Theorem in MIRI Research, *LaVictoire, P., 2015, intelligence.org

It seems like you just get a new system, L', which believes in the soundness of L, but which doesn't believe in its own soundness. So the agent can trust agents who use L, but cannot trust agents who additionally have the same meta-logical beliefs which allow

themto trust L. Meaning, the agent cannot trust itself.I suppose if the meta-logical belief is "L plus this very meta-logical belief is sound", yeah.

I don't understand why you would put truth on the "ought" side of the is/ought divide, or if you do, how it helps us out here. I would argue that the lobian obstacle paper already makes an appropriate use of the guillotine,

in thatit points out that it was only thenaiveassumption about the connection between is and ought (that is, the naive assumption about the connection between logic and taking actions) which created the obstacle. The solutions which the paper explores modify both the formal logicandthe conditions under which actions are taken, which seems to be a key insight.I don't get how "interpreting lobs theorem differently" helps us out here, or, what it really means.

Surely our attitudes about the agent don't themselves make the agent any more capable. You must have some modification of the agent designs (from the lobian obstacle paper) in mind. What is it?

This isn't concrete enough for me. Suppose an agent has just such a mapping, in its back pocket. Seems like the same exact proofs as in the paper go through... so just having such a mapping doesn't "immediately dissolve" the lobian obstacle. I suppose you mean that the agent

changes its reasoningsomehow. Yet you explicitly state that we don't just add to the list of axioms. So how does the meta-logical belief interact with everything else? Something about the action condition is different?What do you mean when you say that a logical inductor has an implicit assumption of particular fixed semantics? (I'm not saying this part is crazy, I'm just wondering what you mean.)

A doesn't need B to believe that the logic is sound. Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), the semantic map is defined on L, not on L'. In the practical situation under consideration when arriving at the Löbian Obstacle, the consequence is that A needn't worry about forming beliefs about L' for the purposes of trusting B's (actions based on) proofs in L to be sound.

I didn't understand this remark; please could you clarify?

To put soundness in is/ought form, the belief that A must hold is that "if phi is provable in L, the interpretation of phi (via the semantic map) ought to be true". Truth can't be moved to the other side, because as I've tried to explain (perhaps unsuccessfully) the logic doesn't include its own semantics, and it's always possible to take contrarian semantics which fail to be sound applications of L. (see also final point below)

The interpretation which results in the Löbian Obstacle is "Löb's theorem tells us that a logical agent can't trust its own reasoning, because it can't prove that reasoning is sound," and under that interpretation it seems that extreme measures must be taken to make formal logic relevant to AI reasoning, which is counterintuitive since we humans employ formalizable reasoning every day without any such caveats. In this post I'm saying "Löb's theorem reminds us that a logical agent cannot make

a prioriclaims about the soundness of its own reasoning, because soundness is a statement about a semantic map, which the logic has noa prioricontrol over".No, the agent doesn't change its reasoning. Assume that the agent A is reasoning soundly about their environment, which is to say that their semantic mapping is sound. Then A's belief in the soundness of their reasoning is justified. The change is that we don't require A to prove that their semantic mapping is sound, because A cannot do that, and I'm claiming that

this doesn't break anything.If you want me to make it more formal, here: suppose I have a logic L and a universe U. For simplicity, let's say U is a set. The semantic mapping is a map from the collection of formulas of L to the collection of concepts in U; it may be the case that symbols in L get mapped to collections of objects, or collections of collections, but for argument's sake we can assume the codomain to be some number of applications of the powerset operation to U, forming a collection C(U) of "concepts". So it's a mapping S: L --> C(U). The crucial thing is that S is

nota member of C(U), so it can't be the image of any symbol in L under S. That is, S is outside the realm of things described by S, and this is true for any such S! Since "phi is provable in L means phi is trueunder S" is a statement involving S (even if the 'under S' is usually left implicit), it cannot be the interpretation under S of any formula in L, and so cannot bestated, let alone proved.Most of the logics we encounter in mathematical practice are built with an

intended semantics. For example, Peano Arithmetic contains a bunch of symbols which are often informally treated as if they are "the" natural numbers, despite the fact that they are no more than formal symbols and that the standard natural numbers are not the only model of the theory. In the context of logic applied to AI, this results in a conflation between the symbols in a logic employed by an agent and the "intended meaning" of those labels. This happens in the logical induction paper when discussing PA: the formulas the agent handles are assumed to carry their intended meaning in arithmetic.Actually, that's misconstruing the formal results of the paper, since logical inductors have formal systems as their subjects rather than any fixed semantics. However, it's clear from the motivation and commentary (even within the abstract) that the envisaged use-case for inductors is to model an agent forming beliefs about the

truthof formal statements, which is to say their validity in some specific model/semantics.My problem is that I still don't understand

how you propose for the agent to reason/behave differentlythan what I've described; so, your statement that it does in fact do something different doesn't help me much, sorry.The semantic map is defined on L, not L' -- sure, this makes some sense? But this just seems to reinforce the idea that our agent can only "understand" the internal logic of agents who restrict themselves to only use L (not any meta-logical beliefs).

Seems like you missed my point that the meta-logical belief could just be "L is sound" rather than "L plus me is sound". Adding the first as an axiom to L is fine (it results in an L' which is sound if L was sound), while adding the second as an axiom is very rarely fine (it proves soundness and consistency of the whole system, so the whole system had better be too weak for Godel's incompleteness theorems to apply).

Does that make sense?

You were talking about the second sort of situation (where adding the meta-logical belief as an axiom would result in an inconsistent system, because it would claim its own soundness); I wanted to point out that we could also be in the first sort of situation (where adding the meta-logical belief would result in a perfectly consistent system, but would only let you trust L, not L').

To me, this "ought" in the sentence reads as a prediction (basically an abuse of 'ought', which is in common usage basically because people make is/ought errors). I would prefer to re-phrase as "if phi is provable in L, then the interpretation of phi will be true" or less ambitiously "will probably be true".

Is your proposal that "X is true" should be taken as a statement of X's desirability, instead? Or perhaps X's normativity? That's what it means to put it on the "ought" side, to me. If it means something different to you, we need to start over with the question of what the is/ought divide refers to.

I agree that (by Tarski's undefinability theorem) a logic can't know its own semantic map (provided it meets the assumptions of that theorem). At least, if we're on the same page about what a semantic map is.

However, Tarski's analysis also shows how to build a stronger logic L' which knows the semantic map for L. So if "move to the other side" means represent via more formal axioms (which is what I take you to mean), you seem to be wrong.

My problem with the general format of this argument is,

even if you re-interpret it. That is: Eliezer et al show that the assumptions of Lob's theorem apply, and thereby, they conclude a formal result. No amound of re-interpreting the theorem changes that; one might as well suggest that we solve the Lobstacle by forgetting Lob's theorem.the other interpretation of Lob's theorem is still validThat is what I meant when I said I don't get how it helps.

It seems to me like your post is missing a step where you design an alternative agent based on your improved thinking. Probably the alternative agent is supposed to be obvious from what you wrote, but I'm not getting it.

So how do you escape the conclusion from the paper??

Justified, perhaps, but also non-existent, right? You say the agent doesn't change its reasoning. So the reasoning is exactly the same as the lobstacle agent from the paper. So it doesn't conclude its own soundness. Right??

I have a couple of possible interpretations of what you might mean here, all of them really poor:

Right, so, we seem to be on the same page about what the semantic map basically is and Tarski's undefinability theorem and such.

But what I need is a sketch

of the agent that you claim solves the Lobstacle.Sure, but

. It doesn't make a lick of difference to the algorithm whether you believe that PA refers to the standard model. Nor does it feature in the mathematical results.logical induction doesn't know anything about the intended semanticsSo I would put it to you that logical induction does a good job of capturing what it means to reason in a semantics-ignorant way. If it also works well as a model of reasoning for semantic-ful things, well, that would seem to be a happy accident. And perhaps it doesn't: perhaps logical induction is missing an aspect of rationality around "semantics".

Thus, logical induction would appear to be precisely what you call for at the end of your post: a theory of rationality which reasons in a semantics-independent way.

Aha! I knew I must be missing something, thanks for the clarification. That makes things easier. I'll continue to use L' to mean "L + Sound(L,S)", where S is a particular semantic map.

The Löbian obstacle is about

trustin thereasoningperformed by a subordinate agent; the promise of subsequent actions taken based on that reasoning are just a pretext for considering that formal problem. So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs; B could just be manipulating these as formal expressions. If you insist that B's beliefs be incorporated into its reasoning, as you seem to want to (more on that below), then I'm saying it doesn't matter what extension of L the agent B is equipped with; it can even be an inconsistent extension, as long as only valid L-proofs are produced.The word "normative" sticks out to me as potential common ground here, so I'll use that language. The specified semantic map determines what is "actually" true, but its content is not

a prioriknowledge. As such, the only way for A's reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.If this sounds farfetched, consider how police dramas employ the normative nature of truth for dramatic effect on a regular basis. A rogue detective uses principles of reasoning that their superiors deem invalid, so that the latter do not expect the detective's deductions to yield results in reality... Or perhaps the practice of constructive mathematics would be a more comfortable example, where the axiom of choice is rejected in favour of weaker deductive principles. A dedicated constructivist could reasonably make the claim that their chosen fragment of logic determines what "ought" to be true.

A compromise which was suggested to me this past week is to work with a two-level logic. It seems to me disingenuous that "A reasons in L" should mean "actually, A reasons in the stronger logic L' which incorporates all the axioms it needs to ensure that provability in L implies truth", as the Tarskian construction suggests. After all, if A attempts to justify that decision to itself, it seems to me to trigger an infinite regress of more and more powerful logics. Identifying and segregating the axioms which refer to the semantic map S as "meta-logical beliefs" is enough to avoid such an existential crisis, just as we only need to employ PA + Sound(PA,N), where N is our favourite model of the natural numbers, to use PA to reason about those natural numbers.

You seem exasperated that I'm not incorporating the meta-logical beliefs into the formal system, but this is because

in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents. If I build a chess-playing robot, its program will not explicitly include the assumption that the rules of chess it carries are the correct ones, nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map. This is why I said the agent doesn't change its reasoning.The metalogical beliefs thus are only relevant when an agent applies its (completed) reasoning to its environment via the semantic map. We can formalize that situation via the extended logic L' if you like, in which case this is probably the answer that you keep demanding:

Both A and B "reason" in L' (B could even work in a distinct extension of L), but will only accept proofs in the fragment L. Since the axioms in L' extending L are identifiable by their reference to the semantic mapping, there is no risk of "belief contamination" in proofs, if that were ever a concern.I expect there is more I should say to justify how this avoids the obstacle, but I will wait for your feedback first.

Without an intended semantics, what the probabilities assigned to formulas can only be interpreted as beliefs/levels of certainty about provability of statements from axioms (which it also believes with more or less certainty). This is great, but as soon as you want your logical inductor to reason about a particular mathematical object, the only way to turn those beliefs about provability into beliefs about truth in the model, you need to extend the inductor (explicitly or implicitly) with meta-logical beliefs about the soundness of that map, since it will update its beliefs based on provability

even if its proof methods aren't sound.As such, I feel you've misunderstood me here. I don't want semantics-independent reasoning at all, if anything the opposite: reasoning that prioritises verifying soundness of a logic wrt specified semantic maps in an empirical way, and which can adapt its reasoning system when soundness is shown to fail. A logical inductor isn't equipped with the capacity to work in mere fragments of its logic (it could be modified to do so, but that's certainly not the main

pointof this algorithm, as I understand it), so can only go as far as identifying when its reasoning is not sound by observing that it believes a contradiction, without being able to do anything about it besides modifying the beliefs it holds in its axioms indefinitely.I'm not

insistingon anything about the agent design; I simply remain puzzled how your proposal differs from the one in the paper, and so, continue to assume it's like the one in the paper except those differences which I understand. Your statements have been reading almost like you think you've refuted formal arguments in the paper (implying you don't think the formal agent designs do what I think they do), rather than providing a new agent design which does better. This contributes to my feeling that you are really just wrestling with Lob's theorem itself. Nor do I recall you saying much to dissuade me of this. In your OP, you use the language "Under the belief that the Löbian Obstacle was a genuine problem to be avoided", as if the original idea was simply mistaken.So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob's theorem itself, or do you think it is mistaken?

But using proofs in L is precisely what I understood from "belief"; so, if you have something else in mind, I will need clarification.

I think I haven't understood much of this.

I would argue that the only way for A's reasoning in L to have practical value is for A to

acton such reasoning. The only way for A's reasoning to have practical valueis forfor some other agent Bto be working under the assumption that provability-in-L implies truth under S. Indeed, this seems in line with your detective drama example. If the superiors were put in the position of the rogue detective, often they would have come to the same conclusion. Yet, looking at the problem from a removed position, they cannot accept the conclusion. But this is precisely the lobstacle: the conditionBfor reasoning to have value to an agent Ais different from the conditionfor A's reasoning to have value to another agent B.Let me therefore make a (not totally formal) distinction based on the alief/belief idea:

This is also a version of the distinction between tacit knowledge (eg knowing how to ride a bike) vs explicit knowledge (eg being able to explain scientifically how bike-riding works). To say that an agent "alieves" a logic L is to say that it tacitly knows how to reason in L.

Roughly, I think you have been talking about beliefs, where I have been talking about aliefs.

Now, a "naive logical agent" (ie the simple agent design which falls prey to the Lobstacle, in the paper) has a few important properties:

alievesL, but does notbelieveL (this is a re-statement of the lobstacle).trustanother agent if itbelievesthat agent'saliefs.These statements are not necessarily true of logical agents in general, of course, since the paper gives some counterexamples (agents who can trust the same logic which they base their actions on). But those counterexamples all have some oddities. I presume your goal is to get rid of those oddities while still avoiding the lobstacle.

The thing I'm exasperated by is, rather, that

I don't understand what youwith the meta-logical beliefs if youaredoingaren'tincluding them in the formal system. It's fine if you don't include them in the formal system; just tell me what you are doing instead! This was the point of my "I have the semantic map here in my back pocket" argument: fine, I'm willing to grant that the agent 'has' these meta-logical beliefs, I just don't yet understand how you propose the agentusethem.I mean, I'm particularly exasperated because you

explicitly statethat the meta-logical beliefs don't change the agent's reasoning at all; it seems like you're boldly asserting that theyare justa talisman which the agent keeps in its back pocket, and that this somehow solves everything.In the new terminology I am proposing, I think perhaps what you are saying is that the meta-logical stuff is

aliefsand thus not apparent in thebeliefsof the system. What do you think of that interpretation?(Obviously, I don't think your proposal solves the Lobstacle in that case, but, that is a separate question.)

Right. To me this overtly reads as "a chess playing robot only

alievesthe rules of chess. It has no explicit endorsement of those rules." Which makes perfect sense to me.But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don't bear on its moves. Can you clarify?

I still don't understand this part at all; could you go into more depth?

This helps clarify a bit; it sounds like maybe we were conflating different notions of "reasoning". IE, I interpreted you as saying

the semantic map does not enter into the agent's algorithm anywhere, not in the logical deduction system nor in the reasoning which the agent uses to decide on actions based on those logical deductions; whereas you were sayingthe semantic map does not enter into the agent's logical deductions, but rather, has to do with how those deductions are used to make decisions.But I am still confused by your proposal:

(It seems worth clarifying that I'm only demanding that the meta-logical reasoning be formalized

for us,ie that you give a formal statement of the agent design which I can understand, not formalizedfor the agents, ie put into their formal system)Lob's theorem says that the conclusions X where L can believe its own reasoning to be sound are precisely the cases where L can prove X, that is, □(□X→X)→□X. So

normally, L' can accept proofs in L' just fine, if they are concretely given. The lobstacle has to do with a logictrusting itself in the abstract,that is, trusting proofs before you see them. (Hence, the lobstacle doesn't block an agent from trusting its future self if it can concretely simulate what its future self can do, because there will be no specific reasoning step which it will reject; what the lobstacle blocks is for an agent to trust its future self with unforseen chains of reasoning.)So, one interpretation: you propose to modify this situation, by restricting L' to

only accept proofs given in L.This interpretation seems to conform to my "back pocket" argument. Both A and B "reason" in L', but really, only accept L. (This explains why you said '"reason"' in scare quotes.) But, I'm left wondering where/how L' matters here, and how this proposal differs from just building naive logical agents who use L (and fall prey to the lobstacle).

Thus, I'm probably better off interpreting you as referring to "accepting proofs" in the abstract, rather than concretely given proofs. With this interpretation, it seems you mean to say that A and B

alieveL' (that is, they use L' internally), but onlyexplicitly endorseproofs in L (that is, they can only accept principles of reasoning which warrant L-proofs, not L'-proofs, when they abstractly consider which proofs are normative).But then, your bolded statement seems to just be a re-statement of the Lobstacle: logical agents can't explicitly endorse their own logic L' which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.

This part doesn't seem totally relevant any more, but to respond: sure, but the standard model is an infinite object, so we don't really know how to extend the inductor to (implicitly or explicitly) know the intended semantic map. (Although this is a subject of some interest for me.) It's plausible that there's no way. But then, it also becomes plausible that humans don't do this (ie don't actually possess a full semantic map for peano arithmetic).

Instead, I've come to think about us (like logical inductors) as being uncertain about the model (ie uncertain whether numbers with specific properties exist, rather than possessing any tool which would fully differentiate between standard and non-standard numbers).

Your critique of logical induction in the post was "fixed semantics are inflexible". But now it seems like you are the one who proposes to have a fixed semantics (carrying a specific semantic map, eg, the standard model), and modifying the logic when that semantic map shows the logic to be unsound. In contrast, logical induction is the one who proposes to have no fixed semantic map (instead being uncertain about the model), and modifying

thosebeliefs based on what the logic proves.My critique of this is only that I don't know how to "have the standard model". In general, you propose that the agent has a semantic map which can be used to check the soundness of the logic. But I am skeptical: a semantic map will usually refer to things we cannot check (EG the external world, or uncomputable facts about numbers). Don't get me wrong: I would love to be able to check those things more properly. It's just that I don't see a way (beyond what is already possible with logical reasoning, or more haphazardly but more powerfully, with the probabilistic extension of logical reasoning that logical induction gives us).

(And I also don't yet see what that part has to do with getting around the Lobstacle.)

I like the alief/belief distinction, this seems to carry the distinction I was after. To make it more formal, I'll use "belief" to refer to 'things which an agent can prove in its reasoning engine/language (L)', and "alief" to refer to beliefs plus 'additional assumptions which the agent makes about the bearing of that reasoning on the environment', which together constitute a larger logic (L'). Does that match the distinction you intended between these terms?

An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn't believe (in the conventional sense) that the L-proof involved is valid.

There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent "does not alieve that a proof of X in L implies that X is true"? ¬(□X→X) is classically equivalent to □X∧¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L' and get on with it.

Yes. The mathematical proofs and Löb's theorem are absolutely fine. What I'm refuting is their relevance; specifically the validity of this claim:

My position is that *when their beliefs are sound* an agent only ever needs to *alieve* another agents *beliefs* in order to trust them. A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle because "beliefs being strictly weaker than aliefs" is the default (indeed, by Löb's theorem, the only reasonable possibility), and can be implemented symmetrically between multiple agents; no heirarchy of logics or other such solution is needed.

Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof). Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.

In the language you introduce, and bearing in mind what I said above, this would be restated as "it doesn't matter what B's aliefs are". I can use this to illustrate why I needed to include the condition "when their beliefs are sound" in the above: suppose A and B have have differing/imperfect knowledge about the state of the environment, to the effect that B may disalieve the soundness of L, while A alieves it. The result is that B might alieve they are misleading A by communicating an L-proof which B alieves to be invalid. But if A's application of L is in fact sound (so B is in the wrong) then A can still trust and successfully apply the L-proof supplied by B.

You're right, this part was a mistake. Such a robot will optimize play with respect to what it alieves are the rules, and so will become very good at what it implicitly alieves chess to be.

I was taking "reasoning" here to mean "applying the logic L" (so manipulating statements of belief), since any assumptions lying strictly in L' are only applied passively. It feels strange to me to extend "reasoning" to include this implicit stuff, even if we are including it in our formal model of the agent's behaviour.

It's certainly a restatement of Löb's theorem. My assertion is that there is no resultant obstacle.

Re the rest,

It's not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it's also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.

Unfortunately, this seems

almost oppositeto the way I was defining the terminology. I had it that thealiefsare precisely what is proven by the formal system, and thebeliefsare what the agent would explicitly endorse if asked. Aliefs are what you feel in your bones. So if the "bones" of the agentarethe formal system, that's the aliefs.Note also that your definition implies that if an agent alieves something, it must also believe it. In contrast, part of the point for me is that an agent can alieve things without believing them. I would also allow the opposite, for humans and other probabilistic reasoners, though for pure-logic agents this would have to correspond to unsoundness. But pure-logical agents

have tohave the freedom to alieve without believing, on pain of inconsistency, even if we can't model belief-without-alief in pure logic.I find it interesting that you (seemingly) nodded along with my descriptions, but then proposed a definition which was almost opposite mine! I think there's probably a deep reason for that (having to do with how difficult it is to reliably distinguish alief/belief), but I'm not grasping it for now. It is a symptom of my confusion in this regard that I'm

not even sure we're pointing to different notions of belief/aliefeven though your definition sounds almost opposite to me. It is well within the realm of possibility that we mean the same thing, and are just choosing very different ways to talk about it.Specifically, your definition seems fine if L is not the formal language which the agent is hard-wired with, but rather, some logic which the agent explicitly endorses (like the relationship that we have with Peano Arithmetic). Then, yeah, "belief" is about provability in L, while "alief" implies that the agent has some "additional assumptions about the bearing of that reasoning on the environment". Totally! But then, this suggests that those

additional assumptionsare somehow represented in some other subsystem of the agent (outside of L).The logic of that other subsystemis whatI'minterested in. If that other subsystem uses L', then it makes sense that the agent explicitly believes L. But now thealiefsof the agent are represented by L'. That is: L' is the logic within the agent's bones. So it's L' that I'm talking about when I define "aliefs" as the axiomatic system, and "beliefs" as more complicated (opposite to your definition).Over this discussion, a

possible interpretationof what you're saying that's been in the back of my mind has been thatyou think agents should not rely on formal logic in their bones, but rather, should use formal logic only as part of their explicit thinkingIE a form of thinking which other, "more basic" reasoning systems use as a tool (a tool which they can choose to endorse or not). For example, you might believe that an RL system decides whether to employ logical reasoning. Or deep learning. Etc. In this case, you might say that there is no L' to find (no logical system represents the aliefs).To be clear, I

dothink this is a kind of legitimate response to the Lobstacle: the response of rejecting logic (as a tool for understanding what's going on in an agent's "bones" IE their basic decision architecture). This approach says: "Look what happens when you try to make the agent overly reliant on logic! Don't do that!"However, the Lobstacle is specifically about using logic to describe or design decision-making procedures. So, this 'solution' will not be very satisfying for people trying to do that. The puzzling nature of the Lobstacle remains: the claim is that RL (or something) has to basically solve the problem; we can't use logic. But why is this? Is it because agents have to "be irrational" at some level (ie, their basic systems can't conform to the normative content of logic)?

Anyway, this may or may not resemble your view. You haven't explicitly come out and said anything like that, although it does seem like you think there should be a level beyond logic in some sense.

It seems like you think this property is so important that it's almost definitional, and so, a notion of belief which doesn't satisfy it is in conflict with the usual notion of belief. I just don't have this intuition. My notion of belief-in-contrast-to-alief does contrast with the informal notion, but I would emphasize other differences:

much lessimplies that you'll act on a belief; our intuitive notion more often predicts that you'll act on something if you really believe it.In other words, I see the intuitive notion of belief as really consisting of two parts, belief and alief, which are intuitively assumed to go together but which we are splitting apart here.

The property you mention is derived from this conflation, because

in factwe need to alieve a reasoning process in order to believe its outputs; so if we're conflating alief and belief, then it seems like we need to believe that L-proofs are valid in order to see L-proofs and (as a direct result) come to believe what's been proved.But this is precisely what's at stake, in making the belief/alief distinction: we want to point out that this isn't a healthy relationship to have with logic (indeed, Godel shows how it leads to inconsistency).

I like that you pointed this out, but yeah, it doesn't seem especially important to our discussion. In any case, I would define disalief as something like this:

Note that the agent might also

disbelievein such a rule, IE, might expect some such proofs to have false conclusions. But this is, of course, not necessary. In particular it would not be true in the case of logics which the agent explicitly endorses (and therefore must not alieve).Hrm. I suspect language has failed us, and we need to make some more distinctions (but I am not sure what they are).

In my terms, if A alieves B's beliefs, then (for example) if B explicitly endorses ZFC, then A must be using ZFC "in its bones". But if B explicitly endorses ZFC, then the logic which B is

usingmust bemore powerfulthan that. So A might be logically weaker than B (if Aonlyalieves ZFC, and not the stronger system which B used to decide that ZFC is sound). If so, A cannot trust B (A does not alieve that B's reasoning is sound, that is, A does not believe B).I have to confess that I'm confusing myself a bit, and am tempted to give yet another (different, incompatible) definition for the alief/belief split. I'll hold off for now, but I hope it's very clear that I'm not accusing all confusion about this as coming from you -- I'm aiming to minimize confusion, but I still worry that I've introduced contradictory ideas in this conversation. (I'm actually tempted to start distinguishing 3 levels rather than 2! Alief/belief are relative terms, and I worry we're actually conflating important subtleties by using only 2 terms rather than having multple levels...)

This goes back to the idea that you seem to think "belief in X implies belief in the processes whereby you came to believe in X" is so important as to be definitional, where I think this property has to be a by-product of other things.

In my understanding, the definition of "trust" should not explicitly allow or disallow this, if we're going to be true to what "trust" means. Rather, for the Lobstacle, "A trusts B" has to be defined as "A willingly relies on B to perform mission-critical tasks". This definition does indeed fail to be true for naive logical agents. But this should be an argument against naive logical agents, not

our notion of trust.Hence my perception that you do indeed have to question the theorems themselves, in order to dispute their "relevance" to the situation. The definition of trust seems fixed in place to me; indeed, I would instead have to question the relevance of your alternative definition, since what I actually want is the thing studied in the paper (IE being able to delegate critical tasks to another agent).

Ok, but if agent B can only act on valid L-proofs, it seems like agent B has been given a frontal lobotomy (IE this is just the "make sure my future self is dumber" style of solution to the problem).

Or, on the other hand, if the agent A

alsorespects this same restriction, then Acannot delegate tasks to B(because A can't prove that it's OK to do so, at least not in L, the logic which it has been restricted to use when it comes to deciding how to act).Which puts us back in the same lobstacle.

Is this a crux for you? My thinking is that this is going to be a deadly sticking point. It seems like you're admitting that your approach has this problem, but, you think there's value in what you've done so far because you've solved one part of the problem and you think this other part could also work with time. Is that what you're intending to say? Whereas to me, it looks like this other part is just doomed to fail, so I don't see what the value in your proposal could be.

For me, solving the Lobstacle

meansbeing able to actually decide to delegate.I think I get what you're saying here. But if the assumptions lying strictly in L' are only applied passively, how does it help us? I'm thinking your current answer is (as you've already said) "trusting that B will do what they've said they'll do is a separate problem" -- IE you aren't even trying to build the full bridge between B thinking something is a good idea and A trusting B with such tasks.

I still really, really don't get why your language is stuff like "there's no resultant obstacle" and "what I'm refuting is their relevance". Your implication is "there was not a problem to begin with" rather than "I have solved the problem". I asked whether you objected to details of the math in the original paper, and you said no -- so apparently you would agree with the result that naive logical agents fail to trust their future self (which is the lobstacle!). Solving the lobstacle would presumably involve providing an improved agent design which would avoid the problem. Yet this seems like it's not what you want to do -- instead you claim something else, along the lines of claiming that there's not actually a problem?

So, basically, what do you claim to accomplish? I suspect I'm still really misunderstanding that part of it.

Since you're now identifying it as another part of the perspective shift which you're trying to communicate (rather than just some technical distraction), it sounds like it might actually be pretty helpful toward me understanding what you're trying to get at. But, there

arealready a lot of points floating around in this conversation, so maybe I'll let it drop.I'm somewhat curious if you think you've communicated your perspective shift to any other person; so far, I'm like "there just doesn't seem to be anything real here", but maybe there are other people who get what you're trying to say?

I don't know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I'm pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the "reasoning engine" (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that that beliefs could be "more complicated" than aliefs, which made me unsure.

Anyway, since you keep taking the time to thoroughly reply in good faith, I'll do my best to clarify and address some of the rest of what you've said. However, thanks to the discussion we've had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don't want to muddy the waters further!

Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform "mission-critical tasks"), at least as far as my own logical reasoning is concerned, and an agent that doesn't trust itself is going to waste a lot of time second-guessing its own actions. So I don't think you've addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.

There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper

is constrained to act only once it has produced a provably effective strategy and is constrained to follow that strategymeans that the Löbstacle only applies to questions concerning the (formal)reasoningof a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn't exist (because it's founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn't address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you're claiming that Löb's theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is "doomed to fail".The reason I see my post as progress is that currently the Löbstacle is blocking serious research in using simple systems of formal agents to investigate such important problems as the alignment problem.

Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.

Indeed, my claim boils down to saying that there is no problem. But I don't see why that doesn't constitute a solution to the apparent problem. It's like the Missing Dollar Riddle; explaining why there's no problem

isthe solution.There's no real way for me to know. Everyone who I've spoken to about this in person has gotten it, but that only amounts to a handful of people. It's hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.

I should pre-emptively correct my "formal" argument, since it's not true that S can

neverbe in its own codomain; arguably I can construct U so that C(U) contains the names of some semantic maps as elements (although in this purely set-theoretic set-up, it's hard to see how doing so would capture their content). Nonetheless, a diagonalisation argument that depends only on L and C(U) being non-trivial demonstrates that C(U) cannot containeverysemantic map, which I think should be enough to salvage the argument.