Morgan_Rogers

Escaping the Löbian Obstacle

I like the alief/belief distinction; it seems to capture the distinction I was after. To make it more formal, I'll use "belief" to refer to 'things which an agent can prove in its reasoning engine/language (L)', and "alief" to refer to beliefs plus 'additional assumptions which the agent makes about the bearing of that reasoning on the environment', which together constitute a larger logic (L'). Does that match the distinction you intended between these terms?

An immediate pedagogical problem with this terminology is that we have to be careful not to conflate this notion of belief with the usual one: an agent will still be able to prove things in L even if it doesn't believe (in the conventional sense) that the L-proof involved is valid.

There is a more serious formalization issue at play, though, which is the problem of expressing a negative alief. How does one express that an agent "does not alieve that a proof of X in L implies that X is true"? The naive formalization, ¬(Prov_L(X) → X), is classically equivalent to Prov_L(X) ∧ ¬X, which in particular is an assertion of both the existence of a proof of X and the falsehood of X, which is clearly far stronger than the intended claim. This is going off on a tangent, so for now I will just assume that it is possible to express disaliefs by introducing some extra operators in L' and get on with it.

So, to be absolutely clear on this: do you accept the mathematical proofs in the original paper (but propose some way around them), or do you think they are actually mistaken? Do you accept the proof of Lob's theorem itself, or do you think it is mistaken?

Yes. The mathematical proofs and Löb's theorem are absolutely fine. What I'm refuting is their relevance; specifically the validity of this claim:

An agent can only trust another agent if it believes that agent's aliefs.

My position is that *when their beliefs are sound*, an agent only ever needs to *alieve* another agent's *beliefs* in order to trust them. A definition of trust which fails to be reflexive is clearly a bad definition, and with this modified definition there is no obstacle, because "beliefs being strictly weaker than aliefs" is the default (indeed, by Löb's theorem, the only reasonable possibility) and can be implemented symmetrically between multiple agents; no hierarchy of logics or other such solution is needed.
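For reference, here is the standard statement of Löb's theorem that I'm leaning on, written as a small sketch in provability-logic notation (the box notation is my own shorthand, not anything taken from the original paper):

```latex
% Löb's theorem: for any sentence P,
% if L proves "provability of P implies P", then L proves P outright.
\text{If } L \vdash \Box_L P \rightarrow P, \text{ then } L \vdash P.
% Read contrapositively: L cannot prove \Box_L P -> P for any P it does not
% already prove, so "believing" (proving in L) is necessarily strictly weaker
% than "alieving" (proving in L plus assuming those proofs bear on the world).
```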

Note that following the construction in the article, the secondary agent B can only act on the basis of a valid L-proof, so there is no need to distinguish between trusting what B says (the L-proofs B produces) and what B does (their subsequent action upon producing an L-proof). Attaining this guarantee in practice, so as to be able to trust that B will do what they have promised to do, is a separate but important problem. In general, the above notion of trust will only apply to what another agent says, or more precisely to the proofs they produce.

> So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs;

But using proofs in L is precisely what I understood from "belief"; so, if you have something else in mind, I will need clarification.

In the language you introduce, and bearing in mind what I said above, this would be restated as "it doesn't matter what B's aliefs are". I can use this to illustrate why I needed to include the condition "when their beliefs are sound" in the above: suppose A and B have differing/imperfect knowledge about the state of the environment, to the effect that B may disalieve the soundness of L, while A alieves it. The result is that B might alieve they are misleading A by communicating an L-proof which B alieves to be invalid. But if A's application of L is in fact sound (so B is in the wrong), then A can still trust and successfully apply the L-proof supplied by B.

> ...nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map.

But then, this part of the sentence seems to go completely off the rails for me; a chess-playing robot will be no good at all if the rules of chess don't bear on its moves. Can you clarify?

You're right, this part was a mistake. Such a robot will optimize play with respect to what it alieves are the rules, and so will become very good at what it implicitly alieves chess to be.

> This is why I said the agent doesn't change its reasoning.

I still don't understand this part at all; could you go into more depth?

I was taking "reasoning" here to mean "applying the logic L" (so manipulating statements of belief), since any assumptions lying strictly in L' are only applied passively. It feels strange to me to extend "reasoning" to include this implicit stuff, even if we are including it in our formal model of the agent's behaviour.

But then, your bolded statement seems to just be a re-statement of the Löbstacle: logical agents can't explicitly endorse their own logic L' which they use to reason, but rather, can only generally accept reasoning in some weaker fragment L.

It's certainly a restatement of Löb's theorem. My assertion is that there is no resultant obstacle.

Re the rest,

(And I also don't yet see what that part has to do with getting around the Löbstacle.)

It's not relevant to getting around the Löbstacle; this part of the discussion was the result of me proposing a possible advantage of the perspective shift which (I believe, but have yet to fully convince you) resolves the Löbstacle. I agree that this part is distracting, but it's also interesting, so please direct message me (via whatever means is available on LW, or by finding me on the MIRIx Discord server or AI alignment Slack) if you have time to discuss it some more.

What are some claims or opinions about multi-multi delegation you've seen in the memeplex that you think deserve scrutiny?

Critch's comments support an opinion I've held since I started thinking seriously about alignment: that the language we use to describe it is too simple, and ignores the fact that "human" interests (the target of alignment) are not the monolith they're usually presented as.

For your specific question about multi-multi, I only have limited access to the memeplex, so I'll just share my thoughts. Multi-multi delegation involves:

1. Compromise / resolution of conflicts of interest between delegators.

2. Mutual trust in delegators regarding communication of interests to delegatees.

3. Equitable control between delegators. This could be lumped in with conflicts of interest, but deserves special attention.

4. Capacity for communication and cooperation between delegatees.

... and some other aspects I haven't thought of. As far as I can see, though, the most important issues here would be addressed by consideration of single-multi and multi-single; multi-multi-specific problems will only be relevant when there are obstacles to communication between either delegators or delegatees (a conceivable future problem, but not a problem as long as the complexity of systems actually being constructed is limited).

Escaping the Löbian Obstacle

Seems like you missed my point that the meta-logical belief could just be "L is sound" rather than "L plus me is sound". Adding the first as an axiom to L is fine (it results in an L' which is sound if L was sound), while adding the second as an axiom is very rarely fine (it proves soundness and consistency of the whole system, so the whole system had better be too weak for Gödel's incompleteness theorems to apply).

Aha! I knew I must be missing something, thanks for the clarification. That makes things easier. I'll continue to use L' to mean "L + Sound(L,S)", where S is a particular semantic map.
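To spell out the contrast from the quoted remark, here is a hedged formalization of my own (True_S is shorthand for "the interpretation under the semantic map S is true"):

```latex
% "L is sound with respect to S": one axiom per formula phi of L.
\mathrm{Sound}(L,S):\qquad \Box_L\,\varphi \;\rightarrow\; \mathrm{True}_S(\varphi)
% L' := L + Sound(L,S); if L is sound under S, then so is L'.
% The problematic alternative, "L plus this very belief is sound", would instead
% adopt  \Box_{L'}\,\varphi \rightarrow \mathrm{True}_S(\varphi)  as an axiom of
% L' itself, which (per the quoted remark) is only tenable for systems too weak
% for Gödel's incompleteness theorems to apply.
```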

> A doesn't need B to believe that the logic is sound. Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), the semantic map is defined on L, not on L'.

My problem is that I still don't understand how you propose for the agent to reason/behave differently than what I've described; so, your statement that it does in fact do something different doesn't help me much, sorry.

The semantic map is defined on L, not L' -- sure, this makes some sense? But this just seems to reinforce the idea that our agent can only "understand" the internal logic of agents who restrict themselves to only use L (not any meta-logical beliefs).

The Löbian obstacle is about *trust* in the *reasoning* performed by a subordinate agent; the promise of subsequent actions taken based on that reasoning is just a pretext for considering that formal problem. So if A constructs B to produce proofs in L, it doesn't matter what B's beliefs are, or even if B has any beliefs; B could just be manipulating these as formal expressions. If you insist that B's beliefs be incorporated into its reasoning, as you seem to want (more on that below), then I'm saying it doesn't matter what extension of L the agent B is equipped with; it can even be an inconsistent extension, as long as only valid L-proofs are produced.

To me, this "ought" in the sentence reads as a prediction (basically an abuse of 'ought', which is in common usage basically because people make is/ought errors). I would prefer to re-phrase as "if phi is provable in L, then the interpretation of phi will be true" or less ambitiously "will probably be true".

Is your proposal that "X is true" should be taken as a statement of X's desirability, instead? Or perhaps X's normativity? That's what it means to put it on the "ought" side, to me. If it means something different to you, we need to start over with the question of what the is/ought divide refers to.

The word "normative" sticks out to me as potential common ground here, so I'll use that language. The specified semantic map determines what is "actually" true, but its content is not *a priori* knowledge. As such, the only way for A's reasoning in L to have any practical value is if A works under the (normative) assumption that provability implies/will imply truth under S.

If this sounds farfetched, consider how police dramas employ the normative nature of truth for dramatic effect on a regular basis. A rogue detective uses principles of reasoning that their superiors deem invalid, so that the latter do not expect the detective's deductions to yield results in reality... Or perhaps the practice of constructive mathematics would be a more comfortable example, where the axiom of choice is rejected in favour of weaker deductive principles. A dedicated constructivist could reasonably make the claim that their chosen fragment of logic determines what "ought" to be true.

However, Tarski's analysis also shows how to build a stronger logic L' which knows the semantic map for L. So if "move to the other side" means represent via more formal axioms (which is what I take you to mean), you seem to be wrong.

A compromise which was suggested to me this past week is to work with a two-level logic. It seems to me disingenuous that "A reasons in L" should mean "actually, A reasons in the stronger logic L' which incorporates all the axioms it needs to ensure that provability in L implies truth", as the Tarskian construction suggests. After all, if A attempts to justify that decision to itself, it seems to me to trigger an infinite regress of more and more powerful logics. Identifying and segregating the axioms which refer to the semantic map S as "meta-logical beliefs" is enough to avoid such an existential crisis, just as we only need to employ PA + Sound(PA,N), where N is our favourite model of the natural numbers, to use PA to reason about those natural numbers.
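For concreteness, one standard way to unfold "Sound(PA,N)" is as a reflection schema over PA; this gloss is mine, offered only to illustrate the two-level picture and the regress it avoids:

```latex
% Reflection schema: one instance for each arithmetic sentence phi.
\mathrm{Prov}_{PA}(\ulcorner \varphi \urcorner) \;\rightarrow\; \varphi
% The two-level theory PA' = PA + (all instances above) is used to reason
% about N, while the proofs themselves are still carried out in PA. By Löb's
% theorem, PA' in turn cannot prove all instances of its *own* reflection
% schema without collapsing into inconsistency; keeping these axioms
% segregated as meta-logical beliefs is what stops that regress from starting.
```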

Justified, perhaps, but also non-existent, right? You say the agent doesn't change its reasoning. So the reasoning is exactly the same as the lobstacle agent from the paper. So it doesn't conclude its own soundness. Right??

You seem exasperated that I'm not incorporating the meta-logical beliefs into the formal system, but this is because *in practice meta-logical beliefs are entirely implicit, and do not appear in the reasoning system used by agents*. If I build a chess-playing robot, its program will not explicitly include the assumption that the rules of chess it carries are the correct ones, nor will it use such assumptions in determining its moves, because such determinations are made without any reference to a semantic map. This is why I said the agent doesn't change its reasoning.
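To illustrate what I mean, here is a minimal hypothetical sketch (not code from any real engine, and the names are invented): move selection consults only the robot's internal rule representation, and nowhere does the claim "these rules are the real rules of chess" appear as a premise.

```python
# Hypothetical sketch: the program only ever refers to its *internal* rules
# object. The assumption that these rules match real-world chess (the semantic
# map) appears nowhere in the code; it lives outside, in how we interpret the
# program's outputs.

def negamax(position, rules, evaluate, depth):
    moves = list(rules.legal_moves(position))       # internal notion of "legal"
    if depth == 0 or not moves:
        return evaluate(position)                   # internal evaluation only
    return max(-negamax(rules.apply(position, m), rules, evaluate, depth - 1)
               for m in moves)

def best_move(position, rules, evaluate, depth=2):
    """Pick the legal move (per the internal rules) with the best evaluation."""
    best, best_score = None, float("-inf")
    for move in rules.legal_moves(position):
        score = -negamax(rules.apply(position, move), rules, evaluate, depth - 1)
        if score > best_score:
            best, best_score = move, score
    return best
```

If the rules object is wrong about chess, the program runs exactly as before; only its relation to the real game changes, which is the earlier point about the robot becoming very good at whatever it implicitly alieves chess to be.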

The metalogical beliefs thus are only relevant when an agent applies its (completed) reasoning to its environment via the semantic map. We can formalize that situation via the extended logic L' if you like, in which case this is probably the answer that you keep demanding:

**Both A and B "reason" in L' (B could even work in a distinct extension of L), but will only accept proofs in the fragment L. Since the axioms in L' extending L are identifiable by their reference to the semantic mapping, there is no risk of "belief contamination" in proofs, if that were ever a concern.**
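As a toy illustration of what "only accept proofs in the fragment L" could mean operationally (hypothetical names and data layout, just to make the acceptance condition concrete): the check is purely syntactic, so nothing about the prover's aliefs can slip through it.

```python
# Hypothetical sketch: A accepts a proof from B iff every axiom invoked belongs
# to L and every inference step uses an L-rule on previously derived lines.
# Axioms of L' that mention the semantic map (e.g. Sound(L,S)) are simply absent
# from L_axioms, so they are rejected; this is the sense in which there is no
# risk of "belief contamination".
# (For brevity this only checks provenance, not that each rule application is
# well-formed.)

def accepts(proof, L_axioms, L_rules):
    derived = set()
    for conclusion, rule, premises in proof:   # each step: (conclusion, rule, premises)
        if rule == "axiom":
            if conclusion not in L_axioms:
                return False
        elif rule not in L_rules or any(p not in derived for p in premises):
            return False
        derived.add(conclusion)
    return True
```

On this picture, A's trust in B reduces to trusting this checker together with A's own meta-logical alief that valid L-proofs are sound.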

I expect there is more I should say to justify how this avoids the obstacle, but I will wait for your feedback first.

Sure, but

logical induction doesn't know anything about the intended semantics. It doesn't make a lick of difference to the algorithm whether you believe that PA refers to the standard model. Nor does it feature in the mathematical results. Thus, logical induction would appear to be precisely what you call for at the end of your post: a theory of rationality which reasons in a semantics-independent way.

Without an intended semantics, the probabilities assigned to formulas can only be interpreted as beliefs/levels of certainty about provability of statements from axioms (which it also believes with more or less certainty). This is great, but as soon as you want your logical inductor to reason about a particular mathematical object, the only way to turn those beliefs about provability into beliefs about truth in the intended model is to extend the inductor (explicitly or implicitly) with meta-logical beliefs about the soundness of the corresponding semantic map, since it will update its beliefs based on provability *even if its proof methods aren't sound*.

As such, I feel you've misunderstood me here. I don't want semantics-independent reasoning at all; if anything, the opposite: reasoning that prioritises verifying the soundness of a logic with respect to specified semantic maps in an empirical way, and which can adapt its reasoning system when soundness is shown to fail. A logical inductor isn't equipped with the capacity to work in mere fragments of its logic (it could be modified to do so, but that's certainly not the main *point* of the algorithm, as I understand it), so it can only go as far as identifying when its reasoning is not sound by observing that it believes a contradiction, without being able to do anything about it besides indefinitely revising the beliefs it holds in its axioms.

Escaping the Löbian Obstacle

I should pre-emptively correct my "formal" argument, since it's not true that S can *never* be in its own codomain; arguably I can construct U so that C(U) contains the names of some semantic maps as elements (although in this purely set-theoretic set-up, it's hard to see how doing so would capture their content). Nonetheless, a diagonalisation argument that depends only on L and C(U) being non-trivial demonstrates that C(U) cannot contain *every* semantic map, which I think should be enough to salvage the argument.

Escaping the Löbian Obstacle

It seems like you just get a new system, L', which believes in the soundness of L, but which doesn't believe in its own soundness. So the agent can trust agents who use L, but cannot trust agents who additionally have the same meta-logical beliefs which allow them to trust L. Meaning, the agent cannot trust itself.

A doesn't need B to believe that the logic is sound. Even if you decide to present "logic L plus metalogical beliefs" as a larger logic L' (and assuming you manage to do this in a way that doesn't lead to inconsistency), the semantic map is defined on L, not on L'. In the practical situation under consideration when arriving at the Löbian Obstacle, the consequence is that A needn't worry about forming beliefs about L' for the purposes of trusting B's (actions based on) proofs in L to be sound.

I suppose if the meta-logical belief is "L plus this very meta-logical belief is sound", yeah.

I didn't understand this remark; please could you clarify?

I don't understand why you would put truth on the "ought" side of the is/ought divide, or if you do, how it helps us out here.

To put soundness in is/ought form, the belief that A must hold is that "if phi is provable in L, the interpretation of phi (via the semantic map) ought to be true". Truth can't be moved to the other side, because as I've tried to explain (perhaps unsuccessfully) the logic doesn't include its own semantics, and it's always possible to take contrarian semantics which fail to be sound applications of L. (see also final point below)

I don't get how "interpreting lobs theorem differently" helps us out here, or, what it really means.

The interpretation which results in the Löbian Obstacle is "Löb's theorem tells us that a logical agent can't trust its own reasoning, because it can't prove that reasoning is sound," and under that interpretation it seems that extreme measures must be taken to make formal logic relevant to AI reasoning, which is counterintuitive since we humans employ formalizable reasoning every day without any such caveats. In this post I'm saying "Löb's theorem reminds us that a logical agent cannot make *a priori* claims about the soundness of its own reasoning, because soundness is a statement about a semantic map, which the logic has no *a priori* control over".

This isn't concrete enough for me. Suppose an agent has just such a mapping, in its back pocket. Seems like the same exact proofs as in the paper go through... so just having such a mapping doesn't "immediately dissolve" the lobian obstacle. I suppose you mean that the agent changes its reasoning somehow. Yet you explicitly state that we don't just add to the list of axioms. So how does the meta-logical belief interact with everything else? Something about the action condition is different?

No, the agent doesn't change its reasoning. Assume that the agent A is reasoning soundly about their environment, which is to say that their semantic mapping is sound. Then A's belief in the soundness of their reasoning is justified. The change is that we don't require A to prove that their semantic mapping is sound, because A cannot do that, and I'm claiming that *this doesn't break anything*.

If you want me to make it more formal, here: suppose I have a logic L and a universe U. For simplicity, let's say U is a set. The semantic mapping is a map from the collection of formulas of L to the collection of concepts in U; it may be the case that symbols in L get mapped to collections of objects, or collections of collections, but for argument's sake we can assume the codomain to be some number of applications of the powerset operation to U, forming a collection C(U) of "concepts". So it's a mapping S: L --> C(U). The crucial thing is that S is *not* a member of C(U), so it can't be the image of any symbol in L under S. That is, S is outside the realm of things described by S, and this is true for any such S! Since "phi is provable in L means phi is true *under S*" is a statement involving S (even if the 'under S' is usually left implicit), it cannot be the interpretation under S of any formula in L, and so cannot be *stated*, let alone proved.

What do you mean when you say that a logical inductor has an implicit assumption of particular fixed semantics?

Most of the logics we encounter in mathematical practice are built with an *intended semantics*. For example, Peano Arithmetic contains a bunch of symbols which are often informally treated as if they are "the" natural numbers, despite the fact that they are no more than formal symbols and that the standard natural numbers are not the only model of the theory. In the context of logic applied to AI, this results in a conflation between the symbols in a logic employed by an agent and the "intended meaning" of those labels. This happens in the logical induction paper when discussing PA: the formulas the agent handles are assumed to carry their intended meaning in arithmetic.

Actually, that's misconstruing the formal results of the paper, since logical inductors have formal systems as their subjects rather than any fixed semantics. However, it's clear from the motivation and commentary (even within the abstract) that the envisaged use-case for inductors is to model an agent forming beliefs about the *truth* of formal statements, which is to say their validity in some specific model/semantics.

I don't know how you so misread what I said; I explicitly wrote that aliefs constitute the larger logic, so that beliefs are contained in aliefs (which I'm pretty sure is what you were going for!) and not vice versa. Maybe you got confused because I put beliefs first in this description, or because I described the smaller of the two logics as the "reasoning engine" (for the reason I subsequently provided)? What you said almost convinced me that our definitions actually align, until I reached the point where you said that beliefs could be "more complicated" than aliefs, which made me unsure.

Anyway, since you keep taking the time to thoroughly reply in good faith, I'll do my best to clarify and address some of the rest of what you've said. However, thanks to the discussion we've had so far, a more formal presentation of my ideas is crystallizing in my mind; I prefer to save that for another proper post, since I anticipate it will involve rejigging the terminology again, and I don't want to muddy the waters further!

Perhaps we have different intuitive notions of trust, since I certainly trust myself (to perform "mission-critical tasks"), at least as far as my own logical reasoning is concerned, and an agent that doesn't trust itself is going to waste a lot of time second-guessing its own actions. So I don't think you've addressed my argument that the definition of trust that leads to the Löbstacle is faulty because it fails to be reflexive.

There are two separate issues here, and this response makes it apparent that you are conflating them. The fact that the second agent in the original Löbstacle paper is constrained to act only once it has produced a provably effective strategy, and is constrained to follow that strategy, means that the Löbstacle only applies to questions concerning the (formal) *reasoning* of a subordinate agent. Whether or not I manage to convince you that the Löbstacle doesn't exist (because it's founded on an untenable definition of trust), you have to acknowledge that the argument as presented there doesn't address the following second problem. Suppose I can guarantee that my subordinate uses reasoning that I believe to be valid. How can I guarantee that it will act on that reasoning in a way I approve of? This is (obviously) a rather general version of the alignment problem. If you're claiming that Löb's theorem has a bearing on this, then that would be big news, especially if it vindicates your opinion that it is "doomed to fail".

The reason I see my post as progress is that currently the Löbstacle is blocking serious research into using simple systems of formal agents to investigate such important problems as the alignment problem.

Taking the revised definition of trust I described, that last sentence is no longer the content of any formal mathematical result in that paper, so I do not agree with it, and I stand by what I said.

Indeed, my claim boils down to saying that there is no problem. But I don't see why that doesn't constitute a solution to the apparent problem. It's like the Missing Dollar Riddle; explaining why there's no problem *is* the solution.

There's no real way for me to know. Everyone who I've spoken to about this in person has gotten it, but that only amounts to a handful of people. It's hard to find an audience; I hoped LW would supply one, but so far it seems not. Hopefully a more formal presentation will improve the situation.