
In 2013, Eliezer Yudkowsky and Marcello Herreshoff published Tiling Agents for Self-Modifying AI, and the Löbian Obstacle. It is worth comprehending because it:

  • Is a very well-written paper.
  • Expresses an idea that is non-obvious, and still relevant to alignment today.
  • Provides insight into what Eliezer and Marcello thought was valuable to work on in the time preceding publication.

When I first read this paper, I terribly misunderstood it. It is not particularly approachable material for someone not well-versed in logic, and I was confidently wrong for at least a month. This post summarizes my understanding of what the Löbian Obstacle is (having re-read the paper), and why I think it is still an important idea one decade after its publication.


An agent $A_1$ occupies a fully-known, deterministic and closed environment. $A_1$ has a goal $G$ that any outcome either satisfies or does not, for which $A_1$'s preference is satisfaction. An action $b$ performed by an agent created by $A_1$, hereafter referred to as $A_0$, must therefore satisfy the statement:

$$\bar{b} \rightarrow \square_{A_0} \ulcorner \bar{b} \rightarrow G \urcorner$$

where $\bar{b}$ denotes the actual performance of $b$, and $\square_{A_0}$ denotes $A_0$'s cognitive belief in the statement that follows it. Even if $A_1$ could verify by inspection of $A_0$'s design that this will hold, as in:

$$\vdash_{A_1} \; \bar{b} \rightarrow \square_{A_0} \ulcorner \bar{b} \rightarrow G \urcorner$$

where $\square_{A_0}\ulcorner \phi \urcorner$ refers to a proof of $\phi$ from the axioms of $A_0$, whether $G$ will actually be satisfied remains unknowable to $A_1$, as knowing it would require:

$$\vdash_{A_1} \; \forall \phi: \; \square_{A_0} \ulcorner \phi \urcorner \rightarrow \phi$$

For this to be so, it would need to be the case that $A_1$ could prove that if some proof of $\phi$ exists in $A_0$, then $\phi$ must be true. Tiling Agents for Self-Modifying AI, and the Löbian Obstacle shows this to be impossible: a formal system cannot prove its own soundness schema.
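To see why, it helps to spell out Löb's theorem in standard provability-logic notation (this presentation is mine, not the paper's). For any sentence $P$ of a theory $T$ that can represent its own provability predicate $\square_T$:

$$\text{if } \vdash_T \big(\square_T \ulcorner P \urcorner \rightarrow P\big), \text{ then } \vdash_T P.$$

Now suppose $T$ proved its own soundness schema, $\forall \phi: \square_T \ulcorner \phi \urcorner \rightarrow \phi$. In particular it would prove the instance at $\phi = \bot$, namely $\square_T \ulcorner \bot \urcorner \rightarrow \bot$, and Löb's theorem would then give $\vdash_T \bot$, i.e. $T$ is inconsistent. So a consistent $T$ cannot prove even that single instance (its own consistency statement), let alone the full schema that $A_1$ would need in order to trust $A_0$'s proofs.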
 


The above is a brief paraphrase of section two of the original paper, which contains many additional details and complete proofs. How the Löbian Obstacle relates to simulators is my current topic of research, and this section will make the case that the obstacle is an important consideration in designing safe simulators.

We should first consider that simulating an agent is not distinguishable from creating one, and that consequently the implications of creating dangerous agents should generalize to their simulation. Hubinger et al. (2023) have stated similar concerns, and provide a more detailed examination of the argument.

It is also crucial to understand that simulacra are not necessarily terminating, and may themselves use simulation as a heuristic for solving problems. This could result in a kind of hierarchy of simulacra. In advanced simulators capable of very complex simulations, we might expect a complex network of simulacra bound by acausal trade and 'complexity theft,' whereby one simulacrum tries to obtain more simulation complexity as a form of resource acquisition or recursive self-improvement.
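To put a rough number on how large such a hierarchy could get (a toy model of my own, not something the simulators framing commits to): if each simulacrum spawns on the order of $k$ sub-simulacra while solving its own problems, and the hierarchy reaches depth $d$, the total count is about

$$\sum_{i=0}^{d} k^{i} = \frac{k^{d+1} - 1}{k - 1} = O(k^{d}),$$

so even modest branching, say $k = 3$ and $d = 10$, already yields tens of thousands of simulacra.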

I expect this to happen. Lower-complexity simulacra may still be more intelligent than their higher-complexity counterparts, and as a simulator's simulacra count grows, potentially exponentially, so does the likelihood that some simulacrum attempts complexity theft.
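As a toy illustration of that last claim (the numbers are mine, and independence is a simplifying assumption): if each of $N$ agentic simulacra independently attempts complexity theft with some small probability $p$, the chance that at least one attempt occurs is

$$1 - (1 - p)^{N},$$

which approaches 1 as $N$ grows. With $p = 10^{-4}$ and $N = 10^{5}$, this is already $1 - (1 - 10^{-4})^{10^{5}} \approx 1 - e^{-10} \approx 0.99995$.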

If we want safe simulators, we need the subsequent, potentially abyssal simulacra hierarchy to be aligned all the way down. Without being able to thwart the Löbian Obstacle, I doubt a formal guarantee of this is attainable. If we could thwart it, we may only need to simulate one aligned tiling agent, for which we might settle for a high-certainty informal guarantee of alignment if short on time. I outlined how I thought that could be done here, although I've advanced the theory considerably since and will post an updated write-up soon.

If we can't reliably thwart the Löbian Obstacle, we should consider alternatives:

  • Can we reliably attain high certainty informal guarantees of alignment for arbitrarily deep simulacra hierarchies?
  • Is limiting the depth of simulacra hierarchies possible?

Comments (6)

I think this is too big-brain. Reasoning about systems more complex than you should look more like logical inductors, or infrabayesian hypotheses, or heuristic arguments, or other words that code for "you find some regularities and trust them a little, rather than trying to deduce an answer that's too hard to compute."

Which part specifically are you referring to as being overly complicated? What I take to be the primary assertions of the post are:

  • Simulacra may themselves conduct simulation, and advanced simulators could produce vast webs of simulacra organized as a hierarchy.
  • Simulating an agent is not fundamentally different from creating one in the real world.
  • Due to instrumental convergence, agentic simulacra might be expected to engage in resource acquisition. This could take the shape of 'complexity theft' as described in the post.[1]
  • The Löbian Obstacle accurately describes why an agent cannot obtain a formal guarantee via design-inspection of its subsequent agent.
  • For a simulator to be safe, all simulacra need to be aligned, unless we can establish some upper bound of the form "programs of this complexity are too simple to be dangerous," at which point we would only need to consider simulacra above that complexity.

I'll try to justify my approach with respect to one or more of these claims, and if I can't, I suppose that would give me strong reason to believe the method is overly complicated.

  1. This doesn't have to be resource acquisition, just any negative action that we could reasonably expect a rational agent to pursue.

I am disagreeing with the underlying assumption that it's worthwhile to create simulacra of the sort that satisfy point 2. I expect an AI reasoning about its successor to not simulate it with perfect fidelity - instead, it's much more practical to make approximations that make the reasoning process different from instantiating the successor.

I expect agentic simulacra to arise without anyone intentionally simulating them: agents are just generally useful for solving prediction problems, and across millions of predictions (as would be expected of a product on the order of ChatGPT, or its future successors) agentic simulacra are likely to appear. Even if these agents are only approximations, predicting the behaviors of approximated agents could still lead to their preferences being satisfied in the real world (as described in the Hubinger post).

The problem I'm interested in is how you ensure that all subsequent agentic simulacra (whether they arise intentionally or otherwise) are safe, which seems difficult to verify formally due to the Löbian Obstacle.

As someone who's barely scratched the surface of any of this, I was vaguely under the impression that "big-brain" described most or all of the theoretic/conceptual alignment in this cluster of things, including e.g. both the Löbian Obstacle and infrabayesianism. Once I learn all these more in-depth and think on them, I may find and appreciate subtler-but-still-important gradations of "galaxy-brained-ness" within this idea cluster.


Layman here 👋
IIUC, we cannot trust the proof of an unaligned simulacrum's suggestion because it is smarter than us.
Would that be a non-issue if verifying the proof is easier than making it?
If we can know how hard it is to verify a proof without actually verifying it, then we can find a safe protocol for communicating with this simulacrum. Is this possible?