Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This text originated from a retreat in late 2018, where researchers from FHI, MIRI and CFAR did an extended double-crux on AI safety paradigms, with Eric Drexler and Scott Garrabrant at the core. Over the past two years I tried several times to make it easier to understand, but empirically it still seems quite inadequate. As it seems unlikely that I will have time to invest further work into improving it, I'm publishing it as it is, in the hope that someone else will understand the ideas even in this form and describe them more clearly.

The box inversion hypothesis consists of the following two propositions:

  1. There exists something approximating a duality / an isomorphism between technical AI safety problems in the Agent Foundations agenda and some of the technical problems implied by the Comprehensive AI Services framing
  2. The approximate isomorphism holds between enough properties that some solutions to the problems in one agenda translate to solutions to problems in the other agenda

I will start with an apology: I will not try to give my one-paragraph version of Comprehensive AI Services. It is an almost 200-page document, conveying dozens of models and intuitions, and I don’t feel I am the best person to give a short introduction. So I will just assume familiarity with CAIS. I will also not try to give my short version of the various problems which broadly fit under the Agent Foundations agenda, as I assume most readers are already familiar with them.

0. The metaphor: Circle inversion

People who think geometrically rather than spatially may benefit from looking at a transformation of a plane called circle inversion first. A nice explanation is here - if you have never met the transformation, pages 1-3 of the linked document should be enough. 

You can think about the “circle inversion” as a geometrical metaphor for the “box inversion”. 
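
For readers who prefer code to diagrams, here is a minimal sketch of the transformation, assuming the standard definition of circle inversion: a point at distance d from the center is sent to the point on the same ray at distance r²/d. The function name `invert` and its parameters are illustrative choices, not taken from the linked explanation.

```python
def invert(point, center=(0.0, 0.0), radius=1.0):
    """Invert `point` in the circle with the given center and radius.

    A point at distance d from the center lands on the same ray from the
    center, at distance radius**2 / d. The center itself has no image.
    """
    dx = point[0] - center[0]
    dy = point[1] - center[1]
    d2 = dx * dx + dy * dy  # squared distance from the center
    if d2 == 0:
        raise ValueError("the center of inversion has no image")
    scale = radius * radius / d2
    return (center[0] + scale * dx, center[1] + scale * dy)


inside = (0.5, 0.0)
outside = invert(inside)        # a point inside the circle moves outside
back = invert(outside)          # inverting twice returns the original point
on_circle = invert((0.0, 1.0))  # points on the circle stay where they are
```

The three properties to notice are that the inside and the outside of the circle swap places, that the circle itself stays fixed, and that applying the map twice gets you back where you started.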

1. The map: Box inversion

The central claim is that there is a transformation between many of the technical problems in the Agent Foundations agenda and those in CAIS. To give you some examples:

  • problems with daemons <-> problems with molochs
  • questions about ontologies <-> questions about service catalogues
  • manipulating the operator <-> addictive services
  • some “hard core” of safety (tiling, human-compatibility, some notions of corrigibility) <-> defensive stability, layer of security services
  • ...

The claim of the box inversion hypothesis is that this is not a set of random anecdotes, but that there is a pattern, pointing to a map between the two framings of AI safety. Note that the proposed map is not exact, and also is not a trivial transformation like replacing "agent" with "service".

To explore two specific examples in more detail:

In the classical "AI in a box" picture, we are worried about the search process creating some inner misaligned part: a sub-agent with misaligned objectives.

In the CAIS picture, one reasonable worry is that the evolution of the system of services falls into the basin of attraction of a so-called moloch: a set of services which has emergent agent-like properties and misaligned objectives.

Regarding some properties, the box inversion turns the problem “inside out”: instead of sub-agents the problem is basically with super-agents. 

Regarding some abstract properties, the problem seems similar, and the only difference is where we draw the boundaries of the “system”.  

2. Adding nuance

Using the circle inversion metaphor to guide our intuition again: some questions are transformed into exactly the same questions. For example, the question of whether two circles intersect is invariant under circle inversion. Similarly, some safety problems stay the same after the "box inversion".
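
The invariance of intersection can be checked numerically. The sketch below assumes inversion in a circle of radius R centered at the origin, and uses the standard fact that a circle not passing through the origin is mapped to another circle; the helper names `invert_circle` and `circles_intersect` are hypothetical, chosen only for this illustration.

```python
import math


def invert_circle(center, radius, R=1.0):
    """Image of a circle under inversion in the circle of radius R about the origin.

    Assumes the circle does not pass through the origin (in that case the
    image would be a straight line, which this sketch does not handle).
    """
    cx, cy = center
    power = cx * cx + cy * cy - radius * radius  # power of the origin w.r.t. the circle
    if power == 0:
        raise ValueError("circle passes through the origin; its image is a line")
    k = R * R / power
    return (cx * k, cy * k), radius * abs(k)


def circles_intersect(c1, r1, c2, r2):
    """True if the two circles (as curves) share at least one point."""
    d = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    return abs(r1 - r2) <= d <= r1 + r2


a = ((3.0, 0.0), 1.0)
b = ((4.0, 0.0), 1.0)  # overlaps a
c = ((6.0, 0.0), 1.0)  # disjoint from a

a_img, b_img, c_img = invert_circle(*a), invert_circle(*b), invert_circle(*c)

# The answer to "do they intersect?" is the same before and after inversion.
same_for_ab = circles_intersect(*a, *b) == circles_intersect(*a_img, *b_img)
same_for_ac = circles_intersect(*a, *c) == circles_intersect(*a_img, *c_img)
```

The positions and sizes of the circles change a lot under the map, but the yes/no answer to the intersection question does not, which is the sense of "invariant" used here.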

This may give the incorrect impression that the two agendas are actually exactly the same technical agenda, just stated in different language. This is not the case: often, the problems are the same in some properties, but different in others. (Loosely speaking, there is something like a partial isomorphism, which does not hold between all properties. Someone familiar with category theory could likely express this better.)

It is also important to note that apart from the mapping between problems, there are often differences between CAIS and AF in how they guide our intuitions on how to solve these problems. If I try to informally describe the intuitions:

  • CAIS is a perspective which is rooted in engineering, physics and continuity, a "continuum" perspective
  • Agent foundations feel, at least for me, more like coming from science, mathematics, and a "discrete/symbolic" perspective

(Note that there is also a deep duality between science and engineering, there are several fascinating maps between "discrete/symbolic" and "continuum" pictures, and there is an intricate relation between physics and mathematics. I hope to write more on that, and on how it influences various intuitions about AI safety, in some other text.)

3. Implications

As an exercise, I recommend taking your favourite problem in one of the agendas and trying to translate it to the other agenda via the box inversion.

Overall, if true, I think the box inversion hypothesis provides some assurance that the field as a whole is tracking real problems, and that some seemingly conflicting views are actually closer than they appear. I hope this connection can shed some light on some of the disagreements and "cruxes" in AI safety. From the box inversion perspective, these disagreements sometimes look like arguing about whether things are inside or outside the circle of symmetry, in a space which is largely symmetric under circle inversion.

I have some hope that some problems may be more easily solvable in one view, similarly to various useful dualities elsewhere. At least in my experience, many people find it much easier to see a specific problem in one of the perspectives than in the other.

4. Why the name

In one view, we are worried that the box, containing the wonders of intelligence and complexity, will blow up in our face. In the other view, we are worried that the box, containing humanity and its values, with the wonders of intelligence and complexity outside, will come crashing down on our heads.

4 comments

Maybe roles - or something like that - are the connecting element. 

Disclaimer: I'm not too familiar with either AF or CAIS, being just an LW regular.

I have been thinking about the unsolved principal-agent problem (PAP) for quite a while, both theoretically, as a solution to the AI alignment problem, and practically, as I work as CTO of a growing company and we have a growing number of agents that need alignment ;-)

It appears that companies have mostly found relatively reliable ways to solve the PAP in practice. Methods are taught and used by MBAs. There is no mathematical theory that explains PAP - it seems more like engineering to me. Social engineering if you want. In my own management, I want to apply evidence-based methods and I hoped to find clear proven methods. I read management advice with an eye on possible mathematical principles. I don't claim I have found any but I am building an intuition of what it could be.

Key elements are roles and processes. You will hear a lot that you need to have them. But what actually is a role? Or a process? Where does it come from? How is it established? I have established a few processes in our growing startup, always wondering what I'm doing, and always trying to notice and make explicit what caused the change: noticing phase transitions in growth, and how, with a growing number of agents, existing rules stop working (or start working, or rather become efficient compared to the alternatives). A lot of why this works is based on common knowledge, and on creating it or using it.

What does that mean for the box inversion? I tried to apply the intuitions I have built to the box inversion hypothesis. My proposal is that the connecting element could be something like roles. When an agent delegates something to a sub-agent (as in AF), "delegating" means an expectation to conform to a role. In CAIS it is the other way around: a lot of participants find themselves in roles of the system and push against that.

Not sure any of this makes sense, and for sure there is no hidden analogy to physics or anything like that. Just my 2ct.

I would love to see (and contribute to, if you want to collaborate) a post on "what are roles and processes" in terms of human organizations, and how it might apply to agent alignment topics.  I spend a lot of my time and energy at work (Principal Engineer at a very large company; somewhat similar to CTO of a 150-person division) in formalizing and encouraging people and teams to adopt processes and to understand the roles they need to embrace in order to have the (positive) impact we all want.

There's an interesting mix in this work - some of it is identifying goals we share and looking for ways to measure and improve at furthering them.  But some of it is normalizing the goals themselves - not exactly "alignment", but "finding and formalizing of mutually-beneficial utility trades".   These are visible, causal trades - nothing fancy except that they're rarely encoded as actual written agreements - they're informal beliefs within the employees' heads, based on implicit relationships between teams or with customers.

I share the intuition that this correspondence exists, and it's been present in the background of my thoughts for at least five years. Thanks for putting it into words!

I'm not sure if you mean a true mathematical mapping, or a conceptual mapping with the math as an analogy only.  If the former, this should perhaps be a sphere (or hypersphere) inversion, rather than a box.  If the latter, are there aspects of the circle (or sphere or box) mathematical definition that you want to preserve, in order to clarify other aspects?

For instance, does the un-mappable nature of the origin have meaning in these mappings?  Does the fact that outside distances are non-linearly related to inside distances (inside things near the center are close together, but map to things far apart outside, and inside things near the edge stay roughly the same distance from each other in the outside mapping) mean something in this model?