Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As the Conceptual Boundaries Workshop (website) is coming up, and now that we're also planning the Mathematical Boundaries Workshop in April, I want to get more clarity on what exactly it is that you want out of «boundaries»/membranes.

So I just want to check: Is your goal with boundaries just to formalize a moral thing? 

I'll summarize what I mean by that:

  • Claim 1: By "boundaries", you mean "the boundaries around moral patients, namely humans".
    • Claim 1b: And to some degree also the boundaries around plants and animals. Also maybe nations, institutions, and other things.
  • Claim 2: If we can just 
    • (i) locate the important boundaries in the world, and then 
    • (ii) somehow protect them, 
    • Then this gets at a lot (but not all!) of what the "safety" in "AI safety" should be. 
  • Claim 3: We might actually be able to do that↑.
    • e.g.: Markov blankets are a natural abstraction for (2.i).
  • Claim 4: Protecting boundaries won't be sufficient for all of "safety" and there are probably also other (non-boundaries) specifications/actions that will also be necessary.
    • For example, we would probably also need to separately specify some things that aren't obviously contained by the boundaries we mean, e.g.: "clean water", "clean air", and a tractably small set of other desiderata.
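To make Claim 3's Markov-blanket suggestion for step (2.i) a bit more concrete, here is a minimal, purely illustrative sketch (the graph, node names, and `markov_blanket` function are my own toy construction, not anything from the workshop): in a causal DAG, the Markov blanket of a node is its parents, its children, and its children's other parents, and one could treat that set as a candidate "boundary" around the node.

```python
# Hypothetical sketch: the Markov blanket of a node in a causal DAG as a
# candidate formalisation of step (2.i), "locate the important boundaries".
# The graph below is a made-up toy, not a model from this dialogue.

def markov_blanket(node, parents):
    """parents maps each node to the set of its direct causes."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = set()
    for c in children:
        co_parents |= parents[c]
    # The blanket: everything that screens the node off from the rest.
    return (parents[node] | children | co_parents) - {node}

# Toy causal graph: environment -> sensors -> agent -> actions -> effects
dag = {
    "environment": set(),
    "sensors": {"environment"},
    "agent": {"sensors"},
    "actions": {"agent"},
    "world_effects": {"actions", "environment"},
}

print(sorted(markov_blanket("agent", dag)))  # → ['actions', 'sensors']
```

Conditioned on its blanket {sensors, actions}, the "agent" node is independent of the rest of the graph, which is one way of saying the blanket draws a boundary around it.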

Here are my questions for you:

  • Q1: Do you agree with each of the claims above?
  • Q2: Is your goal with boundaries just to formalize the moral/safety thing, or is there anything else you want from boundaries?

Past context that's also relevant for readers:

Q3: It seems that Garrabrant, Critch, and maybe others want different things from you, and I'm wondering if you have thoughts about that.

  • Garrabrant: From talking to him I know that he's thinking about boundaries too, but more about boundaries in the world as instruments to preserve causal locality, predictability, evolution, etc. But this is quite different from talking specifically about the boundaries around agents.
  • Critch: I haven't spoken to him yet, but I think you once told me that Critch seems to be thinking about boundaries more in terms of ~"just find the 'boundary protocol' and follow it and all cooperation with other agents will be safe". Is this right? If so, this seems closer to what you want, but still kinda different.
  • TJ: I think TJ has some other ideas that I am currently unable to summarize.

Claim 1+1b: yes, to first order.  [To second order, I expect that the general concept of things with «boundaries» will also be useful for multi-level world-modelling in general, e.g. coarse-graining fluid flow by modelling it in terms of cells that have boundaries on which there is a net flow, and that it might be a good idea to "bake in" something like a concept of boundaries to an AI system's meta-ontology, so that it has more of a tendency to have moral patients among the entities in its object-level ontology. But my mainline intention is for the object-level ontology to be created with humans in the loop, and the identification of entities with boundaries could perhaps be just as easily a layer of interpretation on top of an ontology with a more neutral meta-ontology of causation. Thinking through both routes more is at the frontier of what I consider "conceptual «boundaries» research".]


Claim 2: agreed.
Claim 3: agreed.
Claim 4: agreed.


Q2: yes, my ultimate goal with «boundaries» is just to formalise injunctions against doing harm, disrespecting autonomy, or (at the most ambitious) excluding humans from cooperation. (I am borrowing the pluralism of Garrett Cullity's Concern, Respect, & Cooperation in separating those three cases.) However, as discussed in my response to Claim 1, there might also be bonus benefits where incorporating «boundaries» at a meta-ontological level (prior to the moral/normative/axiological specifications) makes multi-scale world-modelling go better (and/or makes the models more amenable to using «boundaries» to formalise the injunctions).


Q3: Your models seem roughly right to me. (I, too, consider attempting to summarize TJ's ideas as a bit of a risky proposition, as they have a lot of nuance that is easy to get wrong. Perhaps we could invite them to join the dialogue.) I have a lot of respect for all three people. My current view is that:

  • Critch's full vision of «boundary protocols», if it works out, would be strictly superior to my current vision of «boundaries», but I don't currently see how to fill in the mathematical details of «boundary protocols». However, given that the mathematical details of «boundaries» (simpliciter) are due to Critch, I am keen to find out if he (perhaps with the support of other workshop participants) can write down some indications about how to formalise «boundary protocols» too.
  • My understanding from past conversations with Garrabrant is that he wants to remove time from the ontological dependencies for defining «boundaries». The notion of time used in the current formalism is a global discrete clock, which is clearly inadequate. I tend to prefer a partial order defining causal precedence, which is better and may even be adequate for my purposes. Garrabrant's perspective on causation and time, which incorporates insights from logical induction, is clearly even better, and I would not be shocked if it somehow turns out to be crucially necessary. It is even harder to work out how to define «boundaries» in that setting, but surely worth spending some time on. 
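To make the clock-vs-partial-order contrast in the Garrabrant bullet concrete, here is a minimal hypothetical sketch (the event names and `precedes` function are mine, not from any of the formalisms discussed): causal precedence as the transitive closure of a direct-cause relation. There is no global clock, so some pairs of events are simply incomparable.

```python
# Hypothetical sketch of replacing a global discrete clock with a partial
# order of causal precedence: two events are ordered only if one can
# influence the other; otherwise they are incomparable ("spacelike").

def precedes(causes, a, b):
    """True iff event a causally precedes event b.

    causes maps each event to the set of its direct causes; precedence is
    the transitive closure of that relation."""
    frontier = {b}
    seen = set()
    while frontier:
        e = frontier.pop()
        if e in seen:
            continue
        seen.add(e)
        if a in causes.get(e, set()):
            return True
        frontier |= causes.get(e, set())
    return False

# Toy event structure: e1 -> e2 -> e4, and e1 -> e3.
causes = {"e2": {"e1"}, "e3": {"e1"}, "e4": {"e2"}}

assert precedes(causes, "e1", "e4")      # ordered: e1 can influence e4
assert not precedes(causes, "e2", "e3")  # incomparable: no clock ranks them
```

A global discrete clock would force every pair of events into a total order; the partial order leaves e2 and e3 unordered, which is the structure a «boundaries» definition would have to live in under this proposal.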

Okay, I'll summarize what I learned from your messages:

  • re Q1 Claim 1 and Q2- 
    • You think it might also be helpful to have boundaries as a ~primitive in the OAA's multi-level world-model.
  • re Q2- 
    • I liked this: "formalise injunctions against doing harm, disrespecting autonomy, or (at the most ambitious) excluding humans from cooperation". 
    • Hm, I understand how boundaries help with preventing harm and preventing disrespect to autonomy. But I don't immediately understand how boundaries help with preventing "excluding humans from cooperation". I'll have to ask about that.
  • re Q1, Q2, Q3- 
    • You roughly agree with the other stuff I wrote.
  • re Q3- 
    • Garrabrant has some inquiries about logical time vs physical time, wrt boundaries. 
    • Hm, I don't understand this in detail. But I also don't feel inclined to dig into it right now.

New questions I have for you:

  • Q2.1: Could you give one short canonical example of how boundaries might possibly help prevent "excluding humans from cooperation"? 
  • Q3.1: Could you give one short canonical example of what the boundary protocol thing is and how it would be good? 
  • Q3.2: You said "given that the mathematical details of «boundaries» (simpliciter) are due to Critch", but I'm not sure this is true… Why do you say this? 
    • Context: I assume you're referring to his boundaries 3a post. 
      • (Note for the reader: I summarize some of that post here.)



Q2.1: e.g. using boundaries as a basis for identifying humans in a world, and then using that as the basis for defining a cooperative game in which the humans are treated as player agents, even if the AI system has a good enough model of the internals that it more natively sees humans as NPCs. I haven't thought much about this direction, though.
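A deliberately minimal sketch of this Q2.1 idea, with all names hypothetical: whatever entities a boundary detector picks out are exactly the ones promoted to players in the game, regardless of how well the system could model their internals.

```python
# Hypothetical sketch for Q2.1: entities flagged by a boundary detector are
# "promoted" from world objects (NPCs) to players in a cooperative game,
# even if the system could model their internals directly.

from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    has_boundary: bool  # e.g. flagged by some Markov-blanket-style criterion

def players_of(world):
    """Only boundary-bearing entities are treated as sovereign players."""
    return [e.name for e in world if e.has_boundary]

world = [Entity("alice", True), Entity("rock", False), Entity("bob", True)]
print(players_of(world))  # → ['alice', 'bob']
```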


Q3.1: A canonical example is that a person may consent to have their physical boundary invaded by a surgeon's blade for the purpose of excising a cancerous tumour. Without boundary protocols, there would be nothing these people could do to stop an overzealous boundary-protector from preventing the blade from puncturing the skin.


Q3.2: Yes, I am.


Q2.1: Oh, that's neat! I've also had thoughts in a similar direction. 

(I.e.: other agents are only sovereign from your perspective to the extent that you lack information about their internal state. If you had a high-res brain scan of someone and the ability to simulate it, they would not be sovereign from your perspective, and the boundaries abstraction falls apart.)


Q3.1: ah ok


Q3.2: ok


That's all of my questions right now. I'll just publish?




Oh, I forgot to mention: this should also all be considered in the context of Davidad's Open Agency Architecture (OAA), his big end-to-end AI safety plan, for which he has ~£50M from the UK government over the next 3 years. Programme thesis.


I think it's worth noting that, in addition to boundaries where one might expect a single intuitive boundary to exist (like the above example of identifying human agents or consenting to tumor excision), there might be types of boundaries where one of multiple potential boundaries must be chosen via subjective preference (for instance, I think one could use boundaries to formalize property rights or the sovereignty of communities or municipalities, but in practice those are both initially constrained by culture, local laws, and/or customs, and therefore vary wildly).

Accounting for the potential of subjective boundaries has some nice properties:

  • You can still use boundaries if more than one intuitive emergent boundary definition can work.
  • It's easier to adapt boundaries to existing policy environments, which would facilitate deployment of an open agency architecture (i.e. you can simply build and deploy systems to match current legally protected boundaries, rather than needing to adjust or only enforce a subset of the norms/laws that can be derived objectively).

I wouldn't be surprised if workshop participants believe this is out of scope. My guess is that emergent optimal boundaries may be sufficient for all use cases related to the preservation of life, which might be all that are considered in the workshop, but I think it's pretty easy to think of corner cases where preferences are relevant to answer questions, like

  • "When does a fetus/baby get its own boundary?", or
  • "if bodily boundaries are based on the continuation of life/metabolism, is euthanasia always a boundary violation?", or
  • "what toxins and in what quantity can I feed someone before it's a boundary violation?" (Is the CO2 from my breath okay? A little ethanol? A lot? -- feels hard to come up with a single, universal compelling heuristic to me, but that probably speaks more to my limits than the solution space.)