It might be possible to use membranes/boundaries to formalize what it means for an agent/moral patient to be safe. This post tells one (just one) story for how the membranes idea could be useful for thinking about existential risk and AI safety.

Formalizing “safety” using agent membranes

A few examples:

  • A bacterium uses its membrane to protect its internal processes from external influences.

[AI-generated image: microscope view of an amoeba]

  • A nation maintains its sovereignty by defending its borders.

  • A human protects their mental integrity by selectively filtering the information that flows in and out of their mind.

[AI-generated image: a brain surrounded by a fence]

A natural abstraction for agent safety?

Agent boundaries/membranes seem to be a natural abstraction representing the safety and autonomy of agents.

  • A bacterium survives only if its membrane is preserved.
  • A nation maintains its sovereignty only if its borders aren’t invaded.
  • A human mind maintains mental integrity only if it can hold off informational manipulation.

Maybe the safety of agents could be largely formalized as the preservation of their membranes.

Distinct from preferences!

Boundaries are also cool because they show a way to respect agents without needing to talk about their preferences or utility functions. Andrew Critch has said the following about this idea:

my goal is to treat boundaries as more fundamental than preferences, rather than as merely a feature of them.  In other words, I think boundaries are probably better able to carve reality at the joints than either preferences or utility functions, for the purpose of creating a good working relationship between humanity and AI technology («Boundaries» Sequence, Part 3b)

For instance, respecting the boundary of a bacterium would probably mean “preserving or not disrupting its membrane” (as opposed to knowing its preferences and satisfying them).

Protecting agents and infrastructure

By formalizing and preserving the important boundaries in the world, we could be in a better position to protect humanity from AI threats. Examples:

 
  • It may be possible to specify or agree on a minimal “membrane” for each agent/moral patient humanity values, such that when each membrane is preserved, that agent largely stays safe and maintains its autonomy over the inside of its membrane.
    • If your physical boundary isn't violated, you don't die. If your mental boundary isn't violated, you aren't manipulated. Etc…
    • Note: if your membrane is preserved, this just means that you stay safe and that you maintain autonomy over everything within your membrane. It does not necessarily mean that actively positive outcomes occur in the outside world. This is all about bare-minimum safety.
    • See this thread below and davidad's comment
  • Similarly, it may be possible to formalize and enforce the boundaries of physical property rights.

This is for safety, not full alignment

Note that this is only about specifying safety, not full alignment.

See: Safety First: safety before full alignment, and the deontic sufficiency hypothesis.

Caveats

I don't think the absence of membrane piercing formalizes all of safety, but I think it gets at a good chunk of what "safety" should mean. Future thinking will have to determine what more is required. 

What are examples of violations of agent safety that do not involve membrane piercing? 

Markov blankets

How might membranes/boundaries be formalized mathematically? Markov blankets seem to be a fitting abstraction. 

[Diagram: a causal graphical model with the agent's internals, the boundary/membrane, and the environment as separate nodes.]

Notice that there are no arrows directly between the agent and its environment. Ideally, all influence from one to the other flows through the boundary/membrane (e.g.: your skin).
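In symbols, the blanket condition might be sketched as follows (a rough gloss, using $A$ for the agent's internal state, $E$ for the environment, and $B$ for the boundary; the discrete-time notation is a simplification of mine, not taken from the sources linked below):

$$P(A_{t+1} \mid B_t, E_t) = P(A_{t+1} \mid B_t), \qquad P(E_{t+1} \mid B_t, A_t) = P(E_{t+1} \mid B_t)$$

That is, conditional on the boundary, the environment carries no further information about the agent's internals, and vice versa.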

In which case,

  • Infiltration of information across this Markov blanket measures membrane piercing, and low infiltration indicates the absence of such piercing. (A toy sketch of one such measure follows below.)
  • (And it may also be useful to keep track of exfiltration across the Markov blanket?[1])
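As a toy illustration, infiltration might be operationalized as the conditional mutual information $I(A; E \mid B)$, where $A$ is the agent's internal state after one time step and $E$ and $B$ are the prior environment and boundary states; it is zero exactly when the blanket is intact. (This is a sketch under assumptions of mine: the binary toy model, the `leak` parameter, and the choice of conditional mutual information as the measure are illustrative, not definitions from the «Boundaries» sequence.)

```python
import numpy as np

def conditional_mutual_information(joint):
    """I(A; E | B) in bits, given a joint distribution p[a, e, b]."""
    p_b  = joint.sum(axis=(0, 1))   # p(b)
    p_ab = joint.sum(axis=1)        # p(a, b), environment summed out
    p_eb = joint.sum(axis=0)        # p(e, b), agent summed out
    cmi = 0.0
    for a, e, b in np.ndindex(*joint.shape):
        p = joint[a, e, b]
        if p > 0:
            cmi += p * np.log2(p * p_b[b] / (p_ab[a, b] * p_eb[e, b]))
    return cmi

def build_joint(leak):
    """p(a, e, b) for a binary toy model: the boundary B noisily senses the
    environment E, and the agent's next state A copies the boundary, except
    that with probability `leak` the environment writes to A directly,
    bypassing the boundary (a membrane "piercing")."""
    joint = np.zeros((2, 2, 2))
    for e in (0, 1):                              # p(e) = 1/2
        for b in (0, 1):
            p_b_given_e = 0.9 if b == e else 0.1  # boundary senses E
            for a in (0, 1):
                p_a = (1 - leak) * (a == b) + leak * (a == e)
                joint[a, e, b] = 0.5 * p_b_given_e * p_a
    return joint

print(conditional_mutual_information(build_joint(leak=0.0)))  # ~0 bits: blanket intact
print(conditional_mutual_information(build_joint(leak=0.5)))  # >0 bits: infiltration
```

Exfiltration would be the mirror image: the same quantity with the roles of agent and environment swapped across the time step.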

For more details, see the distillation Formalizing «Boundaries» with Markov blankets.

Also, there are probably other information-theoretic measures that are useful for formalizing membranes/boundaries.

Protecting agent membranes/boundaries

See: Protecting agent boundaries.


Subscribe to the boundaries/membranes LessWrong tag to get notified of new developments.

Thanks to Jonathan Ng, Alexander Gietelink Oldenziel, Alex Zhu, and Evan Miyazono for reviewing a draft of this post.

  1. ^

    Exfiltration, i.e., privacy and the absence of mind-reading. But I need to think more about this. Related: the section “Maintaining Boundaries is about Maintaining Free Will and Privacy” by Scott Garrabrant.

Comments (45)

Good stuff.

What's "piercing"?

  1. Is it piercing a membrane if I speak and it distracts you, but I don't touch you otherwise?
  2. What about if I destroy all your food sources but don't touch your body?
  3. What if you're dying and I have a cure but don't share it?
  4. What if I enclose your house completely with concrete while you're in it?
  5. How about if I give you food you would have chosen to buy anyway, but I give it to you for free?
  6. What about if I offer you a bad trade I know you'll choose to make because of an ad you just saw?
  7. What about if I'm the one showing you an ad rather than simply being in the right place at the right time to take advantage of someone else's ad?

These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To "pierce" a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the boundary).

So, to your particular cases:

  1. Probably not. There is no fundamental difference between sound and contact. Rather, the fundamental difference is between the usual flow of information through the senses and other flows of information that are possible in the concrete physical world-model but not represented in the abstraction. An interaction that pierces the membrane is one which breaks the abstraction barrier of perception. Ordinary speech acts do not. Only sounds which cause damage (internal state changes that are not well-modelled as mental states) or which otherwise exceed the "operating conditions" in the state space of the «boundary» layer (e.g. certain kinds of superstimuli) would pierce the «boundary».
  2. Almost surely not. This is why, as an agenda for AI safety, it will be necessary to specify a handful of constructive goals, such as provision of clean water and sustenance and the maintenance of hospitable atmospheric conditions, in addition to the «boundary»-based safety prohibitions.
  3. Definitely not. Omission of beneficial actions is not a counterfactual impact.
  4. Probably. This causes prediction error because the abstraction of typical human spatial positions is that they have substantial ability to affect their position between nearby streets by simple locomotory action sequences. But if a human is already effectively imprisoned, then adding more concrete would not create additional/counterfactual prediction error.
  5. Probably not. Provision of resources (that are within "operating conditions", i.e. not "out-of-distribution") is not a «boundary» violation as long as the human has the typical amount of control of whether to accept them.
  6. Definitely not. Exploiting behavioural tendencies which are not counterfactually corrupted is not a «boundary» violation.
  7. Maybe. If the ad's effect on decision-making tendencies is well modelled by the abstraction of typical in-distribution human interactions, then using that channel does not violate the «boundary». Unprecedented superstimuli would, but the precedented patterns in advertising are already pretty bad. This is a weak point of the «boundaries» concept, in my view. We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all: any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected. Another approach is Mariven's criterion for deception, but applying this criterion requires modelling human mental states as beliefs about the world (which is certainly not 100% scientifically accurate). I would like to see more work here, and more different proposed approaches.

Definitely not. Omission of beneficial actions is not a counterfactual impact.

You're sure this is the case even if the disease is about to violate the «boundary» and the cure will prevent that?

We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all

Unfortunately this is probably not on the table, as such systems are already being used as weapons in economic warfare between the USA, China, and everyone else (TikTok, for instance, is primarily educational inside China). Advertisers have a direct incentive to violate membranes. We need a way to use «membranes» that will, on the margin, help protect against anyone violating them, not just avoid violating them itself.

He says a bit in this direction; see my other comment.

Here's a tricky example I've been thinking about:

Is a cell getting infected by a virus a boundary violation?

What I think makes this tricky is that viruses generally don't physically penetrate cell membranes. Instead, cells just "let in" some viruses (albeit against their better judgement). 


Then once you answer the above, please also consider:

Is a cell taking in nutrients from its environment a boundary violation?

I don't know what makes this different from the virus example (at least as long as we're not allowed to refer to preferences).

any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected.

I want to give a big +1 on preventing membrane piercing not just by having AIs respect membranes, but also by using technology to empower membranes to be stronger and better at self-defense.

Thanks for writing this! I largely agree (and the rest I need to think more about)

Edit: just see Davidad's comment

Hmmm. It's becoming apparent to me that I don't want to regard membrane piercing as a necessarily objective phenomenon. Membrane piercing certainly isn't always visible from every perspective.

That said, I think it's still possible to prevent "membrane piercing", even if whether it occurred can be somewhat subjective. 

Responding to some of your examples:

Is it piercing a membrane if I speak and it distracts you, but I don't touch you otherwise?

Again: I don't actually care so much about whether this is or isn't a membrane piercing, and I don't want to make a decision on that in this case. Instead, I want to talk about what actions taken by which agents make the most sense for preventing the outcome if we do consider it to be a membrane piercing.

In most everyday cases, I think the best answer is: "if someone's actions are supposedly distracting you, you shouldn't blame anyone for distracting you; you should just get stronger and become less distractible". I believe this because it can be really hard to know other agents' boundaries, and if you just let other agents tell you your boundaries, you can get mugged too easily.

However, in some cases self-defense is in fact insufficient, and usually in those cases we as a society collectively agree that, e.g., "no one should blow an airhorn in your ear; in that case we're going to blame the person who did it".

What about if I destroy all your food sources but don't touch your body?

It depends on how far out the membranes extend. For example, if the membranes go so far out as to include property rights, then this could be addressed.

What if I enclose your house completely with concrete while you're in it?

Again, it depends on how far out we go with the membranes; in this case, probably on how much of the law is included.

It depends on how far out the membranes extend. For example, if the membranes go so far out as to include property rights, then this could be addressed.

I sort of agree, but my food sources are not my property, they're a farmer's property.

I edited numbers into my questions; could you edit your response to number and address each one?

It seems somewhat easy to think of examples of ways to harm an agent without piercing its membrane, e.g. killing its family, isolating it, etc. The counter-thought would be that there are different dimensions of the membrane that extend over parts of the world. For example, part of my membrane extends over the things I care about, and things that affect my survival.

The question then becomes how to quantify these different membranes, and, when interacting with other systems, how those systems can be helpful to you without harming or disturbing these other membranes.

the family unit as an agent has an aggregate meta-membrane though, doesn't it? this is why I'd expect to need an integral over possible membranes, and the thing we'd need to do to make this perspective useful is find some sturdy, causal-graph information-theoretic metric that reliably identifies agents. compare Discovering Agents

kill its family

Huh interesting. 

To be clear, I think this probably emotionally harms most humans, but ultimately that's up to whatever interpretations and emotional beliefs that person has (which are all flexible, in principle).

The counter-thought would be that there are different dimensions of the membrane that extend over parts of the world. For example, part of my membrane extends over the things I care about, and things that affect my survival.

Yes

The question then becomes how to quantify these different membranes, and, when interacting with other systems, how those systems can be helpful to you without harming or disturbing these other membranes.

If I understand what you're saying (and I may not)- yes, this

if I and my friend work together well, aren't we an aggregate being that has a new membrane that needs protecting?

from the post:

a minimal “membrane” for each agent/moral patient humanity values

Many membranes (i.e., many possible Markov blankets, if you could observe them all) are not valued, empirically.

right, which translates into: it's not a uniform integral; there's some sort of weighting.

but I don't retract my argument that the moral value of my relationship with my friend means that, when my friend and I act together as a friendship, the friendship has a membrane. How familiar are you with social network analysis? if not very, I'd suggest speedwatching at least the first half hour of https://www.youtube.com/watch?v=2ZHuj8uBinM which should take 15m at 2x speed. I suggest this because of the way the explanation and the visual intuitions give a framework for reasoning about networks of people.

we also need to somehow take into account cases where membranes dissipate but this isn't a failure according to the individual beings involved.

could you restate your argument plainly? i missed it

groups working together in an interaction have an aggregate meta-membrane: for any given group participating in an interaction of some kind, the fact of their connection is a property of {the mutual information of some variables about their locations and actions or something} that makes them act as a semicoherent shared being, and we call that shared being "a friendship", "a romance", "a family", "a party", "a talk", "an event", "a company", "a coop", "a neighborhood", "a city", etc etc etc. each of these will have a different level of membrane defense depending on how much the participants in the thing act to preserve it. in each case, we can recognize some unreliable pattern of membrane defense. typically the membrane gets stretched through communication tubes, I think? consider how loss of internet feels like something cutting a stretched membrane that was connecting you.

...this is why I'd expect to need an integral over possible membranes, and the thing we'd need to do to make this perspective useful is find some sturdy, causal-graph information-theoretic metric that reliably identifies agents. compare Discovering Agents

this seems like an obvious consequence of not getting to specify "organism" in any straightforward way; we have to somehow information-theoretically specify something that will simultaneously identify bacteria and organisms, and then we need some sort of weighting that naturally recognizes individual humans. those should end up getting the majority of the weight in the integral, of course, but it shouldn't need to be hardcoded.

Oh.

But why shouldn't it be hardcoded? 

well maybe it can be as a backstop, but what about, idk, dogs? or just humans that aren't in the protected group, e.g. people outside a state?

More generally, what property of the system inside the membrane do we want to make assertions about?

Hm right now I only see asserting properties about the state of the membrane, and not about anything inside

I conjecture that someone will be able to prove that, in expectation over properties of the membrane (call the random variable P), properties P you wish to assert about the state of the membrane without reference to the inside of the membrane are very probably either insufficient, and therefore allow adversaries to "damage" the insides of the membrane, or overly constraining, such that the given P itself "damages" the preferences of the being inside the membrane; where "damage" means "move towards a state dispreferred by a random variable U of what partial utility function the inside of the membrane implies". this is a counting argument, and those have been taking some flak lately, but my point is that we need to do better than simplicity-weighted random properties of the membrane.

What about agents implemented entirely as distributed systems, such as a fully remote company, such that the only coherent membrane you can point to is a "level down", the bodies of the agents participating in the company?

See my recent post, where I talk about membranes in cybersecurity in one of the sections.

Thread of revisions to this post. (This post was originally published on 2024 Jan 3.)

Today, I revised it to be much more clear and complete.

(I don't follow the relationship. clarify or don't at your whim)

Essentially also the "membranes for safety" idea, but Critch takes it in a broader and more civilizational direction.

This comment has many good questions. More generally, I suspect that for any given membrane definition, it would be relatively easy to do either or both:

A - specify multiple easily-stated ways to torture or destroy the agent without piercing the membrane; and/or

B - show that the membrane definition is totally unworkable and inconsistent with other similarly-situated agents having similar membranes.

B is there because you could get around A by saying absurd things like 'well my membrane is my entire state, if nobody pierces that then I will be safe.' If you do, then people will of course need to pierce that membrane all the time, many agents' membranes will constantly be overlapping, and the 'membrane' framework just reduces to some kind of 'implied consent' framework, at which point the 'membrane' isn't doing any work.

I suspect it's not a coincidence that this post focuses on 'membranes' in the abstract rather than committing to any particular conception of what a membrane is and what it means to pierce it. I claim this is because there cannot actually exist any even reasonably precise definition of a 'membrane' that both (a) does any useful analytical work; and (b) could come anywhere close to guaranteeing safety. 

any given

Maybe for most, but I don't know if we can confidently say "forall membrane" and make the statement you follow up with. Can we say anything durable and exceptionless about what it looks like for there to be a membrane through which passes no packet of information that will cause a violation, but which allows packets of information which do not cause a violation? Can we say there isn't? You're implying there isn't anything general we can say, but you didn't make a locally-valid step-by-step claim; you proposed a statement without a way to show it in the most general, details-erased case.

absurd things like

whether it's absurd is yet to be shown, imo, though it very well could be

well my membrane is my entire state, if nobody pierces that then I will be safe

well, like, we're mostly talking about locality here, right? it doesn't seem to be weird to say it has to be your whole state. but -

people will of course need to pierce that membrane all the time

right, the thing that we have to nail down here is how to derive from a being what their implied ruleset should be about what interactions are acceptable. compare the immune system, for example. I don't think we get to avoid doing a CEV, but I think boundaries are a necessary type in defining CEVs, because --

many agents' membranes will constantly be overlapping

this is where I think things get interesting: I suspect that any reasonable use of membranes as a type is going to end up being some sort of integral over possible membrane worldsheets or something. in other words, it's an in-retrospect-maybe-trivial-but-very-not-optional component of an expression that has several more missing parts.

I realized I might not have been clear above: by "state" I meant "one of the fifty United States", not "the set of all stored information that influences an agent's actions, when combined with the environment". I think that is absurd. I agree it hasn't been shown that the other meaning of "state" is an absurd definition.