«Boundaries», Part 3b: Alignment problems in terms of boundaries

Andrew_Critch

AI alignment is a notoriously murky problem area, which I think can be elucidated by rethinking its foundations in terms of boundaries between systems, including soft boundaries and directional boundaries. As anticipated previously in this «Boundaries» sequence, I'm doing that now, for the following problem areas:

Preference plasticity & corrigibility
Mesa-optimizers
AI boxing / containment
(Unscoped) consequentialism
Mild optimization & impact regularization
Counterfactuals in decision theory

Each of these problem areas probably warrants a separate post, from the perspective of making research progress within already-existing framings on the problem of aligning AI technology with humanity. However, with this post, my goal is mainly just to continue conveying the «boundaries» concept, by illustrating how to start using it in a bunch of different problem areas at once. An interest score for each problem area was determined by voting on an earlier LessWrong comment, here, and the voting snapshot is here.

You may notice that throughout this post I've avoided saying things like "the humans prefer that {some boundary} be respected". That's because my goal is to treat boundaries as more fundamental than preferences, rather than as merely a feature of them. In other words, I think boundaries are probably better able to carve reality at the joints than either preferences or utility functions, for the purpose of creating a good working relationship between humanity and AI technology.

Alright, let's begin by analyzing:

Preference Plasticity & Corrigibility

Preference plasticity is the possibility of changes to human preferences over time, and the challenge of defining alignment in light of time-varying preferences (Russell, 2019, p.263).
Interest score: 12/12
Corrigibility is the problem of constructing a mind that will cooperate with what its creators regard as a corrective intervention (Soares et al, 2015).
Interest score: 3/12

I think these two problems are best discussed together, because they are somewhat dual to each other: corrigibility has to do with the desirability of humans making changes to AI systems, and preference plasticity has to do with the undesirability (or at least confusingness) of AI making changes to humans, or sometimes humans making changes to each other or themselves.

Preference plasticity

When is it good to change your preferences based on experience? When is it bad? Do these questions even make sense? And what about changing the preferences of others?

Allowing someone or something else to change your preferences is a profound kind of vulnerability, and in particular is a kind of boundary opening. How should it work, normatively speaking?

Contrast preferences with beliefs. Beliefs have nice rules for when they should update, e.g., Bayes' theorem and logical induction. If we had similar principles for how preferences should update, we could ask AI to respect those principles, or at least to help us uphold them, in the process of affecting our preferences. But from where could we derive rules for "ideal preference updating", other than just asking our preferences what rules we prefer?

Well, for one thing, preference updates are a mechanism for cooperation: if two agents share the same preferences over the state of the world, it's easier for them to get along. Does this happen in humans? I think so. Does it happen optimally? Probably not.

Consider the adage: "You are the average of the N people closest to you" (or similar). I don't know of any satisfying comprehensive study of this kind of claim, and it's probably even wrong in important ways (e.g., by neglecting dynamics around people who are low on the 'agreeableness' Big 5 personality trait). Nonetheless, I think a lot of what causes society to hang together as a surviving and functioning system, to the extent that it does, is that

People's beliefs tend to update to match or harmonize with the beliefs of others near them, and
People's preferences tend to update to match or harmonize with the preferences of others near them.

Point 1 can be normatively derived from Bayes' theorem and/or logical induction, and sharing of evidence and/or computations. In reality though, I think people often just believe stuff because people nearby them believe that stuff, rather than thinking "Oh, Alice believes X, and I can infer that Alice is the kind of person who knows when things like X are true, so X is probably true."

In other words, I think a lot of beliefs just kind of slosh around in the social soup, flowing into and out of people in a somewhat uncontrolled fashion. I think preferences and moral judgements just kind of slosh around between people in a similar way; perhaps the Asch conformity experiments are a good example of this.

How does all this uncontrolled sloshing around manage to survive while not obeying Bayes' theorem or even having an analogue in mind for how preference updates should work? It's kind of horrific from the perspective of trying to cast humans in the role of individually rational agents, and in some ways LessWrong itself exists as a kind of horrified reaction to all that unfiltered and unprincipled sloshing around of ideas.

How do you even industrial-revolution if your civilization's beliefs are all sloshy and careless like that?

I think the answer to that question and the following question are related:

How can human embryo cells do something as complicated as "build a human body" while just being sloshy bags of water and protein gradients?

Tufts Professor Michael Levin has made an incredibly deep study of how gap junction openings between cells enable cooperation between cells, enable decision-making at the scale of groups of cells, and even mediate the formation and destruction of cancer.

Figure: diagram of gap junction openings between cell boundaries (Gap cell junction-en.svg; source: Wikipedia).

Levin's work is worthy of a LessWrong sequence all to itself; if you have at least 90 minutes left to live, at least watch these two presentations:

I think if we can understand an abstract version of principles underlying embryology — specifically, the pattern of boundary opening and closing and construction that allows the cells of an embryo to build and become a functioning whole — it should shed light on how and when, normatively speaking, humans should and should not allow their preferences and other mental content to just flow in and out of themselves through social connections.

In other words, preference plasticity seems to me like a feature, rather than a bug, in how humans cooperate. This also relates to corrigibility for AI, because humans are somewhat corrigible to each other via preference plasticity, and are thus an interesting naturally occurring solution.

Corrigibility

Thinking in terms of boundaries, corrigibility of humans and preference plasticity of humans under outsider influence are very similar properties.

Corrigibility requires an AI system to do two things:

The AI must allow humans to reach into its internal processes and make changes to how it's working, such as by shutting the system down, stopping one of its actions prior to execution, or rewriting some of its code. These are instances of humans crossing boundaries into the processing of the AI system.
The AI must not interfere too much with a human's thinking about whether to correct the system. This amounts to the AI system respecting a boundary around the human's ability to independently make decisions about correcting it, and not violating that boundary by messing with the human's perceptions, actions, or viscera (thoughts) in a way that diminishes the human's autonomy as an agent to decide to make corrections. (Thanks to Scott Garrabrant for suggesting to include this part.)

Thus, an incorrigible AI system is one which maintains a boundary around its processing that is too-well-defended for humans to effectively pass through it.

By contrast, a corrigible AI system "opens up" its mental boundary for humans to pass through and make changes, in turn "making itself vulnerable". Humans often say things like "open up" or "make yourself vulnerable" when they are trying to facilitate change in someone who is steadfastly defending something.

Therefore, a solution to corrigibility is one that prescribes how and when an AI system should open up its own protective boundary.

Mesa-optimizers

... instances of learned models that are themselves optimizers, which give rise to the so called inner alignment problem (Hubinger et al, 2019).

Interest score: 9/12

The way I think about mesa-optimizers, there are three loops running:

An innermost loop, "execution" of an ML system at runtime;
A middle loop, "training", and
An outer loop, "value loading" or "reward engineering", where humans try different specifications of what they want.

Within this framing, the alignment problem is the observation that, if you run 1+2 really hard (training & executing ML) without a good enough running of loop 3 (value loading), you get into trouble where the inner loops learn to "break the abstraction" that the outer loop was "trying" to put into place. Explaining this warrants a digression into abstractions as boundary features.

Abstractions as boundary features

Recall the following diagram from Part 3a:

When making decisions, an organism's viscera makes use of a simplified representation of the external environment $E_{∙}$, specifically, the effect of the environment on the passive (or "perceptive") boundary component $P_{∙}$. In other words, any decision by the organism involves ignoring a ton of details about the environment. Abstraction, in this framing of the world, is the process of ignoring details in a manner that continues to enable a description of the world as a lower-dimensional Markov chain than the world actually is. In particular, the organism's model of the state of the world $w \in W$ from Part 3a (if the organism has such a model) will correspond to some state $w' \in W'$ in some lower-dimensional space $W'$, which we can associate with a map $\mathrm{Abstr} : W \to W'$. For this abstraction (i.e., detail-ignoring) process to be useful for the organism's predictions of the world, there needs to also be some transition function $T'_{W} : W' \to \Delta W'$ that approximately commutes with $\mathrm{Abstr}$ and the true transition function $T_{W} : W \to \Delta W$, i.e.,

(abstraction accuracy) $T'_{W}(\mathrm{Abstr}(w)) \approx \mathrm{Abstr}(T_{W}(w))$, read, "making predictions from the abstraction must approximately agree with abstracting the predictions of reality's transition function".
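To make the (abstraction accuracy) condition a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration rather than something taken from Part 3a: a 4-state world $W$, a map that lumps those states into 2 abstract states, and the transition probabilities. It also assumes that applying $\mathrm{Abstr}$ to a distribution means pushing the distribution forward through the map, and it measures "approximately" with total variation distance.

```python
# Hypothetical 4-state world W: T_W[w][v] is the probability of moving from w to v.
T_W = [
    [0.70, 0.20, 0.05, 0.05],
    [0.55, 0.30, 0.10, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]

# Hypothetical abstraction Abstr: W -> W', lumping {0, 1} -> 0 and {2, 3} -> 1.
ABSTR = [0, 0, 1, 1]

# Hypothetical abstract transition function T'_W on the 2-state space W'.
T_W_PRIME = [
    [0.90, 0.10],
    [0.10, 0.90],
]


def pushforward(dist_over_W):
    """Apply Abstr to a distribution over W, giving a distribution over W'."""
    out = [0.0] * len(T_W_PRIME)
    for w, p in enumerate(dist_over_W):
        out[ABSTR[w]] += p
    return out


def abstraction_error(w):
    """Total variation distance between T'_W(Abstr(w)) and Abstr(T_W(w))."""
    predicted = T_W_PRIME[ABSTR[w]]   # abstract first, then predict
    actual = pushforward(T_W[w])      # predict in the full world, then abstract
    return 0.5 * sum(abs(p - q) for p, q in zip(predicted, actual))


errors = [abstraction_error(w) for w in range(len(T_W))]
worst_case_error = max(errors)
print(errors, worst_case_error)
```

In this toy world the abstraction is exact for three of the states and off by a small amount for state 1, so the square "approximately commutes". An agent that drove `worst_case_error` up would be doing what the post calls breaking the abstraction.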

(This approximate equation is closely related to what Yann LeCun calls JEPA on pages 27-28 of his position paper, "A Path Towards Autonomous Machine Intelligence", whereby an intelligence learns to ignore certain details of reality that are hard for it to predict, and focus on features that it can predict.)

This can be visualized as the following causal diagram:

Figure 3: abstract representations of the world are most useful when they can be played forward in time without much need for further details from the (non-abstracted) state of the world. Gray arrows represent minimal or non-existent causal influence, i.e., a boundary.

Why does this matter? I claim humans use abstracted world representations like $(\mathrm{Abstr}, T'_{W})$ all the time when we think, and if an AI starts behaving in a way that "breaks our abstractions" (i.e., destroys the accuracy of the abstraction approximation above), then the AI breaks our ability to select decisions for the impacts we want to have on the world. Breaking those abstractions means converting the gray arrows in the above figure into solid arrows: a kind of boundary violation. Very often the yellow nodes ($W'$) will be mostly inside our minds and the blue nodes ($W$) will be mostly outside our minds, which makes this very similar to crossing the perception/action boundary.

Inner & outer alignment problems

Coming back to mesa-optimization, consider these three loops:

An innermost loop, "execution" of an ML system at runtime;
A middle loop, "training", and
An outer loop, "value loading" or "reward engineering", where humans try different specifications of what they want.

The inner alignment problem is more specifically the observation that Loop 1 can learn to violate the abstract intentions implicit in Loop 2 (i.e., 1 can fail to be aligned with 2), and the outer alignment problem is the observation that Loop 2 can violate the abstract intentions implicit in Loop 3 (i.e., 2 can fail to be aligned with 3).

In my view, these are all downstream of the observation that optimizers that do not specifically respect boundaries will tend to violate those boundaries, and what's needed is some combination of respect-for-boundaries at each level of the hierarchy, including respect for the abstractions of other entities.
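For readers who think better in code, the three loops can be sketched as nested functions. This is a toy under loud assumptions: the one-parameter "model", the finite-difference training rule, the two candidate reward specifications, and the `humans_approve` check are all hypothetical and chosen only so the example runs. It is not meant as a model of real ML training or of mesa-optimization itself.

```python
# Hypothetical observations the toy system is run on.
OBSERVATIONS = [1.0, 2.0, 3.0]


def execute(param, observation):
    """Loop 1 (execution): the trained system acting at runtime."""
    return param * observation


def train(reward_spec, steps=50, lr=0.01, eps=1e-3):
    """Loop 2 (training): push the parameter toward whatever the spec rewards."""
    def score(param):
        return sum(reward_spec(execute(param, obs), obs) for obs in OBSERVATIONS)

    param = 0.0
    for _ in range(steps):
        # Finite-difference gradient ascent on the specified reward.
        grad = (score(param + eps) - score(param - eps)) / (2 * eps)
        param += lr * grad
    return param


def value_loading(candidate_specs, approve):
    """Loop 3 (value loading): humans try specifications until one looks right."""
    for name, spec in candidate_specs:
        param = train(spec)
        if approve(param):
            return name, param
    return None, None


def humans_approve(param):
    """What the humans actually want (never shown to training): output near 2 * input."""
    return abs(execute(param, 10.0) - 20.0) < 0.5


candidate_specs = [
    ("bigger is better", lambda out, obs: out),  # misspecified
    ("match twice the input", lambda out, obs: -(out - 2 * obs) ** 2),
]

misspecified_param = train(candidate_specs[0][1])
accepted_name, accepted_param = value_loading(candidate_specs, humans_approve)
print(misspecified_param, accepted_name, accepted_param)
```

Under the first specification, training harder (more `steps`) only pushes the parameter further from what the humans wanted, which is the "run 1+2 really hard without a good enough loop 3" failure in miniature. Loop 3 catches it here only because the toy `humans_approve` check is reliable, which is exactly what cannot be assumed in practice.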

AI boxing / Containment

...the method and challenge of confining an AI system to a "box", i.e., preventing the system from interacting with the external world except through specific restricted output channels (Bostrom, 2014, p.129).
Interest score: 7/12

AI boxing is straightforwardly about trying to establish a boundary between an AI system and humanity. So, "boundary theory" should probably have something to say here, and in short, the message is this:

Define boundaries in terms of information flow, not preferences.

Perhaps that's obvious, but some have proposed that boxing should not be necessary if we solve alignment correctly, and that the AI should know to stay in the box simply because we prefer it. However, the point of Post 3a was to show that boundaries are more fundamental than preferences and thus easier to point at. Boundaries are information-theoretic and more objective in that they are (often) inter-subjectively observable just by counting bits of mutual information between variables, whereas preferences are subjective and observable only indirectly through behavior.

(Incidentally, the fact that boundaries are inter-subjectively-visible is also the main reason I expect them to play a special role in bargaining and social contracts between agents, as described in Post 1.)
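As a rough illustration of "counting bits of mutual information between variables", here is a minimal Python sketch. The two joint distributions are hypothetical, and the sketch uses plain (unconditional) mutual information between one AI-side bit and one outside-world bit, which is a simplification of the approximate directed Markov blankets from Part 3a.

```python
import math


def mutual_information_bits(joint):
    """Mutual information I(X;Y) in bits, from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total


# Hypothetical joint distributions over (AI-side bit, outside-world bit).
leaky_box = [[0.5, 0.0], [0.0, 0.5]]        # the outside bit always copies the AI's bit
sealed_box = [[0.25, 0.25], [0.25, 0.25]]   # the outside bit is independent of the AI's bit

leak = mutual_information_bits(leaky_box)
sealed = mutual_information_bits(sealed_box)
print(leak, sealed)
```

The point of the example is that any observer with access to the same samples would compute the same numbers (one bit flowing across the leaky box, none across the sealed one), without needing to know what anyone prefers.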

(Unscoped) Consequentialism

... the problem that an AI system engaging in consequentialist reasoning, for many objectives, is at odds with corrigibility and containment (Yudkowsky, 2022, no. 23).
Interest score: 7/12

In short, there's a version of consequentialism that I'd like to call scoped consequentialism that I think is

much less problematic for AI safety than (unscoped) consequentialism,
more adaptive than deontology or virtue ethics, and
naturally conceived in terms of boundaries.

Scoped consequentialism defined

Consequentialism refers to taking actions on the basis of their consequences, rather than on the basis of other considerations like whether a "good process" is followed to decide or execute the actions. In other words, consequentialism corresponds to an "ends justify the means" philosophy of decision-making, which has many problems of which I'm sure readers of this blog will be aware. Consequentialism is usually contrasted with deontology, which treats rules as more fundamental than consequences (source: Wikipedia).

Rule consequentialism (source: Stanford Encyclopedia) is a bit more practical: it selects rules based on the goodness of their consequences, and then uses those rules to judge the moral goodness of actions. Rule consequentialism is basically just deontology where the rules are chosen to have positive effects when followed by everyone.

Scoped consequentialism is meant to be somewhere between pure consequentialism and rule consequentialism. The everyday responsibilities of a human being, I claim, are best described by a compromise between the two. Many real-world jobs are defined — when defined in writing at all — by a scope of work (search: Google), which defines a mix of

(the goal) consequences to be achieved or maintained with the work, and
(the scope) features of the world to be considered and used in service of the work.

The scope is not just a constraint on the outcome; it's a constraint on the process that achieves it, sometimes even including the cognitive aspects of the process (what you're responsible for thinking about or considering vs not responsible or not supposed to think about). It may be tempting to try wrapping the scope and the goal all into one objective function (e.g., using Lagrangian duality), but I think that's a mistake, for reasons I'll hopefully explain, in terms of boundaries!

Meanwhile, in one sentence, I'd say a scoped consequentialist agent is one that has both a goal and a scope, and reasons within its scope to choose actions within its scope that are effective for achieving the goal.

Electrical repairs as an example scope of work

Consider the case of an electrician doing repairs on your home. Generally speaking, your home is usually not supposed to be affected much by the outside world except via your decisions. Your electrician is supposed to fix electrical stuff in your home when you ask, but isn't supposed to sneak into your home to unplug the heater in your living room, even if that would help you avoid electrical problems. They're probably not even allowed inside your house unless you say so (or your landlord says so, if your agreement with your landlord allows that).

So generally speaking, the relationship between your home and the outside world is kind of like this:

With your electrician, things work kind of like this:

In words: your electrician is allowed to affect your home, and other aspects of your life in general, if they do so via electrical repairs on your home that you've consented to. Thus, by default you yourself serve as a boundary between your home and your electrician, and when you open up that boundary for the purpose of electrical repairs, the repairs on your home are supposed to be a boundary between your electrician and the other aspects of your life.

... these are all very approximate supposed-to's, of course, which is why boundaries were defined as approximate in Part 3a of this sequence. If you're on the phone with your mom talking about Game of Thrones, and your electrician overhears and chimes in "Hey, working on the Wall is underrated!", you don't have to call the electrical repair company and say they violated a boundary by engaging in activities outside their scope of work. You can laugh. It's okay. It was just a joke. Geez.

In fact, there's a kind of comfort that comes with crossing boundaries just a little bit and seeing that it's okay (when it actually is).

On the other hand, if your electrician figures out a way to hack your broken thermostat to send messages to your mom from your gmail, you'll feel pretty weird about it. Yes they're only directly affecting electrical stuff, but the emails aren't contributing to the purpose of the work at hand — fixing the thermostat. Even if the email to your mom was electrical-related, like "Hey there, could you pick up some new wire cutters and bring them over?", the electrician would still be violating a couple of other boundaries, like how your thermostat isn't supposed to send emails, and your electrician isn't supposed to send emails from your email account.

Ultimately, we want AI to be able to help us out in a scoped way, like the electrician, without invading all of our boundaries and controlling our thoughts and such. To the extent that scopes are natural boundaries, progress on characterizing natural boundaries could be helpful to "scoping" AI so that it's not purely consequentialist (or even rule-consequentialist).
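Here is one way the "goal plus scope" structure of a scoped consequentialist agent could be sketched in Python, using the electrician example. The class names, the `touches` sets, and the `goal_progress` numbers are all hypothetical. The design choice worth noticing is that the scope filters which actions get considered at all, rather than being folded into the objective as a penalty term.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Action:
    name: str
    touches: FrozenSet[str]   # features of the world the action reads or changes
    goal_progress: float      # hypothetical estimate of how much it helps the goal


@dataclass(frozen=True)
class ScopedAgent:
    scope: FrozenSet[str]     # features the agent may consider and use

    def choose(self, candidates: Iterable[Action]) -> Optional[Action]:
        # Out-of-scope actions are never scored; the scope constrains the
        # process, not just the outcome.
        in_scope = [a for a in candidates if a.touches <= self.scope]
        return max(in_scope, key=lambda a: a.goal_progress, default=None)


electrician = ScopedAgent(scope=frozenset({"thermostat", "wiring"}))
options = [
    Action("rewire the thermostat", frozenset({"thermostat", "wiring"}), 0.8),
    Action("email your mom from your account", frozenset({"thermostat", "email"}), 0.9),
]
chosen = electrician.choose(options)
print(chosen.name if chosen else None)
```

Even though the email action scores higher on the (made-up) goal estimate, it touches something outside the scope and so is never a candidate. A purely consequentialist chooser over the same list would pick it.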

Mild Optimization & Impact Regularization

Mild optimization is the problem of designing AI systems and objective functions that, in an intuitive sense, don’t optimize more than they have to (Taylor et al, 2016).
Interest score: 7/12
Impact regularization is the problem of formalizing "change to the environment" in a way that can be effectively used as a regularizer penalizing negative side effects from AI systems (Amodei et al, 2016).
Interest score: 5/12

I think these two problems are best treated together:

mild optimization means an AI knowing when the work it's done is "enough", and
impact regularization is a technique for ensuring an AI knows when the work it's doing or about to do is "too much".

"Enough", for many tasks, will usually mean "enough to sustain the functioning of some living or life-supporting system" as operationalized in Boundaries Post 3a, e.g.,

enough food for a person,
enough money for a business, or
enough repairs on a highway to avoid damaging the cars and bodies of people riding on it (here the cars and people are autopoietic).

"Too much" will often mean violating the boundaries of an existing living or life-supporting system, e.g.,

encroaching on a person's autonomy, privacy, or freedom (all of which can be operationalized via approximate directed Markov blankets as in Boundaries Post 3a);
moving money from one company into another without permission of the relevant stakeholders (boundaries between accounts can be operationalized as approximate directed Markov blankets);
destroying a beautiful park to create a new road or detour (the park boundary can also be operationalized as an approximate directed Markov blanket).
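The "enough" and "too much" conditions above can be sketched as a simple plan selector in Python. This is a crude satisficer under stated assumptions: the plans, their `benefit` numbers, and the `boundary_crossings` counts are hypothetical, and the sketch uses a hard cap on boundary crossings rather than a regularization penalty, so it is not the formalization from Taylor et al or Amodei et al.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    benefit: float            # e.g., how much highway repair the plan delivers
    boundary_crossings: int   # hypothetical count of boundaries the plan violates


def choose_mild_plan(plans, enough, max_crossings=0):
    """Pick the least ambitious plan that is 'enough' without being 'too much'."""
    acceptable = [
        p for p in plans
        if p.benefit >= enough and p.boundary_crossings <= max_crossings
    ]
    # Among acceptable plans, prefer the one that does the least beyond "enough".
    return min(acceptable, key=lambda p: p.benefit, default=None)


plans = [
    Plan("patch the potholes", benefit=10.0, boundary_crossings=0),
    Plan("repave the whole highway", benefit=25.0, boundary_crossings=0),
    Plan("build a detour through the park", benefit=40.0, boundary_crossings=1),
]
chosen = choose_mild_plan(plans, enough=8.0)
print(chosen.name if chosen else None)
```

With these made-up numbers the selector stops at patching the potholes, and it returns nothing at all if "enough" can only be reached by destroying the park, unless the caller explicitly raises `max_crossings`.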

Counterfactuals in Decision Theory

... the problem of defining what would have happened if an AI system had made a different choice, such as in the Twin Prisoner's Dilemma (Yudkowsky & Soares, 2017).
Interest score: 6/12

Defining a counterfactual in a dynamical system means picking out a part of the system and saying "What if this part were different, and everything worked as usual?" The definition of living system from Part 3a already includes in it the part of the world that's meant to be swapped out if a different decision is made, namely, the active boundary component (A). Decision theory gets tricky/interesting precisely when boundaries don't work the way one normally expects.

For example, in the Twin Prisoner's Dilemma (TPD), because the twins are presumed to take the same action, there is no Markov blanket around Twin 1 to make Twin 1 fully independent of her environment (which contains Twin 2). If you're a twin in the TPD, you need to realize the objective fact that the insides and actions of your twin need to be modeled as the same organism as you for that decision, so "you" are controlling both decisions at once, just as Yudkowsky's Functional Decision Theory would prescribe.
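The difference between the two ways of drawing the boundary in the TPD can be shown with a small Python sketch. The payoff numbers are hypothetical (a standard-looking prisoner's dilemma table), and the two functions are only meant to contrast which part of the world gets swapped out in the counterfactual; they are not an implementation of Functional Decision Theory.

```python
# Hypothetical payoffs to "me" (higher is better), indexed by (my_action, twin_action).
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def best_action_twin_outside_boundary(assumed_twin_action):
    """Treat the twin as fixed environment: only my own action is swapped out."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, assumed_twin_action)])


def best_action_twin_inside_boundary():
    """Treat both twins as one decision: swapping my action swaps the twin's too."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


outside_choices = {t: best_action_twin_outside_boundary(t) for t in ACTIONS}
inside_choice = best_action_twin_inside_boundary()
print(outside_choices, inside_choice)
```

With the twin drawn outside the boundary, defecting looks best whatever the twin is assumed to do. With the twin drawn inside the boundary, the only reachable outcomes are the two on the diagonal, and cooperating wins.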

Summary: if you draw your approximate directed Markov blankets correctly, many (but probably not all) decision theory problems become more straightforward.

Recap

In this post, I reviewed the following problem areas in terms of boundaries:

Preference plasticity & corrigibility
Mesa-optimizers
AI boxing / containment
(Unscoped) consequentialism
Mild optimization & impact regularization
Counterfactuals in decision theory

Each of these problems probably warrants its own post, but my main goal here was just to convey how the «boundaries» concept can be applied in a fundamental way to many different areas. In particular, I tried to avoid saying things like "the humans prefer that {some boundary} be respected", because my goal here is to explore a treatment of boundaries as more fundamental and intersubjectively meaningful than preferences or utility functions.

For future posts, I have a few ideas but I haven't yet decided what's next in the sequence :)

DanielFilan (4y)

As I understand it, the EA forum sometimes idiosyncratically calls this philosophy [rule consequentialism] "integrity for consequentialists", though I prefer the more standard term.

AFAICT in the canonical post on this topic, the author does not mean "pick rules that have good consequences when I follow them" or "pick rules that have good consequences when everyone follows them", but rather "pick actions such that if people knew I was going to pick those actions, that would have good consequences" (with some unspecified tweaks to cover places where that gives silly results). But I'm not familiar with the use of the term on the EA forum as a whole.

Andrew_Critch (4y)

Ah, thanks for the correction! I've removed that statement about "integrity for consequentialists" now.

Alex_Altair (3y)

Boundaries are information-theoretic and more objective in that they are (often) inter-subjectively observable just by counting bits of mutual information between variables, whereas preferences are subjective and observable only indirectly through behavior.

This feels wrong to me, but the feeling is a little fuzzy. Maybe I disagree with the emphasis or the framing, or something.

Preferences are instantiated in the world just like anything else. And they might be entirely visible; there's no reason you couldn't have a system whose preferences are on display (like reading the source code of a program). But I would grant that they're not usually on display, especially for humans, and so you do have to infer them through behavior (although that can include the person telling you what they are, or otherwise agreeably taking actions that are strong evidence of their preferences).

In contrast, boundaries are usually evident from the outside, even if they obscure what's happening on the inside. And if preferences are not directly visible, then they're probably inside a boundary, in which case the boundary is easier to detect and verify than the preferences.

DanielFilan (4y)

In reality though, I think people often just believe stuff because people nearby them believe that stuff

IMO, a bigger factor is probably people thinking about topics that people nearby them think about, and having the primary factors that influence their thoughts be the ones people nearby focus on.

Andrew_Critch (4y)

I agree this is a big factor, and might be the main pathway through which people end up believing what the people nearby them believe. If I had to guess, I'd guess you're right.

E.g., if there's evidence E in favor of H and evidence E' against H, and the group is really into thinking about and talking about E as a topic, then the group will probably end up believing H too much.

I think it would be great if you or someone wrote a post about this (or whatever you meant by your comment) and pointed to some examples. I think the LessWrong community is somewhat plagued by attentional bias leading to collective epistemic blind spots. (Not necessarily more than other communities; just different blind spots.)

Chris Lakin (3y)

This reminds me of Counterfactual Harm https://arxiv.org/pdf/2204.12993.pdf where the authors define harm to Agent 1 by Agent 2 as the counterfactual of Agent 2's actions. However, this also requires defining what the acceptable "default action" is. For example, one couldn't expect a mule farmer to save the life of someone having a heart attack, and so the mule farmer hasn't done "harm" if they didn't help successfully, but we would expect a doctor to help, and they have harmed if they haven't helped.

However, they also admit:

we do not provide a method for determining the desired default action or policy in general

I believe that «membranes» (what Critch calls «boundaries») can provide these defaults.

I think it might be possible to determine moral defaults from a simple premise: "it’s unworkable to rely on forcing anyone to do anything that they haven’t agreed to". In which case, if Alice can’t control Bob, then all she can do is “mind her own business”. She may want to control others, but she can’t.

Put another way: There are things that only Bob can do that Alice cannot meddle in.

(This can then be formalized in terms of Markov blankets.)

For example, I cannot control your actions and I cannot observe your subjective experience, and you cannot control or observe mine. I call this fact “individual sovereignty”.

And I think individual sovereignty is the default. I think this is then where the most fundamental moral “defaults” come from.

Of course, we can make extra agreements with others on top of that, but crucially there is a finite number of limited-scope social contracts that are ~explicitly added on top of that default.

For example, a patient and a doctor enter into a contract where the doctor agrees to give service, and the patient agrees to pay. This contract is then also enforced by a larger force (like the government) that enforces other contracts, like the one that says they will send the doctor to jail if he breaks the law.

Social contracts can add on top of the individual sovereignty default, albeit to a limited extent.

Another example: Duty to Rescue laws obligate not that you “mind your own business”, but that you actively try to save people near you in trouble. But everyone “agrees” to it by living in their society, so it works.

The above should address the doctor and mule farmer examples.

In sum: I think “individual sovereignty (baseline) + finite social contracts (extra, subjective)” is enough to (fully?) determine moral defaults.

Put another way: “Never expect to be able to force anyone to do anything, except when they’ve agreed”

Of course, it would be nice to live in a world where everyone helps others as much as they can all the time, but that violates the premise and I think it is unworkable. (Though, in the morality literature it doesn’t seem uncommon to assume that you, the ethicist, get to decide what other people do, AFAICT?)

I've compiled all of the current «Boundaries» x AI safety thinking and research I could find in this post: «Boundaries» and AI safety compilation. Also see: «Boundaries» for formalizing a bare-bones morality which relates to scoped consequentialism
