Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We have a computational graph (aka circuit aka causal model) representing an agent and its environment. We’ve chosen a cut through the graph to separate “agent” from “environment” - i.e. a Cartesian boundary. Arrows from environment to agent through the boundary are “observations”; arrows from agent to environment are “actions”.
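Here's a minimal sketch of that setup, assuming the graph is stored as a plain adjacency dict. The node names and the `classify_boundary_edges` helper are made up for illustration; the only thing taken from the text is the rule that edges crossing the cut inward are observations and edges crossing outward are actions.

```python
# Hypothetical toy graph: each key is a node, each value lists the nodes it feeds into.
edges = {
    "sun": ["eye"],
    "eye": ["brain"],
    "brain": ["hand"],
    "hand": ["door"],
    "door": [],
}

def classify_boundary_edges(edges, agent_nodes):
    """Split the edges crossing the agent/environment cut into observations and actions."""
    observations, actions = [], []
    for src, targets in edges.items():
        for dst in targets:
            src_inside = src in agent_nodes
            dst_inside = dst in agent_nodes
            if dst_inside and not src_inside:
                observations.append((src, dst))  # environment -> agent
            elif src_inside and not dst_inside:
                actions.append((src, dst))       # agent -> environment
    return observations, actions

# The choice of which nodes count as "agent" is exactly the Cartesian boundary.
agent = {"eye", "brain", "hand"}
observations, actions = classify_boundary_edges(edges, agent)
print(observations)  # [('sun', 'eye')]
print(actions)       # [('hand', 'door')]
```

Changing the `agent` set changes which edges count as actions and observations, which is the sense in which the boundary is a modeling choice.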

 

Presumably the agent is arranged so that the “actions” optimize something. The actions “steer” some nodes in the system toward particular values.

Let’s highlight a few problems with this as a generic agent model…

Microscopic Interactions

My human body interfaces with the world via the entire surface area of my skin, including molecules in my hair randomly bumping into air molecules. All of those tiny interactions are arrows going through the supposed “Cartesian boundary” around my body. These don’t intuitively seem like “actions” or “observations”, at least not beyond some high-level observations of temperature and pressure.

In general, low-level boundaries will have lots of tiny interactions crossing them which don’t conceptually seem like “actions” or “observations”.

Flexible Boundaries

When I’m driving, I often identify with the car rather than with my body. Or if I lose a limb, I stop identifying with the lost limb. (Same goes for using the toilet - I’ve heard that it’s quite emotionally stressful for children during potty training to throw away something which came from their physical body, because they still identify with it.)

In general, it’s ambiguous what Cartesian boundary to use; our conceptual boundaries around an “agent” don’t seem to correspond perfectly to any particular physical surface.

An Agent Optimizing Its Own Actions

I could draw a supposed “Cartesian boundary” around a rock, and declare all the interactions between the rock and its environment to be “actions” and “observations”. If someone asks what the rock is optimizing, I’ll say “the actions” - i.e. the rock “wants” to do whatever it is that the rock in fact does.

In general, we intuitively conceive of “agents” as optimizers in some nontrivial sense. Optimizing actions doesn’t cut it; we generally don’t think of something as an agent unless it’s optimizing something out in the environment away from itself.

Solution: Optimization At A Distance

Let’s solve all of these problems in one fell swoop.

We’ll start with the rock problem. One natural answer is to declare that we’re only interested in agents which optimize things “far away” from themselves. What does that mean? Well, as long as we’re already representing the world as a computational DAG, we might as well say that two chunks of our computational DAG are “far apart” when there are many intermediate layers between them. Like this:

 

If you’ve read the Telephone Theorem post, it’s the same idea.
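One rough way to put a number on "far apart" is sketched below. It assumes that the length of the shortest directed path between two chunks of the DAG is an acceptable stand-in for "number of intermediate layers"; that proxy, the `causal_distance` helper, and the party-planning node names are all illustrative assumptions, not definitions from the Telephone Theorem post.

```python
from collections import deque

def causal_distance(edges, sources, targets):
    """Length (in edges) of the shortest directed path from any source to any target.

    Returns None when no target is reachable from the sources.
    """
    targets = set(targets)
    seen = set(sources)
    queue = deque((node, 0) for node in sources)
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical DAG for the party example: the action reaches the party only
# through a chain of intermediate states, but reaches the finger directly.
edges = {
    "send_invites": ["guests_informed", "finger_motion"],
    "guests_informed": ["guests_rsvp"],
    "guests_rsvp": ["guests_arrive"],
    "guests_arrive": ["party"],
}

far = causal_distance(edges, ["send_invites"], ["party"])           # 4 edges, 3 layers in between
near = causal_distance(edges, ["send_invites"], ["finger_motion"])  # 1 edge, nothing in between
print(far, near)
```

Under this proxy the party is "far" from the action and the finger motion is "near", matching the intuition that an agent optimizing only its own motions is not doing anything interesting.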

For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)

This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent. If I'm a reinforcement learner optimizing for some reward I’ll receive later, that later reward is still typically far away from my current actions. My actions impact the reward via some complicated causal path through the environment, acting through many intermediate layers.

So we’ve ruled out agents just “optimizing” their own actions. How does this solve the other two problems?

Abstract Summaries

We’re using the same kind of model and the same notion of “far apart” as the Telephone Theorem, so we can carry that theorem over. The main takeaway is that far apart things interact only via a typically-relatively-small “abstract summary”. This summary consists of the information which is arbitrarily well conserved as it propagates through the intermediate layers.
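The qualitative claim can be illustrated with a toy calculation. This is a sketch under strong assumptions, not the theorem itself: each "layer" is modeled as a binary symmetric channel that flips a single uniform bit with a fixed probability, and the helper names are hypothetical. A bit that is never flipped stands in for perfectly conserved information; a bit that is flipped 10% of the time per layer stands in for everything else.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a coin with bias q."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def info_after_layers(flip_prob, n_layers):
    """Mutual information (bits) between a uniform bit and its copy after n noisy layers.

    Each layer independently flips the bit with probability flip_prob, so the
    composed channel flips it with probability (1 - (1 - 2p)^n) / 2.
    """
    effective_flip = (1 - (1 - 2 * flip_prob) ** n_layers) / 2
    return 1.0 - binary_entropy(effective_flip)

for n in (1, 5, 20, 100):
    conserved = info_after_layers(0.0, n)  # perfectly conserved variable
    noisy = info_after_layers(0.1, n)      # slightly noisy variable
    print(n, round(conserved, 4), round(noisy, 4))
```

The conserved bit still carries its full one bit of information after any number of layers, while the information carried by the noisy bit shrinks toward zero as layers are added. Only the conserved part is left to act as the "summary" at a distance.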

Because the agent only interacts with the far away things-it’s-optimizing via a relatively-small summary, it’s natural to define the “actions” and “observations” as the contents of the summary flowing in either direction, rather than all the low-level interactions flowing through the agent’s supposed “Cartesian boundary”. That solves the microscopic interactions problem: all the random bumping between my hair/skin and air molecules mostly doesn’t impact things far away, except via a few summary variables like temperature and pressure.

This redefinition of “actions” and “observations” also makes the Cartesian boundary flexible. The Telephone Theorem says that the abstract summary consists of information which is arbitrarily well conserved as it propagates through the intermediate layers. So, the summary isn’t very sensitive to which layer we declare to be “the Cartesian boundary”; we can move the boundary around quite a bit without changing the abstract “agent” we’re talking about. (Though obviously if we move the Cartesian boundary to some totally different part of the world, that may change what “agent” we’re talking about.) If we want, we could even stop thinking of the boundary as localized to a particular cut through the graph at all.

Aside: Dynamic Programming

When Adam Shimi first suggested to me a couple years ago that “optimization far away” might be important somehow, one counterargument I raised was dynamic programming (DP): if the agent is optimizing an expected utility function over something far away, then we can use DP to propagate the expected utility function back through the intermediate layers to find an equivalent utility function over the agent’s actions:

This isn’t actually a problem, though. It says that optimization far away is equivalent to some optimization nearby. But the reverse does not necessarily hold: optimization nearby is not necessarily equivalent to some optimization far away. This makes sense: optimization nearby is a trivial condition which matches basically any system, and therefore will match the interesting cases as well as the uninteresting cases.
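To make the DP step concrete, here's a minimal sketch assuming each intermediate layer is a finite set of states with a known transition-probability table. The tables, the utility values, and the `backpropagate_utility` helper are all hypothetical numbers and names chosen for illustration.

```python
def backpropagate_utility(layers, far_utility):
    """Push a utility over the far-away variable back through stochastic layers.

    layers[k][i][j] is the (assumed) probability that state i of layer k
    leads to state j of layer k+1. Returns the expected utility of each
    state of the first layer, i.e. of each action.
    """
    utility = list(far_utility)
    for transition in reversed(layers):
        utility = [
            sum(p * u for p, u in zip(row, utility))
            for row in transition
        ]
    return utility

# Hypothetical numbers: 2 actions -> 3 intermediate states -> 2 far-away outcomes.
layers = [
    [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]],
    [[0.9, 0.1],
     [0.5, 0.5],
     [0.2, 0.8]],
]
far_utility = [0.0, 1.0]  # the agent only cares about the second far-away outcome

action_utility = backpropagate_utility(layers, far_utility)
best_action = max(range(len(action_utility)), key=action_utility.__getitem__)
print(action_utility, best_action)
```

The resulting `action_utility` is a utility function defined directly on actions that picks out the same best action as the far-away objective. Going the other way is not guaranteed: an arbitrary list of numbers over actions need not arise from any utility over the far-away outcomes for a given set of layers.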

(Note that I haven’t actually demonstrated here that optimization at a distance is nontrivial, i.e. that some systems do optimize at a distance and others don’t; I’ve just dismissed one possible counterargument. I have several posts planned on optimization at a distance over the next few weeks, and nontriviality will be in one of them.)

Mental Picture

I like to picture optimization at a distance like a satellite dish or phased array:

Lots of little “actions” produce a strong coherent influence, which can propagate far away to impact the optimization target.

In a phased array, lots of little antennas distributed over an area are all controlled simultaneously, so that their waves add up to one big coherent wave which can propagate over a long distance. Optimization at a distance works the same way: there’s lots of little actions distributed over space/time, all controlled in such a way that their influence can add up coherently and propagate over a long distance to optimize some far-away target.
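The coherent-addition effect is easy to check numerically. The sketch below assumes an idealized uniform linear array of identical unit-amplitude elements and computes only the far-field sum of phasors; the `array_amplitude` helper and the specific numbers (16 elements, half-wavelength spacing, a 30 degree steering angle) are illustrative choices.

```python
import cmath
import math

def array_amplitude(n_elements, spacing_wavelengths, steer_deg, look_deg):
    """Far-field amplitude of a uniform linear array steered toward steer_deg, observed at look_deg."""
    steer = math.sin(math.radians(steer_deg))
    look = math.sin(math.radians(look_deg))
    total = 0j
    for k in range(n_elements):
        # Path-length phase toward the look direction, minus the phase shift applied to element k.
        phase = 2 * math.pi * spacing_wavelengths * k * (look - steer)
        total += cmath.exp(1j * phase)
    return abs(total)

on_target = array_amplitude(16, 0.5, 30, 30)   # all 16 contributions line up
off_target = array_amplitude(16, 0.5, 30, 0)   # contributions cancel in this direction
print(on_target, off_target)
```

In the steered direction the 16 small contributions add up to an amplitude of 16, while in the off-target direction they cancel almost exactly. The analogy to optimization at a distance is that many small, individually weak actions only have a large far-away effect when they are coordinated.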

13 comments

Embedded agents have a spatial extent. If we use the analogy between physical spacetime and a domain of computation of environment, this offers interesting interpretations for some terms.

In a domain, counterfactuals might be seen as points/events/observations that are incomparable in specialization order, that is points that are not in each other's logical future. Via the spacetime analogy, this is the same as the points being space-like separated. This motivates calling collections of mutually counterfactual (incomparable) events logical space, in the same sense as events comparable in specialization order follow logical time. (Some other non-Frechet spaces would likely give more interesting space-like subspaces than a domain typical for program semantics.)

An embedded agent extant in logical space of an environment (at a particular time) is then a collection of counterfactuals. In this view, an agent is not a specific computation, but rather a collection of possible alternative behaviors/observations/events of an environment (resulting from multiple different computations), events that are counterfactual to each other. The logical space an agent occupies comprises the behaviors/observations/events (partial-states-at-a-time) of possible environments where the agent has influence.

In this view, counterfactuals are not merely phantasmal decision theory ideas developed to make sure that reality doesn't look like them, hypothetical threats that should never obtain in actuality. Instead, they are reified as equals to reality, as parts of the agent, and an agent's description is incomplete without them. This is not as obvious as with parts of a physical machine because usually each small part of a machine doesn't contain a precise description of the whole machine. With agents, an actual agent suggests quite strongly what its counterfactual behaviors would be in the adjacent possible environments, at least given a decision theory that interprets such things. So this resembles a biological organism where each cell has a blueprint for the whole body: each expression of counterfactual behavior of an embedded agent carries the whole design of the agent, sufficient to reconstruct its behavior in the other counterfactuals. But this point of view suggests that this is not a necessary property of embedded agents: counterfactuals might have independent content, as other parts of a larger design.

For counterfactuals in decision theory, this cashes out as imperfect ability of an agent to know what it does in counterfactuals, or as coordination with other agents that have different designs in different counterfactuals, acausal trade across logical space. So there is essentially nothing new, the notion of "logical space" and of agents having extent in logical space adds up to normality, extending the title of a singular "agent" to a collective of agents with different designs that are mutually counterfactual and are engaged in acausal trade with each other, parts of the collective. It is natural to treat different parties engaged in acausal trade as parts of a whole since they interact and influence each other's behavior. With sufficient integration, it becomes more central to call the whole collective "an agent" instead of privileging views that only focus on one part (counterfactual) at a time.

Logical space is an unusual notion of counterfactuals, because different points of a logical space can have a common logical future, that is different counterfactuals can contribute to the same future logical event, be in that event's past. This is not surprising given acausal trade and predictors that ask what a given agent/computation does in multiple counterfactual situations. But it usefully runs counter to the impression that counterfactuals necessarily irrevocably diverge from each other, embed a mutual contradiction that prevents them from ever being reunited in a single possibility.

If someone asks what the rock is optimizing, I’ll say “the actions” - i.e. the rock “wants” to do whatever it is that the rock in fact does.

This argument does not seem to me like it captures the reason a rock is not an optimiser? 

I would hand wave and say something like: 

"If you place a human into a messy room, you'll sometimes find that the room is cleaner afterwards. If you place a kid in front of a bowl of sweets, you'll soon find the sweets gone. These and other examples are pretty surprising state transitions, that would be highly unlikely in the absence of those humans you added. And when we say that something is an optimiser, we mean that it is such that, when it interfaces with other systems, it tends to make a certain narrow slice of state space much more likely for those systems to end up in."

The rock seems to me to have very few such effects. The probability of state transitions of my room is roughly the same with or without a rock in a corner of it. And that's why I don't think of it as an optimiser.

Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.

A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.

I like this framework, but I think it's still tricky to say how to draw lines around agents/optimization processes.

For instance, I can think of ways to make a rock interact with far away variables by, e.g., coupling it to a human who presses various buttons based on the internal state of the rock. In this case, would you draw the boundary around both the rock and the human and say that that unit is "optimizing"?

That seems a bit weird, given that the human is clearly the "optimizer" in this scenario.  And drawing a line around only the rock or only the human seems wrong too (human is clearly using the rock to do this strange optimization process and rock is relying on the human for this to occur). Curious about your thoughts. 

Also, I'm not sure that agents always optimize things far away from themselves. Bacteria follow chemical gradients (and this feels agent-y to me), but the chemicals are immediately present both temporally and spatially. There is some sense in which bacteria are "trying" to get somewhere far away (the maximum concentration), but they're also pretty locally achieving the goal, i.e., the actions they take in the present are very close in space and time to what they're trying to achieve (eat the chemicals). 

Bacteria follow chemical gradients (and this feels agent-y to me), but the chemicals are immediately present both temporally and spatially.

Subtle point here: most of the agenty things which persist over time (like humans or bacteria) are optimizing their own future state, and it's that future state which is far away in space/time from their current decision.

For instance, I can think of ways to make a rock interact with far away variables by, e.g., coupling it to a human who presses various buttons based on the internal state of the rock. In this case, would you draw the boundary around both the rock and the human and say that that unit is "optimizing"?

The real answer here is that this post isn't meant to handle that question. Some boundaries are clearly more natural optimizer boundaries than others, but this post is not yet trying to fully say which, it's just laying some groundwork/necessary conditions. One of the necessary conditions which this post does not address is robustness of the optimization to changes in the environment, which is what makes e.g. the rock look like it's not an optimizer.

I cannot find the reference for this despite repeated attempts, but the rock example reminds me of a story I once read in a letter from a student describing a lecture Von Neumann gave.

In this lecture, Von Neumann made a reference to thinking of evolution as a universal principle; the gist of it was that if we replace "have many descendants" with "propagate your information into the future" then atoms are excellent from an evolutionary point of view, because most atoms are very stable and therefore very likely to still exist in the future.

So when asked what the rock is optimizing for, I immediately thought of this story, and that the rock is optimizing for being rock in the future and that minimizing interaction with the environment is probably optimal for this purpose.

Great post!

For instance, if I’m planning a party, then the actions I take now are far away in time (and probably also space) from the party they’re optimizing. The “intermediate layers” might be snapshots of the universe-state at each time between the actions and the party. (... or they might be something else; there are usually many different ways to draw intermediate layers between far-apart things.)

This applies surprisingly well even in situations like reinforcement learning, where we don’t typically think of the objective as “far away” from the agent. If I'm a reinforcement learner optimizing for some reward I’ll receive later, that later reward is still typically far away from my current actions. My actions impact the reward via some complicated causal path through the environment, acting through many intermediate layers.

So we’ve ruled out agents just “optimizing” their own actions. How does this solve the other two problems?

I feel like this is assuming away one of the crucial difficulties of ascribing agency and goal-directedness: lack of competence or non-optimality might make agentic behavior look non-agentic unless you already have a mechanistic interpretation. Separating a rock from a human is not really the problem; it's more like separating something acting like a chimp, but for which you have very little data and understanding, from an agent optimizing to clip you.

(Not saying that this can't be relevant to address this problem, just that currently you seem to assume the problem away)

Because the agent only interacts with the far away things-it’s-optimizing via a relatively-small summary, it’s natural to define the “actions” and “observations” as the contents of the summary flowing in either direction, rather than all the low-level interactions flowing through the agent’s supposed “Cartesian boundary”. That solves the microscopic interactions problem: all the random bumping between my hair/skin and air molecules mostly doesn’t impact things far away, except via a few summary variables like temperature and pressure.

Hmm. I like the idea of redefining action as the consequences of one's action that are observable "far away"; it nicely rederives the observation-action loop through interaction with far away variables. That being said, I'm unsure whether defining the observations as the contents of the summary is itself problematic. I have one intuition that tells me that this is all you can observe anyway, so it's fine; on the other hand, it looks like you're assuming that the agent has the right ontology already? I guess that can be solved by saying that the observations are drawn from the content of the summary, but not necessarily all of it.

When Adam Shimi first suggested to me a couple years ago that “optimization far away” might be important somehow, one counterargument I raised was dynamic programming (DP): if the agent is optimizing an expected utility function over something far away, then we can use DP to propagate the expected utility function back through the intermediate layers to find an equivalent utility function over the agent’s actions:

This isn’t actually a problem, though. It says that optimization far away is equivalent to some optimization nearby. But the reverse does not necessarily hold: optimization nearby is not necessarily equivalent to some optimization far away. This makes sense: optimization nearby is a trivial condition which matches basically any system, and therefore will match the interesting cases as well as the uninteresting cases.

I think I actually remember now the discussion we were having, and I recall an intuition about counting. Like, there seem to be more ways to optimize nearby than to optimize the specific part of far away, which I guess is what you're pointing at.

In general, low-level boundaries will have lots of tiny interactions crossing them which don’t conceptually seem like “actions” or “observations”.

While this seems obviously true at a low enough level - wherefore art thou, nanotech? - viruses and the like mean that even if systems work well enough most of the time, sometimes they don't keep interactions involving small parts from having big consequences. (Also, what causes cancer aside from smoke and 'genetics'?)

Indeed! This implies that physical smallness is not a perfect correlate of the conceptual "smallness of an interaction".

Really liking this model. It seems to actually deal with the problem of embeddedness for agents and the fact that there is no clear boundary to draw around what we call an agent other than one that's convenient for some purpose.

I've obviously got thoughts on how this is operationalizing insights about "no-self" and dependent origination, but that doesn't seem too important to get into, other than to say it gives me more reason to think this is likely to be useful.

One issue I have with the phased array picture of agency is that it doesn't contain any deep agent -> environment -> agent -> environment -> agent -> ... paths. Handling these paths well is in my view one of the more difficult parts of agency.

Yeah, we can generalize it to something with two-way interaction pretty easily. Then it's more like a combination transmitter/receiver. That's obviously the right way to set things up for agency models.

We both have a similar intuition about the kinds of optimizers we're interested in. You say they optimize things that are "far away", I say they affect "big pieces of the environment". One difference is that I think of big as relative to the size of the agent, but something can be "far away" even if the agent is itself quite large, and it seems that agent size doesn't necessarily matter to your scheme because the information lost over a given distance doesn't depend on whether there's a big agent or a small one trying to exert influence over this distance.

I think agent size (in the sense I'm thinking about it) is mainly relevant from the point of view of "how likely is it for such an agent to come about?" (which suggests something like "large measure, given initial conditions + dynamics" instead of "small size").

Here are some of my thoughts on the issue: https://www.lesswrong.com/posts/me34KqMLwJNYAZKbs/is-evolutionary-influence-the-mesa-objective-that-we-re

I think my scheme needs some distinction between "microstates" and "macrostates" in order to offer a reasonable definition of "big features". Your setup seems to have this fairly naturally in terms of the telephone theorem, though the precise analogy (if there is one) isn't striking me immediately.