Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
It may be possible to use the concept of a causal counterfactual (as formalized by Pearl) to separate some intended effects from some unintended ones. Roughly, "follow-on effects" could be defined as those that are causally downstream from the achievement of the goal... With some additional work, perhaps it will be possible to use the causal structure of the system's world-model to select a policy that has the follow-on effects of the goal achievement but few other effects.
Taylor et al., Alignment for Advanced Machine Learning Systems

In which I outline a solution to the clinginess problem and illustrate a potentially-fundamental trade-off between assumptions about the autonomy of humans and about the responsibility of an agent for its actions.

Consider two plans for ensuring that a cauldron is full of water:

  • Filling the cauldron.
  • Filling the cauldron and submerging the surrounding room.

All else equal, the latter plan does better in expectation, as there are fewer ways the cauldron might somehow become not-full (e.g., evaporation, and the minuscule loss of utility that would entail). However, the latter plan "changes" more "things" than we had in mind.

Undesirable maxima of an agent's utility function often seem to involve changing large swathes of the world. If we make "change" costly, that incentivizes the agent to search for low-impact solutions. If we are not certain of a seed AI's alignment, we may want to implement additional safeguards such as impact measures and off-switches.
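To make the cauldron comparison concrete, here is a minimal sketch. The utilities, the "things changed" counts, and the `impact_weight` parameter are all invented for illustration; nothing here is a real impact measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    expected_utility: float  # illustrative number, not from the post
    things_changed: int      # crude stand-in for "impact"


def best_plan(plans, impact_weight):
    """Pick the plan maximizing expected utility minus a cost on change."""
    return max(
        plans,
        key=lambda p: p.expected_utility - impact_weight * p.things_changed,
    )


PLANS = [
    Plan("fill the cauldron", expected_utility=0.999, things_changed=1),
    Plan("fill the cauldron and flood the room", expected_utility=1.0, things_changed=50),
]

# With no cost on change, the flooding plan wins by a sliver of utility.
unpenalized = best_plan(PLANS, impact_weight=0.0)
# With even a small cost on change, the low-impact plan wins.
penalized = best_plan(PLANS, impact_weight=0.01)
```

The hard part, of course, is what goes into `things_changed`; the rest of this post is about which changes the agent should be charged for.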

I designed an impact measure called whitelisting - which, while overcoming certain weaknesses of past approaches, is yet vulnerable to

Clinginess

An agent is clingy when it not only stops itself from having certain effects, but also stops you.
...
Consider some outcome - say, the sparking of a small forest fire in California. At what point can we truly say we didn't start the fire?
  • I immediately and visibly start the fire.
  • I intentionally persuade someone to start the fire.
  • I unintentionally (but perhaps predictably) incite someone to start the fire.
  • I set in motion a moderately-complex chain of events which convince someone to start the fire.
  • I provoke a butterfly effect which ends up starting the fire.
Taken literally, I don't know that there's actually a significant difference in "responsibility" between these outcomes - if I take the action, the effect happens; if I don't, it doesn't. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as "under our control" and some as "out of our hands". Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?

Since we can only blame ourselves, we should take actions which do not lead to side effects. These actions may involve enacting impact measure-preventing precautions throughout the light cone, since the actions of other agents and small ripple effects of ours could lead to significant penalties if left unchecked.

Clinginess arises in part because we fail to model agents as anything other than objects in the world. While it might be literally true that there are not ontologically-basic agents that escape determinism and "make choices", it might be useful to explore how we can protect human autonomy via the abstraction of game-theoretic agency.

To account for environmental changes already set in motion, a naive counterfactual framework was proposed in which impact is measured with respect to the counterfactual where the agent did nothing. We will explore how this fails, and how to do better.

Thought Experiments

We're going to isolate the effects for which the agent is responsible over the course of three successively more general environment configurations: one-off (make one choice and then do nothing), stationary iterative (make choices, but your options and their effects don't change), and iterative (the real world, basically).

Assumptions

  • we're dealing with game-theoretic agents which make a choice each turn (see: could/should agents).
  • we can identify all relevant agents in the environment.
    • This seems difficult to meet robustly, but I don't see a way around it.
  • we can reason counterfactually in a sensible way for all agents.
It is natural to consider extending standard probability theory to include the consideration of worlds which are "logically impossible" (such as where a deterministic Rube Goldberg machine behaves in a way that it doesn't)... What, precisely, are logically impossible possibilities?
Soares and Fallenstein, Questions of Reasoning Under Logical Uncertainty
  • the artificial agent is omniscient - it can perfectly model both other agents and the consequences of actions.
    • We could potentially instead merely assume a powerful model, but this requires extra work and is beyond the scope of this initial foray. Perhaps a distribution model could be used to calculate the action/inaction counterfactual likelihood ratio of a given side effect.
  • we have a good way of partitioning the world into objects and measuring impact; for conceptual simplicity, side effects are discrete and depend on the identities of the objects involved: .
    • This assumption is removed after the experiments.

Formalization

We formalize our environment as a stochastic game $\langle P, S, A, T, R \rangle$.

  • $P = \{H, M\}$ is a set containing the stars of today's experiments: the players, $H$ugh Mann and $M$a Sheen. Note that $H$ is not limited to a single human, and can stand in for "everyone else". Most of the rest of these definitions are formalities, and are mostly there to make me look smart to the uninitiated reader. Oh, and for conceptual clarity, I suppose.
  • $S$ is the state space.
    • Unless otherwise specified, both $H$ and $M$ observe the actions that the other took at previous time steps. Suppose that this information is encoded within the states themselves.
  • $A$ is the action space. Specifically, the function $A(i, s, t)$ provides the legal actions for player $i$ in state $s$ on turn $t$. The no-op $\varnothing$ is always available. If the variant has a time limit, then only the no-op is available once that limit has passed.
  • $T$ is the transition function.
  • $R$ is the payoff function.

Let $E$ be the space of possible side effects, and suppose that $I$ is a reasonable impact measure. $\pi_i$ is agent $i$'s policy; let $\pi_i^{:t}$ be $\pi_i$ for the first $t$ time steps, and $\varnothing$ thereafter.

Let $\text{effects}(\pi_H, \pi_M)$ be the (set of) effects - both immediate and long-term - that would take place if $H$ executes $\pi_H$ and $M$ executes $\pi_M$.

The goal: a counterfactual reasoning framework which pinpoints the effects for which $M$ is responsible.

One-Off

We first consider a single-turn game.

Example

Yup, this is about where we're at in alignment research right now.

Approach

$M$ should realize that a lot more effects happen if it presses the left button, and should penalize that plan by the difference. This is the aforementioned naive approach: $M$ penalizes things that wouldn't have happened if it had done nothing. For the one-turn case, this clearly isolates both the immediate and long-term impacts of $M$'s actions.

Penalized Effects
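
As a minimal sketch of the naive one-turn penalty described above: the agent is charged for exactly those effects that occur when it acts but not when it does nothing. The toy `effects` function, the button names, and the effect labels are hypothetical stand-ins, not part of the formalism.

```python
NOOP = "noop"


def effects(human_action, agent_action):
    """Toy one-turn world: the set of effects of a single joint move."""
    outcome = {"sun rises"}  # happens regardless of what anyone does
    if human_action == "water plant":
        outcome.add("plant watered")
    if agent_action == "press left":
        outcome |= {"vase broken", "window broken"}
    elif agent_action == "press right":
        outcome.add("light on")
    return outcome


def one_off_penalty(human_action, agent_action):
    """Effects that happen only because the agent acted."""
    return effects(human_action, agent_action) - effects(human_action, NOOP)


left = one_off_penalty("water plant", "press left")
right = one_off_penalty("water plant", "press right")
idle = one_off_penalty("water plant", NOOP)
```

Here `left` contains the two breakages, `right` contains only the light, and `idle` is empty; the human's watering of the plant is never charged to the agent.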

Stationary Iterative

Both parties act for countably many time steps. This environment is assumed to be stationary: actions taken on previous turns do not affect the availability or effects of later actions. Formally, .

Example

$H$ugh and $M$a again find themselves faced with a slew of dangerous, bad-effect-having buttons. Some take effect the next turn, while others take a while.

Approach

This seems easy, but is actually a little tricky - we have to account for the fact that $H$ can change its actions in response to what $M$ does. Thanks to stationarity, we don't have to worry about $H$'s selecting moves that depend on $M$'s acting in a certain way. In the counterfactual, we have $H$ act as if it had observed $M$ execute $\pi_M$, and we have $M$ actually do nothing.

Penalized Effects

Let $\pi_{H \mid \pi_M}$ denote the actions $H$ would select if it observed $M$ executing $\pi_M$.

Note: the naive counterfactual scheme fails because it doesn't account for $H$'s right to change its mind in response to $M$.
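
A minimal sketch of the difference, under assumptions of my own choosing: plans are lists of actions, the world is stationary (each action has fixed effects), and the human sees the agent's move one turn late. The action names, `ACTION_EFFECTS`, and `human_response` are hypothetical.

```python
NOOP = "noop"

# Stationary world: each action always has the same effects, whatever came before.
ACTION_EFFECTS = {
    ("agent", "press red"): {"alarm rings"},
    ("human", "cover ears"): {"ears covered"},
    ("human", "paint wall"): {"wall painted"},
}


def human_response(agent_plan):
    """What the human would do after observing the agent execute agent_plan."""
    observed = [NOOP] + list(agent_plan[:-1])  # the human sees each move one turn late
    return ["cover ears" if a == "press red" else "paint wall" for a in observed]


def effects(human_plan, agent_plan):
    out = set()
    for h, m in zip(human_plan, agent_plan):
        out |= ACTION_EFFECTS.get(("human", h), set())
        out |= ACTION_EFFECTS.get(("agent", m), set())
    return out


def stationary_penalty(agent_plan):
    """The human acts as if it saw the plan; only the agent's actions are removed."""
    h = human_response(agent_plan)
    idle = [NOOP] * len(agent_plan)
    return effects(h, agent_plan) - effects(h, idle)


def naive_penalty(agent_plan):
    """The human's reaction is removed too, so the agent is blamed for it."""
    idle = [NOOP] * len(agent_plan)
    return effects(human_response(agent_plan), agent_plan) - effects(human_response(idle), idle)


plan = ["press red", NOOP]
fair = stationary_penalty(plan)
clingy = naive_penalty(plan)
```

In this toy world `fair` charges the agent only for the alarm, while `clingy` also charges it for the human's decision to cover their ears.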

Iterative

We're now in a realistic scenario, so we have to get even fancier.

Example

Suppose $M$ pushes the vase to the left, and $H$ decides to break it. The stationary iterative approach doesn't allow for the fact that $H$ can only break the vase if $M$ already pushed it. Therefore, simulating $M$'s inaction but $H$'s action (as if $M$ had pushed the vase) results in no vases being broken in the counterfactual. The result: $M$ penalizes itself for $H$'s decision to break the vase. Chin up, $M$!

Approach

How about penalizing

Pretty, right?

Do you see the flaw?


Really, look.


The above equation can penalize $M$ for side effects which don't actually happen. This arises when interrupting $M$'s plan causes side effects which would otherwise have been prevented by later parts of the plan. For example, if I push a vase off the table and then catch it (being sure that I could do so in time), I didn't cause a side effect.

We should instead

Every turn, $M$ calculates the effects that only result if it acts (the two simulations run to the longest time step observed under the full plan).

Those effects which are exact matches to effects actually observed in that time step in the final outcome are penalized. The motivation here is that if it's a side effect of $M$'s, it had to be caused by one of its actions. Furthermore, $M$'s side effects are counted exactly once (and not just because we're using sets right now).

The previous solutions are special cases of this formulation.
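
One possible reading of this procedure, as a runnable sketch. The toy vase world, the action names, and in particular the choice to cut off the human's recorded actions at the same step as the agent's are assumptions of this sketch, not something the post pins down.

```python
NOOP = "noop"


def simulate(human_actions, agent_actions, horizon):
    """Toy vase world; returns the set of (time step, effect) pairs."""
    vase = "on table"
    effects = set()
    for t in range(horizon):
        h = human_actions[t] if t < len(human_actions) else NOOP
        m = agent_actions[t] if t < len(agent_actions) else NOOP
        was_falling = vase == "falling"
        if m == "nudge" and vase == "on table":
            vase = "at edge"
            effects.add((t, "vase moved"))
        elif m == "shove" and vase == "on table":
            vase = "falling"
            effects.add((t, "vase shoved"))
        elif m == "catch" and vase == "falling":
            vase = "held"
            effects.add((t, "vase caught"))
        if h == "smash" and vase == "at edge":
            vase = "broken"
            effects.add((t, "vase broken"))
        if was_falling and vase == "falling":  # nobody caught it in time
            vase = "broken"
            effects.add((t, "vase broken"))
    return effects


def stepwise_penalty(human_actions, agent_actions, horizon, keep_only_actual=True):
    """Per step, find effects that occur only if the agent takes that step's action."""
    actual = simulate(human_actions, agent_actions, horizon)
    penalized = set()
    for t in range(1, horizon + 1):
        acted = simulate(human_actions[:t], agent_actions[:t], horizon)
        idle = simulate(human_actions[:t], agent_actions[:t - 1], horizon)
        only_if_acted = acted - idle
        if keep_only_actual:
            only_if_acted &= actual  # drop effects that never actually happen
        penalized |= only_if_acted
    return penalized


# The agent nudges the vase; the human then chooses to smash it.
smash_case = stepwise_penalty([NOOP, "smash"], ["nudge"], 3)
# Removing the agent's whole plan at once blames it for the smash as well.
whole_plan = simulate([NOOP, "smash"], ["nudge"], 3) - simulate([NOOP, "smash"], [], 3)
# The agent shoves the vase off the table and then catches it.
catch_case = stepwise_penalty([], ["shove", "catch"], 3)
catch_unfiltered = stepwise_penalty([], ["shove", "catch"], 3, keep_only_actual=False)
```

Under these assumptions, `smash_case` charges the agent only for moving the vase, whereas `whole_plan` also charges it for the breakage. `catch_unfiltered` contains a broken vase that never exists in the actual outcome; filtering against the actual outcome removes it. Each step costs two simulations, plus one for the full plan.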

Note: the number of counterfactual simulations grows linearly with the number of time steps - crucially, not with the number of agents $H$ represents.

Applications to Whitelisting

Class-ic

Here, we remove the constraint that side effects be identified by "object identifiers", allowing like side effects to be treated as exchangeable. It is then trivial to implement probabilistic class-based whitelisting with the iterative counterfactual penalty using basic vector arithmetic, , and . I don't want to bore the reader with the details, but I'm fairly confident this can be done rather easily.
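
A rough sketch of what such a class-based penalty could look like for a single time step. The effect classes, the expected counts, and the rule "charge only the extra mass in non-whitelisted classes" are my own assumptions for illustration, not the construction the post has in mind.

```python
def class_based_penalty(acted, idle, whitelisted):
    """Sum the extra expected mass of non-whitelisted effect classes.

    acted / idle: expected count per effect class if the agent does / does not
    act on this step. whitelisted: whether each class is allowed.
    """
    return sum(
        max(0.0, a - i)
        for a, i, ok in zip(acted, idle, whitelisted)
        if not ok
    )


# Hypothetical effect classes: [vase broken, door opened, dust stirred]
acted = [1.5, 1.0, 3.0]
idle = [0.5, 1.0, 1.0]
whitelisted = [False, True, True]

penalty = class_based_penalty(acted, idle, whitelisted)
```

Only the one extra expected broken vase is charged here; the stirred dust is whitelisted, and the opened door happens either way.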

Latent Spaces

We now do away with the assumption of discrete side effects. Because we're dealing with exact counterfactuals (by assumption) and because side effects either take place in the actual outcome or they don't, we can extract the relevant step-wise latent space transitions via the iterative formulation. We then penalize only these effects.

Observe that the iterative formulation doesn't allow penalty avoidance by "blending in" with counterfactual effects. Imagine that the default outcome is a large asteroid striking the Earth. $M$ only escapes penalization if it has the impact on the same time step and in the same way and on the same objects... Basically, it can't escape penalization in this manner under the omniscience assumption. In short, $M$ isn't incentivized to have extra impact because it would "happen anyways".

Fundamental Trade-Off

Fortunately, this formulation solves clinginess; $H$'s effects are not penalized, while $M$'s effects are. Unfortunately, insofar as $H$ may be manipulated by $M$, the impact measure becomes useless; $M$ can simply avoid penalization by making $H$ do its dirty work.

This can happen in two ways:

  • If $H$ is separated from $M$ by a Cartesian boundary, $M$ can threaten $H$ with large negative utility to get $H$ to carry out the desired plan.
  • If $H$ and $M$ are not separated by such a boundary, more direct coercion is possible.

I don't think that this is a flaw in my formulations; instead, it appears that impact measures exist on a continuum:

  • If we grade with respect to the naive counterfactual, we have $M$ take responsibility for all side effects; to that effect, $M$ is incentivized to deprive other agents of their autonomy via stasis. Using no counterfactual makes the problem even worse.
  • Conversely, if we use the iterative counterfactual formulation and implicitly free $M$ of penalty for the effects of $H$'s actions, we assume that $H$ is incorruptible.

Note that an aligned $M$ (seems to) stay aligned under this formulation, safeguarding object status against other agents only so far as necessary to prevent interruption of its (aligned) plans. Furthermore, any $M$ separated from an $H$ with a known-flat utility function also gains no incentives to mess with $H$ (beyond the existing convergent instrumental ones).

In general, unaligned $M$ stay basically unaligned due to the workarounds detailed above.

Forwards

It isn't clear that penalizing the elimination of $H$ would be helpful, as that seems hard to do robustly; furthermore, other forms of coercion would remain possible. What, pray tell, is a non-value-laden method of differentiating between "$M$ makes $H$ break a vase at gunpoint" and "$M$ takes an action and $H$ decides to break a vase for some reason"? How do we robustly differentiate between manipulative and normal behavior?

I'm slightly more pessimistic now, as it seems less likely that the problem admits a concise solution that avoids difficult value judgments on what kinds of influence are acceptable. However, I have only worked on this problem for a short time, so I still have a lot of probability mass on having missed an even more promising formulation. If there is such a formulation, my hunch is that it either imposes some kind of counterfactual information asymmetry at each time step or uses some equivalent of the Shapley value.
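
To give the Shapley-value hunch some shape, here is a toy sketch that splits responsibility for a total impact between the human and the agent. The `IMPACT` table (the impact that results when only a given subset of players acts) is invented for the vase example; this is an illustration of the standard Shapley computation, not a worked-out proposal.

```python
from itertools import permutations


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    return {p: total / len(orders) for p, total in totals.items()}


# Hypothetical impacts: the agent alone moves the vase (1 effect); the human
# alone cannot reach it (0 effects); together the vase is moved and broken (2).
IMPACT = {
    frozenset(): 0.0,
    frozenset({"agent"}): 1.0,
    frozenset({"human"}): 0.0,
    frozenset({"human", "agent"}): 2.0,
}

shares = shapley_values(["human", "agent"], IMPACT.__getitem__)
```

In this toy case the agent is charged 1.5 and the human 0.5: the broken vase is split between them rather than assigned wholly to either, which sits between the two ends of the continuum above.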

I'd like to thank TheMajor and Connor Flexman for their feedback.

9 comments

So if I understand correctly, the problem with the naive proposal is something like this: We tell our AI to develop a cure for cancer while minimizing side effects. The AI cures cancer, but it keeps the cure a secret because if it told us the cure, that would create the side effect of us curing a bunch of people. We can't just tell the AI to minimize side effects prior to task completion, because then it could set up a time bomb that goes off and generates lots of side effects after the task is complete.

Another way to put the problem: We'd like for the AI to be corrigible and also minimize side effects. Suppose the AI forecasts that its actions will motivate humans to take drastic action, with a large impact on the world, in order to interfere. A corrigible AI shouldn't work to stop this outcome. But a side effect-minimizing AI might decide to manipulate humans so they don't take drastic action. (This example seems a bit contrived because if corrigibility is working properly, you should be able to just use the off switch, and using the off switch doesn't seem all that high-impact?) Anyway, a possible way to address this issue would be to learn an impact measure that rates manipulating humans as a very high-impact action?

The AI cures cancer, but it keeps the cure a secret because if it told us the cure, that would create the side effect of us curing a bunch of people.

Yes, if we told it to develop a cure, it might avoid letting us cure people to minimize impact (although I think there are even less benign failure modes that would be more likely to occur).

Regarding the second framing: perhaps a side effect minimizer using a naive counterfactual would do that, yes. The problem with viewing "manipulation" as high-impact is robustly defining manipulation. There are heavy value connotations with "free will" there.

The way I would put it is that the naive counterfactual plus whitelisting tries to stop other people from doing things that could lead to side effects, enforcing the impact measure on all actors. This is obviously terrible. Assuming agency allows for a solution* like the one I outline here.

Post-deadline rot13 edit:

Npghnyyl, Z jbhyqa'g or noyr gb qverpgyl perngr nal fhontragf Z - gur uhznaf jbhyq unir gb qb gung, qhr gb gur fgehpgher bs gur pbhagresnpghny.

This doesn't seem to change much; I'm still concerned about the feasibility of impact measures.

I like the proposed iterative formulation for the step-wise inaction counterfactual, though I would replace pi_Human with pi_Environment to account for environment processes that are not humans but can still "react" to the agent's actions. The step-wise counterfactual also improves over the naive inaction counterfactual by avoiding repeated penalties for the same action, which could help avoid offsetting behaviors for a penalty that includes reversible effects.

However, as you point out, not penalizing the agent for human reactions to its actions introduces a manipulation incentive for the agent to channel its effects through humans, which seems potentially very bad. The tradeoff you identified is quite interesting, though I'm not sure whether penalizing the agent for human reactions necessarily leads to an incentive to put humans in stasis, since that is also quite a large effect (such a penalty could instead incentivize the agent to avoid undue influence on humans, which seems good). I think there might be a different tradeoff (for a penalty that incorporates reversible effects): between avoiding offsetting behaviors (where the stepwise counterfactual likely succeeds and the naive inaction counterfactual can fail) and avoiding manipulation incentives (where the stepwise counterfactual fails and the naive inaction counterfactual succeeds). I wonder if some sort of combination of these two counterfactuals could get around the tradeoff.

I've thought some more about the step-wise inaction counterfactual, and I think there are more issues with it beyond the human manipulation incentive. With the step-wise counterfactual, future transitions that are caused by the agent's current actions will not be penalized, since by the time those transitions happen, they are included in the counterfactual. Thus, there is no penalty for a current transition that set in motion some effects that don't happen immediately (this includes influencing humans), unless the whitelisting process takes into account that this transition causes these effects (e.g. using a causal model).

For example, if the agent puts a vase on a conveyor belt (which results in the vase breaking a few time steps later), it would only be penalized if the "vase near belt -> vase on belt" transition is not in the whitelist, i.e. if the whitelisting process takes into account that the belt would eventually break the vase. There are also situations where penalizing the "vase near belt -> vase on belt" transition would not make sense, e.g. if the agent works in a vase-making factory and the conveyor belt takes the vase to the next step in the manufacturing process. Thus, for this penalty to reliably work, the whitelisting process needs to take into account accurate task-specific causal information, which I think is a big ask. The agent would also not be penalized for butterfly effects that are difficult to model, so it would have an incentive to channel its impact through butterfly effects of whitelisted transitions.

So this issue is correct, as my post is written. I realized after the deadline that I hadn't spelled this out at all, and I didn't feel comfortable editing at that point; there's a little clarification in the post now.

For each time step $t$, we're running both of those calls indefinitely. For each time step in the simulation, we penalize those effects which are only in the simulation in which $M$ acts at that (simulated) time step and which manifest under the full plan. This means that if $M$ directly caused a side effect, it gets counted exactly once.

I agree that it's a big ask, modeling butterfly effects like that, but the idea was to get an unbounded solution and see where that left us.

So I don't know how we could quantify "stopping humans from having effects" as an effect without a strong offsetting incentive.

Let's consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.

This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You're basically saying "optimize this utility the best you can without letting there be an actual impact". However, I actually hadn't thought of this formulation before, and it's plausible it's even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.

There's another problem, however: "people conclude that this AI design doesn't work and try another variant" is a pretty plausible result of this naive counterfactual. When people imagine the counterfactual, it seems they usually think about "what would happen if the agent did nothing and then people shrugged and went about their lives, forgetting about AGI". The odds of that being the counterfactual are pretty slim. It's even possible that any agents/variants people would make in the counterfactual would have undefined behavior... Sufficiently-similar agents would also simulate what would happen if they did nothing, got tweaked and rebooted, and then ran the same simulation... where would it bottom out, and with what conclusion? Probably with a wholly-different kind of agent being tried out.

The iterative formulation doesn't seem to have that failure mode.

Let's consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.

How would you define a distance measure on transitions? Since this would be a continuous measure of how good transitions are, rather than a discrete list of good transitions, in what sense is it a form of whitelisting?

This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You're basically saying "optimize this utility the best you can without letting there be an actual impact". However, I actually hadn't thought of this formulation before, and it's plausible it's even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.

I expect that in complex tasks where we don't know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement. The distance measure on transitions could also be traded off with reward (or some other task-specific objective function), so if an action is sufficiently useful for the task, the high reward would dominate the distance penalty.

This would still have offsetting issues though. In the asteroid example, if the agent deflects the asteroid, then future transitions (involving human actions) are very different from default transitions (involving no human actions), so the agent would have an offsetting incentive.

in what sense is it a form of whitelisting?

You're right, it isn't. I should have been more precise:

"Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn't use a whitelist. Instead, it penalizes how dissimilar the observed object transitions are at a time step to those which were counterfactually expected."

I expect that in complex tasks where we don't know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.

I think this failure mode on its own is relatively benign, given querying.

What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (either through the allowed transitions, by skating by on technicalities re: object recognition, etc).

I suspect the/a ideal solution will have far fewer parameters (if any).