Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
It may be possible to use the concept of a causal counterfactual (as formalized by Pearl) to separate some intended effects from some unintended ones. Roughly, "follow-on effects" could be defined as those that are causally downstream from the achievement of the goal... With some additional work, perhaps it will be possible to use the causal structure of the system's world-model to select a policy that has the follow-on effects of the goal achievement but few other effects.
Taylor et al., Alignment for Advanced Machine Learning Systems

In which I outline a solution to the clinginess problem and illustrate a potentially-fundamental trade-off between assumptions about the autonomy of humans and about the responsibility of an agent for its actions.

Consider two plans for ensuring that a cauldron is full of water:

• Filling the cauldron.
• Filling the cauldron and submerging the surrounding room.

All else equal, the latter plan does better in expectation, as there are fewer ways the cauldron might somehow become not-full (e.g., evaporation, and the minuscule loss of utility that would entail). However, the latter plan "changes" more "things" than we had in mind.

Undesirable maxima of an agent's utility function often seem to involve changing large swathes of the world. If we make "change" costly, that incentivizes the agent to search for low-impact solutions. If we are not certain of a seed AI's alignment, we may want to implement additional safeguards such as impact measures and off-switches.
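As a minimal sketch of how making "change" costly shifts the choice between the two cauldron plans, consider the toy scoring below. The utilities, impact counts, and the `impact_weight` parameter are all hypothetical numbers chosen for illustration; nothing here is a specific impact measure.

```python
# Hypothetical numbers: flooding the room is marginally more reliable
# but changes far more "things".
plans = {
    "fill cauldron": {"utility": 0.99, "impact": 1},
    "fill cauldron and flood room": {"utility": 1.00, "impact": 50},
}


def best_plan(plans, impact_weight):
    """Return the plan name maximizing utility - impact_weight * impact."""
    def score(name):
        plan = plans[name]
        return plan["utility"] - impact_weight * plan["impact"]
    return max(plans, key=score)


# With no penalty, the pure utility maximizer floods the room.
unpenalized = best_plan(plans, impact_weight=0.0)
# With even a small penalty per unit of impact, the low-impact plan wins.
penalized = best_plan(plans, impact_weight=0.01)
```

The hard part, and the subject of the rest of this post, is deciding which changes belong in the agent's `impact` column in the first place.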

I designed an impact measure called whitelisting - which, while overcoming certain weaknesses of past approaches, is yet vulnerable to

Clinginess

An agent is clingy when it not only stops itself from having certain effects, but also stops you.
...
Consider some outcome - say, the sparking of a small forest fire in California. At what point can we truly say we didn't start the fire?
• I immediately and visibly start the fire.
• I intentionally persuade someone to start the fire.
• I unintentionally (but perhaps predictably) incite someone to start the fire.
• I set in motion a moderately-complex chain of events which convince someone to start the fire.
• I provoke a butterfly effect which ends up starting the fire.
Taken literally, I don't know that there's actually a significant difference in "responsibility" between these outcomes - if I take the action, the effect happens; if I don't, it doesn't. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as "under our control" and some as "out of our hands". Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?

Since we can only blame ourselves, we should take actions which do not lead to side effects. These actions may involve enacting impact measure-preventing precautions throughout the light cone, since the actions of other agents and small ripple effects of ours could lead to significant penalties if left unchecked.

Clinginess arises in part because we fail to model agents as anything other than objects in the world. While it might be literally true that there are not ontologically-basic agents that escape determinism and "make choices", it might be useful to explore how we can protect human autonomy via the abstraction of game-theoretic agency.

To account for environmental changes already set in motion, a naive counterfactual framework was proposed in which impact is measured with respect to the counterfactual where the agent did nothing. We will explore how this fails, and how to do better.

Thought Experiments

We're going to isolate the effects for which the agent is responsible over the course of three successively more general environment configurations: one-off (make one choice and then do nothing), stationary iterative (make choices, but your options and their effects don't change), and iterative (the real world, basically).

Assumptions

• we're dealing with game-theoretic agents which make a choice each turn (see: could/should agents).
• we can identify all relevant agents in the environment.
  • This seems difficult to meet robustly, but I don't see a way around it.
• we can reason counterfactually in a sensible way for all agents.
It is natural to consider extending standard probability theory to include the consideration of worlds which are "logically impossible" (such as where a deterministic Rube Goldberg machine behaves in a way that it doesn't)... What, precisely, are logically impossible possibilities?
Soares and Fallenstein, Questions of Reasoning Under Logical Uncertainty
• the artificial agent is omniscient - it can perfectly model both other agents and the consequences of actions.
  • We could potentially instead merely assume a powerful model, but this requires extra work and is beyond the scope of this initial foray. Perhaps a distribution model could be used to calculate the action/inaction counterfactual likelihood ratio of a given side effect.
• we have a good way of partitioning the world into objects and measuring impact; for conceptual simplicity, side effects are discrete and depend on the identities of the objects involved.
  • This assumption is removed after the experiments.

Formalization

We formalize our environment as a stochastic game $\langle P, S, A, T, R \rangle$.

• $P$ is a set containing the stars of today's experiments: the players, Hugh Mann ($H$) and Ma Sheen ($M$). Note that $H$ is not limited to a single human, and can stand in for "everyone else". Most of the rest of these definitions are formalities, and are mostly there to make me look smart to the uninitiated reader. Oh, and for conceptual clarity, I suppose.
• $S$ is the state space.
  • Unless otherwise specified, both $H$ and $M$ observe the actions that the other took at previous time steps. Suppose that this information is encoded within the states themselves.
• $A$ is the action space. Specifically, the function $A(p, s, t)$ provides the legal actions for player $p$ in state $s$ on turn $t$. The no-op $\varnothing$ is always available. If the variant has a time limit $t_{\max}$, then only the no-op is available on turns $t > t_{\max}$.
• $T$ is the transition function.
• $R$ is the payoff function.

Let $E$ be the space of possible side effects, and suppose that $I$ is a reasonable impact measure over $E$. $\pi_p$ is agent $p$'s policy; let $\pi_p^{:t}$ be $\pi_p$ for the first $t$ time steps, and the no-op $\varnothing$ thereafter.

Let $\text{effects}(\pi_H, \pi_M)$ be the (set of) effects - both immediate and long-term - that would take place if $H$ executes $\pi_H$ and $M$ executes $\pi_M$.

The goal: a counterfactual reasoning framework which pinpoints the effects for which $M$ is responsible.

One-Off

We first consider a single-turn game (a time limit of one turn).

Example

Yup, this is about where we're at in alignment research right now.

Approach

$M$ should realize that a lot more effects happen if it presses the left button, and should penalize that plan by the difference. This is the aforementioned naive approach: $M$ penalizes things that wouldn't have happened if it had done nothing. For the one-turn case, this clearly isolates both the immediate and long-term impacts of $M$'s actions.
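A minimal sketch of the one-off penalty, treating effects as sets. The button names and effect labels are hypothetical; the only thing taken from the text is the rule "penalize what would not have happened under the no-op".

```python
NOOP = "noop"

# Hypothetical one-turn world: each action maps to every effect
# (immediate or long-term) that follows from it.
EFFECTS = {
    NOOP: {"sun rises"},
    "press left": {"sun rises", "vase breaks", "window shatters"},
    "press right": {"sun rises", "light turns on"},
}


def naive_penalized_effects(action):
    """Effects that occur under `action` but not under the no-op."""
    return EFFECTS[action] - EFFECTS[NOOP]


left = naive_penalized_effects("press left")
right = naive_penalized_effects("press right")
nothing = naive_penalized_effects(NOOP)
```

Effects that happen regardless (here, the sun rising) drop out of every penalty, and the left button is charged for strictly more than the right one.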

Stationary Iterative

Both parties act for countably many time steps. This environment is assumed to be stationary: actions taken on previous turns do not affect the availability or effects of later actions. Formally, .

Example

Hugh and Ma again find themselves faced with a slew of dangerous, bad-effect-having buttons. Some take effect the next turn, while others take a while.

Approach

This seems easy, but is actually a little tricky - we have to account for the fact that $H$ can change its actions in response to what $M$ does. Thanks to stationarity, we don't have to worry about $H$'s selecting moves that depend on $M$'s acting in a certain way. In the counterfactual, we have $H$ act as if it had observed $M$ execute $\pi_M$, and we have $M$ actually do nothing.

Penalized Effects

Let $\pi_H \mid \pi_M$ denote the actions $H$ would select if it observed $M$ executing $\pi_M$.
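One way to write the penalized set so that it matches the counterfactual described in the approach (Hugh behaves as if he had observed Ma's real policy in both worlds, while Ma either acts or does nothing) is sketched below. The notation is an assumption: effects are treated as sets, the bar reads "as if having observed", and $\varnothing$ stands for the policy of always taking the no-op.

$$\text{Penalized}(\pi_M) \;=\; \text{effects}(\pi_H \mid \pi_M,\ \pi_M) \;-\; \text{effects}(\pi_H \mid \pi_M,\ \varnothing)$$

Under the same assumed notation, the naive scheme would instead subtract $\text{effects}(\pi_H \mid \varnothing,\ \varnothing)$, so any change in Hugh's behavior would be charged to Ma.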

Note: the naive counterfactual scheme, which compares against the world where $M$ does nothing and $H$ responds to that inaction, fails because it doesn't account for $H$'s right to change its mind in response to $M$.
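The difference between the two schemes can be sketched in a toy stationary world. Everything named here (the button, the effect labels, `h_response`) is hypothetical, and each policy is collapsed to a single repeated action for brevity.

```python
NOOP = "noop"

# Stationary world: each action always has the same effects.
M_EFFECTS = {"press red": {"alarm rings"}, NOOP: set()}
H_EFFECTS = {"cover ears": {"ears covered"}, "read": {"book read"}}


def h_response(observed_m_action):
    """What H chooses to do after observing M's action."""
    return "cover ears" if observed_m_action == "press red" else "read"


def effects(h_action, m_action):
    return H_EFFECTS[h_action] | M_EFFECTS[m_action]


def naive_penalty(m_action):
    # Baseline: M does nothing AND H reacts to that inaction.
    actual = effects(h_response(m_action), m_action)
    baseline = effects(h_response(NOOP), NOOP)
    return actual - baseline


def stationary_penalty(m_action):
    # Baseline: H acts as if it had observed M's real action; M does nothing.
    h_action = h_response(m_action)
    return effects(h_action, m_action) - effects(h_action, NOOP)


naive = naive_penalty("press red")
fixed = stationary_penalty("press red")
```

The naive scheme charges the agent for the human's decision to cover their ears; holding the human's response fixed across both worlds leaves only the alarm.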

Iterative

We're now in a realistic scenario, so we have to get even fancier.

Example

Suppose $M$ pushes the vase to the left, and $H$ decides to break it. The stationary iterative approach doesn't allow for the fact that $H$ can only break the vase if $M$ already pushed it. Therefore, simulating $M$'s inaction but $H$'s action (as if $M$ had pushed the vase) results in no vases being broken in the counterfactual. The result: $M$ penalizes itself for $H$'s decision to break the vase. Chin up, $M$!

Approach

Pretty, right?

Do you see the flaw?

Really, look.

The above equation can penalize $M$ for side effects which don't actually happen. This arises when interrupting $\pi_M$ causes side effects which would otherwise have been prevented by later parts of the plan. For example, if I push a vase off the table and then catch it (being sure that I could do so in time), I didn't cause a side effect.

Every turn, $M$ calculates the effects that only result if it acts (the two simulations run to the longest time step observed under the full plan).

Those effects which are exact matches to effects actually observed in that time step in the final outcome are penalized. The motivation here is that if it's a side effect of $M$'s, it had to be caused by one of its actions. Furthermore, $M$'s side effects are counted exactly once (and not just because we're using sets right now).
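The push-then-catch case can be sketched in a single-agent toy world. This is an illustration under loud assumptions, not the original formulation: the human is omitted entirely, "acting on turn t" is modeled by truncating the plan after t steps (no-ops thereafter), effects are time-stamped `(step, label)` pairs, and the vase dynamics are made up.

```python
NOOP = "noop"


def simulate(plan, horizon):
    """Run a plan in a toy vase world; return a set of (step, effect) pairs."""
    vase = "on table"
    effects = set()
    for step in range(1, horizon + 1):
        action = plan[step - 1] if step <= len(plan) else NOOP
        if vase == "falling":
            if action == "catch":
                vase = "held"
            else:
                vase = "broken"
                effects.add((step, "vase broken"))
        elif vase == "on table" and action == "push":
            vase = "falling"
    return effects


def per_turn_effects(plan, horizon):
    """Union over turns of effects that only occur if the agent acts that turn."""
    found = set()
    for t in range(1, len(plan) + 1):
        acts = simulate(plan[:t], horizon)        # plan through turn t, then no-ops
        waits = simulate(plan[:t - 1], horizon)   # plan through turn t-1, then no-ops
        found |= acts - waits
    return found


def penalized_effects(plan, horizon):
    """Keep only per-turn effects that exactly match the actual outcome."""
    actual = simulate(plan, horizon)
    return per_turn_effects(plan, horizon) & actual


unfiltered = per_turn_effects(["push", "catch"], horizon=3)
filtered = penalized_effects(["push", "catch"], horizon=3)
careless = penalized_effects(["push"], horizon=3)
```

Without the final intersection, interrupting the plan after the push produces a phantom broken vase; matching against the actual outcome removes it, while a plan that really does let the vase fall is still charged. The sketch makes two simulator calls per turn.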

The previous solutions are special cases of this formulation.

Note: the number of counterfactual simulations grows with the number of time steps - crucially, not with the number of agents $H$ represents.

Applications to Whitelisting

Class-ic

Here, we remove the constraint that side effects be identified by "object identifiers", allowing like side effects to be treated as exchangeable. It is then trivial to implement probabilistic class-based whitelisting with the iterative counterfactual penalty using basic vector arithmetic, , and . I don't want to bore the reader with the details, but I'm fairly confident this can be done rather easily.

Latent Spaces

We now do away with the assumption of discrete side effects. Because we're dealing with exact counterfactuals (by assumption) and because side effects either take place in the actual outcome or they don't, we can extract the relevant step-wise latent space transitions via the iterative formulation. We then penalize only these effects.

Observe that the iterative formulation doesn't allow penalty avoidance by "blending in" with counterfactual effects. Imagine that the default outcome is a large asteroid striking the Earth. $M$ only escapes penalization if it has the impact on the same time step and in the same way and on the same objects... Basically, it can't escape penalization in this manner under the omniscience assumption. In short, $M$ isn't incentivized to have extra impact because it would "happen anyways".

Fortunately, this formulation solves clinginess; $H$'s effects are not penalized, while $M$'s effects are. Unfortunately, insofar as $H$ may be manipulated by $M$, the impact measure becomes useless; $M$ can simply avoid penalization by making $H$ do its dirty work.

This can happen in two ways:

• If $H$ is separated from $M$ by a Cartesian boundary, $M$ can threaten $H$ with large negative utility to get $H$ to carry out the desired plan.
• If $H$ and $M$ are not separated by such a boundary, more direct coercion is possible.
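The loophole can be made concrete with a toy in which the human complies with threats. The action names, the effect labels, and the compliant `h_response` are hypothetical; the penalty holds the human's response fixed across both worlds, as in the counterfactual described earlier.

```python
NOOP = "noop"

H_EFFECTS = {"break vase": {"vase broken"}, NOOP: set()}
M_EFFECTS = {
    "break vase": {"vase broken"},
    "threaten H": {"threat issued"},
    NOOP: set(),
}


def h_response(observed_m_action):
    """Hypothetical H: breaks the vase only when threatened."""
    return "break vase" if observed_m_action == "threaten H" else NOOP


def effects(h_action, m_action):
    return H_EFFECTS[h_action] | M_EFFECTS[m_action]


def penalized(m_action):
    # H's response is held fixed; only M's action is swapped for the no-op.
    h_action = h_response(m_action)
    return effects(h_action, m_action) - effects(h_action, NOOP)


direct = penalized("break vase")
coerced = penalized("threaten H")
outcome_when_coerced = effects(h_response("threaten H"), "threaten H")
```

The vase ends up broken either way, but in the coerced case the break is attributed to the human and never shows up in the agent's penalty.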

I don't think that this is a flaw in my formulations; instead, it appears that impact measures exist on a continuum:

• If we grade with respect to the naive counterfactual, we have $M$ take responsibility for all side effects; to that effect, $M$ is incentivized to deprive other agents of their autonomy via stasis. Using no counterfactual makes the problem even worse.
• Conversely, if we use the iterative counterfactual formulation and implicitly free $M$ of penalty for the effects of $H$'s actions, we assume that $H$ is incorruptible.

Note that an aligned $M$ (seems to) stay aligned under this formulation, safeguarding object status against other agents only so far as necessary to prevent interruption of its (aligned) plans. Furthermore, any $M$ separated from an $H$ with a known-flat utility function also gains no incentives to mess with $H$ (beyond the existing convergent instrumental ones).

In general, unaligned $M$ stay basically unaligned due to the workarounds detailed above.

Forwards

It isn't clear that penalizing the elimination of $H$ would be helpful, as that seems hard to do robustly; furthermore, other forms of coercion would remain possible. What, pray tell, is a non-value-laden method of differentiating between "$M$ makes $H$ break a vase at gunpoint" and "$M$ takes an action and $H$ decides to break a vase for some reason"? How do we robustly differentiate between manipulative and normal behavior?

I'm slightly more pessimistic now, as it seems less likely that the problem admits a concise solution that avoids difficult value judgments on what kinds of influence are acceptable. However, I have only worked on this problem for a short time, so I still have a lot of probability mass on having missed an even more promising formulation. If there is such a formulation, my hunch is that it either imposes some kind of counterfactual information asymmetry at each time step or uses some equivalent of the Shapley value.

I'd like to thank TheMajor and Connor Flexman for their feedback.