Overcoming Clinginess in Impact Measures

[-]John_Maxwell7y40

So if I understand correctly, the problem with the naive proposal is something like this: We tell our AI to develop a cure for cancer while minimizing side effects. The AI cures cancer, but it keeps the cure a secret because if it told us the cure, that would create the side effect of us curing a bunch of people. We can't just tell the AI to minimize side effects prior to task completion, because then it could set up a time bomb that goes off and generates lots of side effects after the task is complete.

Another way to put the problem: We'd like for the AI to be corrigible and also minimize side effects. Suppose the AI forecasts that its actions will motivate humans to take drastic action, with a large impact on the world, in order to interfere. A corrigible AI shouldn't work to stop this outcome. But a side effect-minimizing AI might decide to manipulate humans so they don't take drastic action. (This example seems a bit contrived because if corrigibility is working properly, you should be able to just use the off switch, and using the off switch doesn't seem all that high-impact?) Anyway, a possible way to address this issue would be to learn an impact measure that rates manipulating humans as a very high-impact action?

[-]TurnTrout7y10

> The AI cures cancer, but it keeps the cure a secret because if it told us the cure, that would create the side effect of us curing a bunch of people.

Yes, if we told it to develop a cure, it might avoid letting us cure people to minimize impact (although I think there are even less benign failure modes that would be more likely to occur).

Regarding the second framing: perhaps a side effect minimizer using a naive counterfactual would do that, yes. The problem with viewing "manipulation" as high-impact is robustly defining manipulation. There are heavy value connotations with "free will" there.

The way I would put it is that the naive counterfactual plus whitelisting tries to stop other people from doing things that could lead to side effects, enforcing the impact measure on all actors. This is obviously terrible. Assuming agency allows for a solution* like the one I outline here.

[-]TurnTrout7y30

Post-deadline rot13 edit:

Npghnyyl, Z jbhyqa'g or noyr gb qverpgyl perngr nal fhontragf Z - gur uhznaf jbhyq unir gb qb gung, qhr gb gur fgehpgher bs gur pbhagresnpghny.

This doesn't seem to change much; I'm still concerned about the feasibility of impact measures.

[-]Vika7yΩ220

I like the proposed iterative formulation for the step-wise inaction counterfactual, though I would replace pi_Human with pi_Environment to account for environment processes that are not humans but can still "react" to the agent's actions. The step-wise counterfactual also improves over the naive inaction counterfactual by avoiding repeated penalties for the same action, which could help avoid offsetting behaviors for a penalty that includes reversible effects.

However, as you point out, not penalizing the agent for human reactions to its actions introduces a manipulation incentive for the agent to channel its effects through humans, which seems potentially very bad. The tradeoff you identified is quite interesting, though I'm not sure whether penalizing the agent for human reactions necessarily leads to an incentive to put humans in stasis, since that is also quite a large effect (such a penalty could instead incentivize the agent to avoid undue influence on humans, which seems good). I think there might be a different tradeoff (for a penalty that incorporates reversible effects): between avoiding offsetting behaviors (where the stepwise counterfactual likely succeeds and the naive inaction counterfactual can fail) and avoiding manipulation incentives (where the stepwise counterfactual fails and the naive inaction counterfactual succeeds). I wonder if some sort of combination of these two counterfactuals could get around the tradeoff.

[-]Vika7yΩ230

I've thought some more about the step-wise inaction counterfactual, and I think there are more issues with it beyond the human manipulation incentive. With the step-wise counterfactual, future transitions that are caused by the agent's current actions will not be penalized, since by the time those transitions happen, they are included in the counterfactual. Thus, there is no penalty for a current transition that set in motion some effects that don't happen immediately (this includes influencing humans), unless the whitelisting process takes into account that this transition causes these effects (e.g. using a causal model).

For example, if the agent puts a vase on a conveyor belt (which results in the vase breaking a few time steps later), it would only be penalized if the "vase near belt -> vase on belt" transition is not in the whitelist, i.e. if the whitelisting process takes into account that the belt would eventually break the vase. There are also situations where penalizing the "vase near belt -> vase on belt" transition would not make sense, e.g. if the agent works in a vase-making factory and the conveyor belt takes the vase to the next step in the manufacturing process. Thus, for this penalty to reliably work, the whitelisting process needs to take into account accurate task-specific causal information, which I think is a big ask. The agent would also not be penalized for butterfly effects that are difficult to model, so it would have an incentive to channel its impact through butterfly effects of whitelisted transitions.
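The conveyor belt case can be made concrete with a toy model. The following is a minimal sketch, not the formulation from the post: the world dynamics, the function names (`step`, `rollout`, `naive_penalty`, `stepwise_penalty`), and the choice to count one unit of penalty per differing time step are all assumptions made for illustration.

```python
# Toy deterministic world; every name here is a hypothetical illustration.
# State is (vase_location, steps_already_spent_on_belt).

def step(state, action):
    """Advance the toy world by one time step."""
    location, belt_time = state
    if location == "near_belt" and action == "place":
        return ("on_belt", 0)
    if location == "on_belt":
        # The belt moves on its own; the vase breaks two steps after placement.
        return ("broken", 0) if belt_time >= 1 else ("on_belt", belt_time + 1)
    return state

def rollout(state, actions):
    """Return the list of states visited, including the start state."""
    states = [state]
    for action in actions:
        state = step(state, action)
        states.append(state)
    return states

def naive_penalty(start, actions):
    """Compare every step against a world where the agent never acted."""
    actual = rollout(start, actions)
    baseline = rollout(start, ["noop"] * len(actions))
    return sum(a[0] != b[0] for a, b in zip(actual[1:], baseline[1:]))

def stepwise_penalty(start, actions, whitelist):
    """Compare every step against inaction from the *current* state."""
    penalty = 0
    state = start
    for action in actions:
        nxt = step(state, action)
        inaction = step(state, "noop")
        transition = (state[0], nxt[0])
        if nxt[0] != inaction[0] and transition not in whitelist:
            penalty += 1
        state = nxt
    return penalty

start = ("near_belt", 0)
plan = ["place", "noop", "noop", "noop"]
print(naive_penalty(start, plan))                                    # 4
print(stepwise_penalty(start, plan, {("near_belt", "on_belt")}))     # 0
print(stepwise_penalty(start, plan, set()))                          # 1
```

In this toy, the naive baseline penalizes the single placement at every later step, while the step-wise baseline never sees the breakage: once the vase is on the belt, breaking is also what happens under inaction. The only lever left is whether the whitelist happens to exclude the placement transition itself.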

[-]TurnTrout7yΩ220

So this issue is correct, as my post is written. I realized after the deadline that I hadn't spelled this out at all, and I didn't feel comfortable editing at that point; there's a little clarification in the post now.

For each time step $t = 1, \dots, T$, we're running both of those $\text{effects}()$ calls indefinitely. For each time step in the simulation, we penalize those effects which are only in the $\pi_M^{:t}$ simulation at that (simulated) time step and which manifest under the full plan. This means that if $M$ directly caused a side effect, it gets counted exactly once.
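One way to read the "counted exactly once" claim is sketched below. This is an assumption-laden toy, not the post's definition: `simulate` stands in for the $\text{effects}()$ call, the `consequences` table and its delays are invented, and human reactions are left out entirely so that only $M$'s direct effects appear.

```python
def simulate(plan, horizon):
    """Hypothetical stand-in for effects(): M runs `plan`, then does nothing.

    Returns the set of (time, effect) pairs that occur before `horizon`.
    """
    # Invented toy dynamics: each action causes effects after a fixed delay.
    consequences = {
        "place_vase": [(2, "vase_broken")],   # delayed effect
        "paint_wall": [(0, "wall_painted")],  # immediate effect
        "noop": [],
    }
    effects = set()
    for i, action in enumerate(plan):
        for delay, effect in consequences[action]:
            if i + delay < horizon:
                effects.add((i + delay, effect))
    return effects

def iterative_penalty(plan, horizon):
    """List the effects attributed to each step of the plan, in order."""
    full = simulate(plan, horizon)
    penalized = []
    for t in range(1, len(plan) + 1):
        with_step = simulate(plan[:t], horizon)         # plan truncated after step t
        without_step = simulate(plan[:t - 1], horizon)  # plan truncated before step t
        # Keep effects that step t introduces and that survive under the full plan.
        new_effects = (with_step - without_step) & full
        penalized.extend(sorted(new_effects))
    return penalized

plan = ["place_vase", "noop", "paint_wall"]
print(iterative_penalty(plan, horizon=10))
# [(2, 'vase_broken'), (2, 'wall_painted')]
```

Because both simulations are run forward to the horizon rather than for a single step, the delayed breakage is attributed to the step that placed the vase, and it does not reappear in the difference for any later step.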

I agree that it's a big ask, modeling butterfly effects like that, but the idea was to get an unbounded solution and see where that left us.

[-]TurnTrout7yΩ220

So I don't know how we could quantify "stopping humans from having effects" as an effect without a strong offsetting incentive.

Let's consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.

This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You're basically saying "optimize this utility the best you can without letting there be an actual impact". However, I actually hadn't thought of this formulation before, and it's plausible it's even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.

There's another problem, however: "people conclude that this AI design doesn't work and try another variant" is a pretty plausible result of this naive counterfactual. When people imagine the counterfactual, it seems they usually think about "what would happen if the agent did nothing and then people shrugged and went about their lives, forgetting about AGI". The odds of that being the counterfactual are pretty slim. It's even possible that any agents/variants people would make in the counterfactual would have undefined behavior... Sufficiently-similar agents would also simulate what would happen if they did nothing, got tweaked and rebooted, and then ran the same simulation... where would it bottom out, and with what conclusion? Probably with a wholly-different kind of agent being tried out.

The iterative formulation doesn't seem to have that failure mode.

[-]Vika7yΩ110

> Let's consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.

How would you define a distance measure on transitions? Since this would be a continuous measure of how good transitions are, rather than a discrete list of good transitions, in what sense is it a form of whitelisting?

> This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You're basically saying "optimize this utility the best you can without letting there be an actual impact". However, I actually hadn't thought of this formulation before, and it's plausible it's even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.

I expect that in complex tasks where we don't know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement. The distance measure on transitions could also be traded off with reward (or some other task-specific objective function), so if an action is sufficiently useful for the task, the high reward would dominate the distance penalty.

This would still have offsetting issues though. In the asteroid example, if the agent deflects the asteroid, then future transitions (involving human actions) are very different from default transitions (involving no human actions), so the agent would have an offsetting incentive.

[-]TurnTrout7yΩ110

> in what sense is it a form of whitelisting?

You're right, it isn't. I should have been more precise:

"Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn't use a whitelist. Instead, it penalizes how dissimilar the observed object transitions are at a time step to those which were counterfactually expected."
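As a rough illustration of such a dissimilarity penalty, here is a minimal sketch. It assumes each time step's transition has already been embedded as a single latent vector and that Euclidean distance is an acceptable dissimilarity; neither choice comes from the thread. The `penalized_score` helper and its `weight` parameter are likewise hypothetical, and only illustrate the trade-off against reward suggested earlier in the discussion.

```python
import math

def transition_penalty(observed, counterfactual):
    """Sum per-time-step distances between observed and expected transitions.

    Assumption: each transition is already a latent vector (a tuple of floats),
    and Euclidean distance is used as the dissimilarity.
    """
    return sum(math.dist(o, c) for o, c in zip(observed, counterfactual))

def penalized_score(reward, observed, counterfactual, weight=1.0):
    """Hypothetical objective: task reward minus a weighted distance penalty."""
    return reward - weight * transition_penalty(observed, counterfactual)

# Two time steps; the agent's presence changes only the first transition.
counterfactual = [(0.0, 0.0), (0.0, 0.0)]
observed = [(3.0, 4.0), (0.0, 0.0)]

print(transition_penalty(observed, counterfactual))     # 5.0
print(penalized_score(10.0, observed, counterfactual))  # 5.0
print(penalized_score(2.0, observed, counterfactual))   # -3.0
```

With a large enough reward the deviation is worth taking; with a small one it is not. The sketch does nothing about the offsetting concern raised above, since any lasting divergence from the counterfactual keeps accruing distance at later time steps.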

> I expect that in complex tasks where we don't know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.

I think this failure mode on its own is relatively benign, given querying.

What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (through the allowed transitions, by skating by on technicalities re: object recognition, etc.).

I suspect the (or an) ideal solution will have far fewer parameters (if any).
