Open Problems in Negative Side Effect Minimization

Lukas Fluri

There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.

It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?

[-]Fabian Schimpf4yΩ020

Starting more restrictive seems sensible; this could be, as you say, learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attractions in nonlinear control where the ROA is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered.

[+][comment deleted]4y40

[-]Ben Smith3y10

One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions where it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservativism, but I don't recall the framing others have used). In this case, the multi-objectives might be the primary objective and then your low-impact objective.

This might be a way forward to deal with your "High-Impact Interference" problem. Perhaps preventing an agent to engage in high-impact interference is a necessary part of safe AI. When fulfillment of the primary objective seems to require engaging in high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed or not. If the human supervisor makes the judgement the system should proceed, then they can re-specify the objective to permit the potential side effect, by specifying it as part of the primary objective itself.

[-]Fabian Schimpf3y20

Hi Ben, I like the idea, however almost every decision has conflicting outcomes, e.g., regarding opportunity cost. From how I understand you, this would delegate almost every decision to humans if you take the premise of I can't do X if I choose to do Y seriously. I think the application to high-impact interference seems therefore promising if the system is limited to only deciding on a few things. The question then becomes if a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.

[-]Ben Smith3y30

Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if it is higher than a suitable level of confidence, it will go ahead and act without human intervention. But any prediction of what a human would predict could be second-guessed by a human pointing out where the prediction is wrong.

Agreed that whether a human understanding the plan (and all the relevant outcomes. which outcomes are relevant?) is important and harder than I first imagined.

[-]Fabian Schimpf3y10

I think this threshold will be tough to set. Confidence in a decision makes IMO only really sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus, the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack to have a meaningful threshold.

LESSWRONG
LW

LESSWRONG
LW

12

Open Problems in Negative Side Effect Minimization

12

Ω 2

12

Ω 2

Acknowledgments

TLDR;

Introduction

Background

Summary

Anticipated Questions

Goals of Side-Effect Minimization

Open Problems

Side-Effect Minimization Guarantees

Axioms

Conclusion

State-of-the-Art

The General Problem

Potential Future Work

Partial Observability and Chaotic Systems

Axioms

Conclusions

State-of-the-Art

Potential Future Work

High-Impact Interference

Axioms

Conclusion

State-of-the-Art

Potential Future Work

Appendix - Hypothesis: Future Tasks is Unsafe in Multi-Agent Scenario

Recap: How the Future Tasks Algorithm Works:

Main Issue

How This Might Backfire in our High-Impact Interference Scenario:

Axioms

Conclusion