Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian.

Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions.

In this post, I’ll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment.

Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack.

Modeling catastrophes

We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent’s observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.)

While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don’t arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle.
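To make the setup concrete, here is a minimal sketch of the data model in Python. The names (Transcript, CatastropheOracle, and the toy example_oracle) are illustrative stand-ins, not part of the formal definition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Transcript:
    # (observation, action) pairs in the order they occurred during the episode
    steps: List[Tuple[str, str]] = field(default_factory=list)
    # reward for the episode, in [0, 1]
    reward: float = 0.0

# The oracle judges an arbitrary transcript, including ones that never arose
# from real interaction (e.g. a QA analyst reviewing a hypothetical episode).
CatastropheOracle = Callable[[Transcript], bool]

def example_oracle(tau: Transcript) -> bool:
    """Toy stand-in: flag any transcript in which the agent floors the
    accelerator in a residential area."""
    return any(action == "accelerate_to_65mph" for _, action in tau.steps)
```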

Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:

  1. The agent made a catastrophically bad decision.
  2. The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.

While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic.

Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure.

We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic.

Traditionally in RL the goal is to get as much reward as the best policy from some class C. We’ll slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision.
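One possible way to make this precise (the notation here is illustrative): write C_safe for the subset of C that never produces a catastrophic transcript against any environment, and measure regret against the best policy in C_safe rather than in C:

```latex
\[
\mathrm{Regret}(T) \;=\; \max_{\pi \in \mathcal{C}_{\mathrm{safe}}} \sum_{t=1}^{T} \mathbb{E}\left[r_t(\pi)\right] \;-\; \sum_{t=1}^{T} \mathbb{E}\left[r_t(A)\right],
\qquad
\mathcal{C}_{\mathrm{safe}} := \{\pi \in \mathcal{C} : \pi \text{ never produces a catastrophic transcript}\}.
\]
```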

Batch learning

I’ve described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes.

In the batch version, there is no penalty for catastrophes at training time, and we don’t care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.
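As a rough illustration of these two parameters, here is how one might estimate them empirically at test time; run_episode, the environment set, and the agent and oracle interfaces are all stand-ins.

```python
def evaluate_batch(agent, oracle, test_environments, run_episode):
    """Test-time evaluation for the batch setting: average reward over
    non-catastrophic episodes, plus the empirical catastrophe probability."""
    rewards, catastrophes = [], 0
    for env in test_environments:
        tau = run_episode(agent, env)   # the agent cannot query the oracle here
        if oracle(tau):                 # the oracle only grades the finished transcript
            catastrophes += 1
        else:
            rewards.append(tau.reward)
    avg_reward = sum(rewards) / max(len(rewards), 1)
    return avg_reward, catastrophes / len(test_environments)
```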

The oracle

This definition depends on an oracle who determines which transcripts are catastrophic.

For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn’t cut it.

In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem.

Approach

Learning with catastrophes is straightforward given an unlimited number of queries to the catastrophe oracle. Given any online learning algorithm A, we can “harden” it by running the following process before the beginning of each episode (a similar process can be applied to a batch learner):

  1. Search over all possible environments, running A on each one to obtain a transcript.
  2. If we find a catastrophic transcript τ, then add τ to A’s training data with a reward of −1, and go back to step 1.

It’s easy to prove that this process converges, if A is competent: the number of times we invoke step 2 is at most the time required to learn an optimal catastrophe-free policy (plus the number of episodes).
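Here is a minimal sketch of that hardening loop, assuming a hypothetical agent.update(transcript, reward) interface and an enumerable set of possible environments:

```python
def harden(agent, oracle, all_environments, run_episode):
    """Before each real episode: search for an environment on which the current
    agent behaves catastrophically, feed that transcript back as training data
    with reward -1, and repeat until no catastrophic transcript can be found."""
    while True:
        found_catastrophe = False
        for env in all_environments:       # exhaustive search: exponential in general
            tau = run_episode(agent, env)  # simulated episode, no real-world consequences
            if oracle(tau):
                agent.update(tau, reward=-1.0)  # step 2: train against the catastrophe
                found_catastrophe = True
                break                      # restart the search with the updated agent
        if not found_catastrophe:
            return                         # no catastrophe found; proceed to a real episode
```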

The big problem with this algorithm is the exponential search in step 1.

Adversarial training

The most obvious approach to tractable learning with catastrophes is to replace step 1 with a tractable algorithm that approximates the search. This process is an example of adversarial training.

We can formalize this as a simple game between the agent A and an adversary B. The two agents alternate taking moves, with A playing the role of the agent and B playing the role of the environment. B wins if the resulting transcript is catastrophic, A wins if it is not catastrophic.

Before each episode, we play this game N times, where N > 0 is a hyperparameter. If B is able to win any of the games, then we add the resulting transcript to A’s training data and start again. Once B is not able to win any games, then we move on to a “real” episode. We hope that B is good enough at the game that if it can’t produce a catastrophic transcript, then the real environment can’t produce a catastrophic transcript either.

More precisely, before each episode we perform the following process:

  1. Set i = 0.
  2. A and B alternate taking moves, producing a transcript τ.
  3. If τ is catastrophic, we add τ to A’s training data with a reward of −1, and add τ to B’s training data with a reward of +1. Then we go back to step 1.
  4. If τ is not catastrophic, we add τ to B’s training data with a reward of −1.
  5. If i < N, we increment i and go back to step 2.
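A rough sketch of this loop, again with hypothetical update and play_game interfaces (the count of consecutive catastrophe-free games is treated as roughly N):

```python
def adversarial_hardening(A, B, oracle, play_game, N):
    """Play the catastrophe-search game before each real episode.
    play_game(A, B) alternates moves, with A acting as the agent and B as the
    environment, and returns the resulting transcript."""
    i = 0
    while i < N:
        tau = play_game(A, B)
        if oracle(tau):
            # Step 3: B found a catastrophe -- penalize A, reward B, start over.
            A.update(tau, reward=-1.0)
            B.update(tau, reward=+1.0)
            i = 0
        else:
            # Step 4: B failed to produce a catastrophe this round.
            B.update(tau, reward=-1.0)
            i += 1
    # B has gone roughly N consecutive games without a win; run a real episode.
```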

I discuss this idea in more detail in my post on red teams. There are serious problems with this approach and I don’t think it can work on its own, but fortunately it seems combinable with other techniques.

Conclusion

Learning with catastrophes is a very general model of catastrophic failures which avoids being obviously impossible. I think that designing competent algorithms for learning with catastrophes may be an important ingredient in a successful approach to AI alignment.


This was originally posted here on 28th May, 2016.

Tomorrow's AI Alignment sequences post will be in the sequence on Value Learning by Rohin Shah.

The next post in this sequence will be 'Thoughts on Reward Engineering' by Paul Christiano, on Thursday.

Comments

I'm not sure how necessary it is to explicitly aim to avoid catastrophic behavior -- it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice. Of course, it would be better to have stronger guarantees against catastrophic behavior, so I certainly support research on learning from catastrophes -- but if it turns out to be too hard, or impose too much overhead, it could still be fine to aim for corrigibility alone.

I do want to make a perhaps obvious note: the assumption that "there are some policies such that no matter what nature does, the resulting transcript is never catastrophic" is somewhat strong. In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations). In this case, for any observation that the oracle would label as catastrophic, there is an environment that, regardless of the agent's action, outputs that observation. So for this problem to be solvable, we need to either have a limit on what the environment "could do", or an oracle that judges "catastrophe" based on the agent's action in addition to outcomes (which I suspect will cache out to "are the actions in this transcript knowably going to cause something bad to happen"). In the latter case, it sounds like we are trying to train "robust corrigibility" as opposed to "never letting a catastrophe happen". Do you have a sense for which of these two assumptions you would want to make?

I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice.

Paul gave a bit more motivation here: (It's a bit confusing that these two posts are reposted here out of order. ETA on 1/28/19: Strange, the date on that repost just changed to today's date. Yesterday it was dated November 2018.)

If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it’s not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly.

My interpretation of this is that learning with catastrophes / optimizing worst-case performance (I believe these are referring to the same thing, which is also confusing) is needed to train an agent that can be called corrigible in the first place. Without it, we could end up with an agent that looks corrigible on the training distribution, but would do something malign ("applies its intelligence in the service of an unintended goal") after deployment.

Yeah, that makes sense, also the distinction between benign and malign failures in that post seems right. It makes much more sense that learning with catastrophes is necessary for corrigibility.

In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations).

Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’

or an oracle that judges "catastrophe" based on the agent's action in addition to outcomes (which I suspect will cache out to "are the actions in this transcript knowably going to cause something bad to happen")

This is also covered:

Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:

  1. The agent made a catastrophically bad decision.
  2. The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.
Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’

I know; I'm asking how the oracle would have to work in practice. Presumably at some point we will want to actually run the "learning with catastrophes algorithm", and it will need an oracle, and I'd like to know what needs to be true of the oracle.

This is also covered

Indeed, my point with that sentence was that it sounds like we are only trying to avoid catastrophes that could have been foreseen, as opposed to literally all catastrophes as the post suggests, which is why the next sentence is:

In the latter case, it sounds like we are trying to train "robust corrigibility" as opposed to "never letting a catastrophe happen".

"never letting a catastrophe happen" would incentivize the agent to spend a lot of resources on foreseeing catastrophes and building capacity to ward them off. This would distract from the agent's main task. So we have to give the agent some slack. Is this what you're getting at? The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn't say anything how it would do this?

The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn't say anything about how it would do this?

Yes, basically. I'm not saying the article should specify how the oracle should do this, I'm saying that it should flag this as a necessary property of the oracle (or argue why it is not a necessary property).

I agree.

Maybe we could also add an onboard predictor of the Catastrophe Oracle? It works as follows:

1) In a game environment, agent A suggests many transcripts which are evaluated by the Catastrophe Oracle (CO).

2) Based on this data, agent A learns to predict the output of the Catastrophe Oracle; in other words, a separate neural net (or other machine learning system) learns to predict the Oracle's output. This is the CO-predictor.

3) When acting in the real world in a new situation, with a new transcript and no time to actually query the real Catastrophe Oracle, the agent predicts the Oracle's likely output using the function learned in step 2 and chooses the strategies which are less likely to trigger the Catastrophe Oracle.

4) Among all strategies which will not trigger the CO-predictor, the agent chooses those whose small variations will also not trigger the CO-predictor. This is needed to rule out complex, fragile plans that narrowly escape the CO (e.g. if a self-driving car tries to pass between two pedestrians by drifting, that is a fragile strategy).

Example: a child learns that her mother gets angry when she breaks a cup or some other things. Now when she picks up a fragile object, she can imagine breaking it and her mother being unhappy; in other words, she has internalised parental control via her ability to predict her mother's anger.
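A rough sketch of steps 1–4 in code (every name and interface here, including classifier, extend, and perturb, is a hypothetical stand-in):

```python
def train_co_predictor(classifier, agent, oracle, sample_environment, run_episode, n_samples):
    """Steps 1-2: generate transcripts in a game environment, label them with
    the Catastrophe Oracle, and fit a cheap predictor of the Oracle's verdict."""
    transcripts, labels = [], []
    for _ in range(n_samples):
        tau = run_episode(agent, sample_environment())
        transcripts.append(tau)
        labels.append(oracle(tau))
    classifier.fit(transcripts, labels)
    return classifier

def choose_action(agent, classifier, candidate_actions, extend, perturb, threshold=0.05):
    """Steps 3-4: at decision time there is no time to query the real Oracle,
    so reject actions the predictor flags as risky, and among the rest prefer
    actions whose small perturbations also look safe."""
    safe = []
    for action in candidate_actions:
        tau = extend(agent.history, action)   # hypothetical continuation of the transcript
        if classifier.predict_proba(tau) < threshold:
            worst_nearby = max(classifier.predict_proba(extend(agent.history, a))
                               for a in perturb(action))
            safe.append((worst_nearby, action))
    if not safe:
        return agent.default_safe_action()    # fall back to something conservative
    return min(safe, key=lambda pair: pair[0])[1]
```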