Thanks, this is helpful to me.

I’ve also had discussions at CHAI about whether we should expect humans to have adversarial examples (usually vision, not values).

I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?

I can’t speak to all of the cases, but in the example you point out, you’re asking about a paper whose main author is not on this forum (or at least, I’ve never seen a contribution from him, though he could have a username I don’t know). People are busy, it’s hard to read everything, let alone respond to everything.

Oh, I didn't realize that, and thought the paper was more of a team effort. However as far as I can tell there hasn't been a lot of discussion about the paper online, and the comments I wrote under the AF post might be the only substantial public online comments on the paper, so "let alone respond to everything" doesn't seem to make much sense here.

I certainly expect people’s quick, intuitive value judgments to be horribly wrong anywhere outside of current environments; it’s not obvious to me that’s true for probabilities given by explicit reasoning.

It seems plausible that given enough time and opportunities to discuss with other friendly humans, explicit reasoning can eventually converge upon correct judgments, but explicit reasoning can certainly be wrong very often in the short or even medium run, and even eventual convergence might happen only for a small fraction of all humans who are especially good at explicit reasoning. I think there are also likely analogies to adversarial examples for explicit human reasoning, in the form of arguments that are extremely persuasive but wrong.

But I guess humans do have a safety mechanism in that system 1 and system 2 can cross-check each other and make us feel confused when they don't agree. But that doesn't always work since these systems can be wrong in the same direction (motivated cognition is a thing that happens pretty often) or one system can be so confident that it overrides the other. (Also it's not clear what the safe thing to do is (in terms of decision making) when we are confused about our values.)

This safety mechanism may work well enough to prevent exploitation of adversarial examples by other humans a lot of the time, but seems unlikely to hold up under heavier optimization power. (You could perhaps consider things like Nazism, conspiracy theories, and cults to be examples of successful exploitation by other humans.)

(I wonder if people have tried to apply the idea of heterogeneous systems cross-checking each other to adversarial examples in ML. Have you seen any literature on this?)

I think that to the extent you are hoping to scale up safety in parallel with capabilities, which feels a lot more tractable than solving the full problem in one go, this is not a problem you have to deal with yet, and you can outsource it to the future.

I guess that makes sense, but even then there should at least be an acknowledgement that the problem exists and needs to be solved in the future?

I noticed that in your recent FLI interview, you discussed applying the idea of distributional shifts to human values. Did you write about that or talk to anyone about it before I made my posts?

I had not written about it, but I had talked about it before your posts. If I remember correctly, I started finding the concept of distributional shifts very useful and applying it to everything around May of this year. Of course, I had been thinking about it recently because of your posts so I was more primed to bring it up during the podcast.

so "let alone res
... (read more)

Two Neglected Problems in Human-AI Safety

by Wei_Dai 1 min read16th Dec 201823 comments


Ω 24

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In this post I describe a couple of human-AI safety problems in more detail. These helped motivate my proposed hybrid approach, and I think need to be addressed by other AI safety approaches that currently do not take them into account.

1. How to prevent "aligned" AIs from unintentionally corrupting human values?

We know that ML systems tend to have problems with adversarial examples and distributional shifts in general. There seems to be no reason not to expect that human value functions have similar problems, which even "aligned" AIs could trigger unless they are somehow designed not to. For example, such AIs could give humans so much power so quickly or put them in such novel situations that their moral development can't keep up, and their value systems no longer apply or give essentially random answers. AIs could give us new options that are irresistible to some parts of our motivational systems, like more powerful versions of video game and social media addiction. In the course of trying to figure out what we most want or like, they could in effect be searching for adversarial examples on our value functions. At our own request or in a sincere attempt to help us, they could generate philosophical or moral arguments that are wrong but extremely persuasive.

(Some of these issues, like the invention of new addictions and new technologies in general, would happen even without AI, but I think AIs would likely, by default, strongly exacerbate the problem by differentially accelerating such technologies faster than progress in understanding how to safely handle them.)

2. How to defend against intentional attempts by AIs to corrupt human values?

It looks like we may be headed towards a world of multiple AIs, some of which are either unaligned, or aligned to other owners or users. In such a world there's a strong incentive to use one's own AIs to manipulate other people's values in a direction that benefits oneself (even if the resulting loss to others are greater than gains to oneself).

There is an apparent asymmetry between attack and defense in this arena, because manipulating a human is a straightforward optimization problem with an objective that is easy to test/measure (just check if the target has accepted the values you're trying to instill, or has started doing things that are more beneficial to you), and hence relatively easy for AIs to learn how to do, but teaching or programming an AI to help defend against such manipulation seems much harder, because it's unclear how to distinguish between manipulation and useful information or discussion. (One way to defend against such manipulation would be to cut off all outside contact, including from other humans because we don't know whether they are just being used as other AIs' mouthpieces, but that would be highly detrimental to one's own moral development.)

There's also an asymmetry between AIs with simple utility functions (either unaligned or aligned to users who think they have simple values) and AIs aligned to users who have high value complexity and moral uncertainty. The former seem to be at a substantial advantage in a contest to manipulate others' values and protect one's own.


Ω 24