We believe the AI safety community needs to invest research effort in the human side of AI alignment. Many of the uncertainties involved are empirical, and can only be answered by experiment. They relate to the psychology of human rationality, emotion, and biases. Critically, we believe investigations into how people interact with AI alignment algorithms should not be held back by the limitations of existing machine learning. Current AI safety research is often limited to simple tasks in video games, robotics, or gridworlds, but problems on the human side may only appear in more realistic scenarios such as natural language discussion of value-laden questions. This is particularly important since many aspects of AI alignment change as ML systems increase in capability.


While I entirely agree with the title of this essay, as a social scientist, I am not sure that this is really a social science agenda (yet).

The biggest issue is the treatment of human values as a "species-level" set of decision algorithms: this is orthogonal to the projects of anthropology and sociology (which broadly reject the idea of an essential core of human ethics). Trivially, the research here looks like it would produce a potentially misleading average result across cultural value systems, which might well not usefully represent either any individual system or produce a universally-acceptable "compromise" result in those cases which would be processed differently in different cultural frames.
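The worry about averaging across value systems can be made concrete with a toy numeric sketch (the groups, cases, and numbers are all invented for illustration):

```python
# Toy illustration: averaging judgments across two hypothetical cultural
# groups can yield a "compromise" verdict that neither group endorses.
# All values are invented for illustration.

# Each group rates three contested cases on a -1 (forbid) .. +1 (permit) scale.
group_a = {"case1": 1.0, "case2": -1.0, "case3": 1.0}
group_b = {"case1": -1.0, "case2": 1.0, "case3": 1.0}

average = {case: (group_a[case] + group_b[case]) / 2 for case in group_a}
print(average)  # {'case1': 0.0, 'case2': 0.0, 'case3': 1.0}

# On case1 and case2 the average is 0.0: a "neutral" verdict that
# misrepresents both groups, each of which holds a strong (opposite) view.
# Only case3, where the groups already agree, is represented faithfully.
```

The averaged result looks like a species-level consensus but corresponds to no actual value system wherever the frames genuinely diverge.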

Second, the use of the "debate" approach looks like a Schelling game mechanism - essentially the same adjudication "game" used in numerous cryptoeconomics schemes, notably Kleros and Aragon. Following that line of logic, what this proposal is really doing is training an AI to win Schelling bets: the AI is betting, like jurors in the Kleros project, that their "vote" will align with those of the human supervisors. This is a narrow field of value judgement, anthropologically speaking. Moreover, it is worth noting that there has been substantive progress in the use of ML systems to predict the outcome of litigation; however, I do not think that anyone working in that field would classify their work as progress towards friendly AI, and rightly so. Thus the question remains of whether an AI which was capable of producing consistently Schelling point-aligned judgements was actually running an "ethical decision algorithm" properly so called, or just a well-trained "case outcome predictor."
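The Schelling-bet structure described above, in which jurors are rewarded when their vote matches the majority, can be sketched as a payoff rule. This is a hypothetical simplification in the spirit of Kleros-style mechanisms, not an implementation of any real protocol:

```python
from collections import Counter

def schelling_payoffs(votes, stake=1.0):
    """Sketch of a Schelling-game adjudication round: each juror stakes
    `stake`; jurors who voted with the majority split the forfeited
    stakes of those who did not. (A hypothetical simplification of
    Kleros-style mechanisms; ties and appeal rounds are ignored.)"""
    tally = Counter(votes.values())
    majority, _ = tally.most_common(1)[0]
    winners = [j for j, v in votes.items() if v == majority]
    losers = [j for j, v in votes.items() if v != majority]
    pot = stake * len(losers)
    payoffs = {}
    for juror in votes:
        if juror in winners:
            payoffs[juror] = pot / len(winners)  # share of forfeited stakes
        else:
            payoffs[juror] = -stake  # stake forfeited to the majority
    return majority, payoffs

majority, payoffs = schelling_payoffs({"j1": "A", "j2": "A", "j3": "B"})
print(majority, payoffs)  # A {'j1': 0.5, 'j2': 0.5, 'j3': -1.0}
```

Note that nothing in the payoff rule references the merits of the case: each juror is paid for predicting the majority vote, which is exactly the sense in which a system trained on such rewards is a "case outcome predictor" rather than an ethical decision algorithm.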

We can see how this mode of reasoning might apply, say, in the case of autonomous road vehicles, where the algorithm (doubtless to the delight of auto insurers) was trained to produce the outcomes, in the case of a crash, which would generate the lowest tort liability (civil claim) for the owner/insurer. This might well align with local cultural as well as legal norms (assuming the applicable tort law system corresponded reasonably well to the populace's culturally defined ethical systems), but such a correspondence would be incidental to the operation of the decision algorithm and would certainly not reflect any "universal" system of human values.

Finally, I would question whether this agenda (valuable though it is) is really social science. It looks a lot more like cognitive science. This is not a merely semantic distinction. The "judicial model" of AI ethics which is proposed here is, I think, so circumscribed as to produce results of only limited applicability (as in the example above). Specifically, the origin of ethical-decisional cognition is always already "social" in a sense which this process does not really capture: not only are ethical judgements made in their particular historical-geographical-cultural context, they also arise from the interaction between agent and world in a way which the highly restricted domain proposed here effectively (and deliberately?) excludes.

Thus while the methodology proposed here would generate somewhat useful results (we might, for instance, design a RoboJuror to earn tokens in a Kleros-like distributed dispute resolution system, based on the Schelling bet principle), it is questionable whether it advances the goal of friendly AI in itself. AI Safety does need social scientists - but they need to study more radically social processes, in more embedded ways, than is proposed here.

I'm glad to see this, although I think it still doesn't go far enough. I expect that social science has much to say about minds in general, even though it currently focuses mainly on humans and non-human Earth animals: it is full of insights about how complex "thinking" systems work, and if we asked it to speak about minds or agents in general by weakening certain assumptions, we'd find many interesting ideas to guide safety work. True, some of this has already happened, and my impression is that so far people aren't much impressed by what it's produced, but I think that reflects the infancy of AI, which is currently working on things that are not very mind-like yet, rather than a general lack of applicability (although this depends greatly on what you think AI systems of the future will look like).

FWIW, I also find something slightly patronizing about the general tone of this piece, as if it's not entirely saying "hey, here's how you can help" but rather "hey, here's how you're allowed to help", but maybe that's just me.

The authors propose to predict how well AI alignment schemes will work in the future by running experiments today with humans playing the part of AIs, in large part to obtain negative results that would tell us which alignment approaches to avoid. For example, if DEBATE works poorly with humans playing the AI roles, then we might predict that it will also work poorly with actual AIs. This reminds me of Eliezer's AI-boxing experiments, in which Eliezer played the role of an AI trying to talk a gatekeeper into letting it out of its box. It seems they missed an opportunity to cite that as an early example of the kind of experiment they're proposing.

In contrast to Eliezer, however, the authors here are hoping to obtain more fine-grained predictions and positive results. In other words, Eliezer was trying to show that an entire class of safety techniques (i.e., boxing) is insufficiently safe, not to select the most promising boxing techniques out of many candidates, whereas the authors are trying to do the equivalent of the latter here:

> If social science research narrows the design space of human-friendly AI alignment algorithms but does not produce a single best scheme, we can test the smaller design space once the machines are ready.

I wish the authors had said more about why testing with actual AIs will be safe to do and safe to rely on. If such testing is not very safe, then we probably need to narrow down the design space much more than can be done through the kind of experimentation proposed in the paper; strong theoretical guidance would be one way to do that. Of course, in that case this kind of experimentation (I wish there were a pithier name for it) would still be useful as an additional check on the theory, but there wouldn't be a need to do so much of it.

> However, even if we can’t achieve such absolute results, we can still hope for relative results of the form “debate structure A is reliably better than debate structure B″. Such a result may be more likely to generalize into the future, and assuming it does we will know to use structure A rather than B.

I'm skeptical of this, because advanced AIs are likely to have a very different profile of capabilities from humans, which may cause the best debate structure for humans to differ from the best debate structure for AIs. (But if there are a lot of resources available to throw at the alignment problem, as the paper suggests, and assuming that the remaining uncertainty can be safely settled through testing with actual AIs, having such results is still better than not having them.)
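The worry here is a ranking reversal, which a toy sketch makes concrete (the structures, profiles, and scores below are all invented for illustration):

```python
# Hypothetical scores for two debate structures evaluated under two
# capability profiles. Suppose structure A rewards persuasive rhetoric
# (a relative human strength) while structure B rewards exhaustive case
# analysis (easier for machines). All numbers are invented.
scores = {
    "human": {"A": 0.8, "B": 0.6},  # A looks reliably better in human trials
    "ai":    {"A": 0.5, "B": 0.9},  # but B is better with actual AI debaters
}

def best_structure(profile):
    """Return the higher-scoring debate structure for a capability profile."""
    return max(scores[profile], key=scores[profile].get)

print(best_structure("human"), best_structure("ai"))  # A B
```

If capability profiles can flip the ordering like this, a relative result of the form "A is reliably better than B" established with human participants need not generalize to AI debaters.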

> Inaction or indecision may not be optimal, but it is hopefully safe, and matches the default scenario of not having any powerful AI system.

I want to note my disagreement here, but it's not a central point of the paper, so I should probably write a separate post about it.