Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).

One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.

The projects, in alphabetical order:

Most contemporary AI systems base their decisions solely on consequences, whereas humans also consider other morally relevant factors, including rights (such as privacy), roles (such as in families), past actions (such as promises), motives and intentions, and so on. Our goal is to build these additional morally relevant features into an AI system. We will identify morally relevant features by reviewing theories in moral philosophy, conducting surveys in moral psychology, and using machine learning to locate factors that affect human moral judgments. We will use and extend game theory and social choice theory to determine how to make these features more precise, how to weigh conflicting features against each other, and how to build these features into an AI system. We hope that eventually this work will lead to highly advanced AI systems that are capable of making moral judgments and acting on them.

Techniques: Top-down design, game theory, moral philosophy

Previous work in economics and AI has developed mathematical models of preferences, along with algorithms for inferring preferences from observed actions. [Citation of inverse reinforcement learning] We would like to use such algorithms to enable AI systems to learn human preferences from observed actions. However, these algorithms typically assume that agents take actions that maximize expected utility given their preferences. This assumption of optimality is false for humans in real-world domains. Optimal sequential planning is intractable in complex environments and humans perform very rough approximations. Humans often don't know the causal structure of their environment (in contrast to MDP models). Humans are also subject to dynamic inconsistencies, as observed in procrastination, addiction and in impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite the suboptimality of humans and the behavioral biases that influence human choice. We will test our algorithms on real-world data and compare their inferences to people's own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.

Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments
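
To make the optimality assumption concrete, here is a minimal sketch of one common way to relax it: instead of assuming the person always picks the best option, assume they pick each option with probability proportional to exp(beta * utility) (a softmax, or "noisily rational", choice model). This is not the method of the grant above, only an illustration of the general idea. The snack data, the function name fit_utilities, and the parameters beta, lr, and steps are all hypothetical.

```python
import math

# Hypothetical observations: (options offered, option picked).
# The person mostly, but not always, prefers the healthier snack.
observations = (
    [(("salad", "cake"), "salad")] * 7
    + [(("salad", "cake"), "cake")] * 3
    + [(("fruit", "cake"), "fruit")] * 6
    + [(("fruit", "cake"), "cake")] * 4
)

def fit_utilities(observations, beta=1.0, lr=0.1, steps=3000):
    """Maximum-likelihood utilities under P(pick a) proportional to exp(beta * u[a])."""
    items = {o for options, _ in observations for o in options}
    u = {item: 0.0 for item in items}
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to each utility.
        grad = {item: 0.0 for item in items}
        for options, picked in observations:
            weights = [math.exp(beta * u[o]) for o in options]
            z = sum(weights)
            grad[picked] += beta
            for o, w in zip(options, weights):
                grad[o] -= beta * w / z
        # Plain gradient ascent, averaged over the observations.
        for item in items:
            u[item] += lr * grad[item] / len(observations)
    return u

fitted = fit_utilities(observations)
# Items ordered from most to least preferred under the fitted utilities.
print(sorted(fitted, key=fitted.get, reverse=True))
```

A strictly optimal chooser would never pick cake over salad if salad had higher utility, so the three "cake" choices would make the data inconsistent with any utility function. The softmax model instead treats them as noise, but it still cannot tell a systematic bias (say, procrastination) apart from a genuine preference, which is the gap the proposal is aimed at.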

The future will see autonomous agents acting in the same environment as humans, in areas as diverse as driving, assistive technology, and health care. In this scenario, collective decision making will be the norm. We will study the embedding of safety constraints, moral values, and ethical principles in agents, within the context of hybrid human/agents collective decision making. We will do that by adapting current logic-based modelling and reasoning frameworks, such as soft constraints, CP-nets, and constraint-based scheduling under uncertainty. For ethical principles, we will use constraints specifying the basic ethical "laws", plus sophisticated prioritised and possibly context-dependent constraints over possible actions, equipped with a conflict resolution engine. To avoid reckless behavior in the face of uncertainty, we will bound the risk of violating these ethical laws. We will also replace preference aggregation with an appropriately developed constraint/value/ethics/preference fusion, an operation designed to ensure that agents' preferences are consistent with the system's safety constraints, the agents' moral values, and the ethical principles of both individual agents and the collective decision making system. We will also develop approaches to learn ethical principles for artificial intelligent agents, as well as predict possible ethical violations.

Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles
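
As a rough illustration of what "basic laws plus prioritised constraints with conflict resolution" could look like, here is a toy sketch. It is my own guess at the shape of such a system, not the grant's design: hard constraints filter actions out entirely, and conflicts among soft constraints are resolved lexicographically by priority. The Constraint class, choose_action, and the driving scenario are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    priority: int                      # lower number = more important
    satisfied: Callable[[dict], bool]  # does this action respect the constraint?

def choose_action(actions, hard, soft):
    """Pick an action that obeys every hard constraint and violates
    the least important soft constraints (lexicographic by priority)."""
    allowed = [a for a in actions if all(c.satisfied(a) for c in hard)]
    if not allowed:
        return None  # no lawful action; the caller needs a fallback
    ordered = sorted(soft, key=lambda c: c.priority)

    def violations(action):
        # Tuple comparison checks the highest-priority constraint first.
        return tuple(0 if c.satisfied(action) else 1 for c in ordered)

    return min(allowed, key=violations)

# Hypothetical driving scenario.
actions = [
    {"name": "continue", "harms_human": True, "breaks_law": False, "late": False},
    {"name": "brake_and_pull_over", "harms_human": False, "breaks_law": False, "late": True},
    {"name": "cross_solid_line", "harms_human": False, "breaks_law": True, "late": False},
]
hard = [Constraint("no_harm", 0, lambda a: not a["harms_human"])]
soft = [
    Constraint("obey_traffic_law", 1, lambda a: not a["breaks_law"]),
    Constraint("arrive_on_time", 2, lambda a: not a["late"]),
]

best = choose_action(actions, hard, soft)
print(best["name"])
```

Lexicographic ordering is the crudest possible conflict resolution: no amount of lateness can ever outweigh one traffic violation. The weighted or context-dependent schemes mentioned in the abstract would replace the violations function, and the subtleties of real human values would live almost entirely in how the constraints themselves are written.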

The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand in particular the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning, allowing for the different action spaces of humans and machines and the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.

Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is

Autonomous AI systems will need to understand human values in order to respect them. This requires having similar concepts as humans do. We will research whether AI systems can be made to learn their concepts in the same way as humans learn theirs. Both human concepts and the representations of deep learning models seem to involve a hierarchical structure, among other similarities. For this reason, we will attempt to apply existing deep learning methodologies for learning what we call moral concepts, concepts through which moral values are defined. In addition, we will investigate the extent to which reinforcement learning affects the development of our concepts and values.

Techniques: Trying to identify learned moral concepts, unsupervised learning 


The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics is dependent on human preferences in general, which are dependent on a model of the world and how actions impact it. Calling an action ethical can also depend on the space of possible actions, requiring a good judgment-maker to be capable of searching for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.

Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in terms of one paragraph-worth of words, or only producing good outputs for situations similar to some combination of trained situations.

Two of the proposals are centered on top-down construction of a system for making ethical judgments. When designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, such a system seems weak at generalization to novel situations, unless the specific sort of generalization has been foreseen and covered. The good points of a top-down approach are that it can capture things that are important but are only a small part of the description, or are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.

Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of having to know a very good model of the world in order to be good at human preferences rears its head here. There are also likely unforeseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.

Some proposals want to take advantage of learning using neural networks, trained on people's actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach, like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.

6 comments

Good summary. But concerning your final point:

For this approach like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.

I suspect this is impossible in principle, because human values are dependent on our models of the world.

The key is to develop methods that scale, where values become aligned as the world model approaches a human level of capability.

But then there is scope, apparently unexplored so far, for finding morally relevant subsets of value. You don't have to see everything through the lens of utilitarianism.

I'm an advocate of this approach in general for a number of reasons, and it's typically how I explain the idea of FAI to people without seeming like a prophet of the end times. Most of the reasons I like value-learning focus on what happens before a super-intelligence or what happens if a super-intelligence never comes into being.

I am strongly of the opinion that real-world testing and real-world application of theoretical results often expose totally unanticipated flaws, and it seems that for the value-learning problem, partial/incomplete solutions are still tremendously useful. This means that progress on the value-learning problem is likely to attract lots of attention and resources, and that consequently proposed solutions will be more thoroughly tested in the real world.

Some of the potential advantages:

  1. Resources: It seems like there's a strong market incentive for understanding human preferences in the form of various recommendation engines. The ability to discern human values, even partially, translates well into any number of potentially useful applications. Symptoms of success in this type of research will almost certainly attract the investment of substantial additional resources to the problem, which is less obviously true for some of the other research directions.

  2. Raising the sanity waterline: Machines aren't seen as competitors for social status and it's typically easier to stomach correction from a machine than from another person. The ability to share preferences with a machine and get feedback on the values those preferences relate to would potentially be an invaluable tool for introspection. It's possible that this could result in people being more rational or even more moral.

  3. Translation: Humans have never really tried to translate human values into a form that would be comprehensible to a non-human before. Value learning is a way to give humans practice discovering/explaining their values in precise ways. This, to my mind, is preferable to the alternative approach of relying on a non-human actor to successfully guess human morality. One of my human values is for humans to have a role in shaping the future, and I'd feel much more comfortable if we got to contribute in a meaningful way to the estimate of human values held by any future super-intelligence.

  4. Relative Difficulty: The human values problem is hard, but discovering human values from data is probably much harder than just learning/representing human values. Learning quantum mechanics is hard, but the discovery of the laws of quantum mechanics was much, much more difficult. If we can get the human values problem small enough to make it into a seed AI, the chances of AI friendliness increase dramatically.

I haven't taken the time here to consider in detail how the approaches outlined in your post interact with some of these advantages, but I may try and revisit them when I have the opportunity.

I feel like a mixed approach is the most desirable. There is a risk that if the AI is allowed to simply learn from humans, we might get a greedy AI that maximizes its Facebook experience while the rest of the world keeps dying of starvation and wars. Also, our values probably evolve with time (slavery, death penalty, freedom of speech...) so we might as well try and teach the AI what our values should be rather than what they are right now. Maybe then it's a case of developing a top-down, high-level ethical system and using it to seed a neural network that then picks up patterns in more detailed scenarios?

It feels to me like you are straying off the technical issues by looking at a huge picture.

In this case, a picture so huge it's unsolvable. So here's an assertion which might be interesting: It's better to focus on clusters of small, manageable machine-ethics problems and gradually build up to a Grand Scheme, or more likely in my guess, a Grand Messy But Workable System, rather than teasing out a Bible of global ethical abstraction. There's no working consensus on ethical rules anyway, outside the Three Laws.

An example, maybe already solved: autonomous cars are coming quite soon, much sooner than most of us thought. Several people have wondered about the machine ethics of a car in a crash situation, assuming you accept Google's position that humans will never react fast enough to resume control. Various trolley-problem-like scenarios of minimizing irrevocable hurt to humans have been kicked around. But I think I already read a solution to the decision problem in the discussion:

a) Ethical-decisions-during-crash is going to be a very rare occurrence. b) The over-all reduction in accidents is much more significant than a small subset of accidents theoretically made worse by the robot cars. c) Humans can't agree on complex algorithms for the hypothetical proposed scenarios anyway. d) Machines always need a default mode when the planned-for reactions conflict.

So if you accept a-d above then you'll probably agree that simply having the car slow to a stop and pull over to the side as best it can is the default which will produce the least damage. This is the same routine to follow if the car comes upon debris in the road, a wreck, confusing safety beacons, some catastrophe with the road itself and so forth. It's pretty much what you'd tell your teenager to do.

But I think there are lessons to draw from the robot cars: 1) The robot, though fully autonomous in everyday situations, will encounter, in an accident, an ever-narrowing range of options in its decision tree, so that it will end up with the default option only. In contrast, a human will panic and take action which often adds options to an already overloaded decision tree, options which can't be evaluated in real time and whose outcomes are probably worse than just stopping as fast as possible anyway. 2) Robots don't have to be perfect, they just have to be better than humans in the aggregate, and, see #1, default to avoiding action when disaster strikes. 3) Once you get to #2, then you are already better than humans and therefore saving lives and property. At this point the engineers can further tune the robot to improve gradually.

So what about the paper-clip-monster, the AGI that wants to run the world and, most importantly, writes its own code? I agree it could be done in theory, just as we'll surely have computers running artificial evolution scenarios with DNA, and data-mining/surveillance on a scale so huge it makes the Stasi look like kindergarten. But as everyone has noted, writing your own code is utterly uncharted territory. A lot of LW commentators treat the prospect with myth: they propose an AGI that is better described as an alien overlord than a machine. Myth may be the only way humans can wrap their brains around an idea so big. Engineers won't even try. They'll break the problem up into bits, do a lot of error-checking at a level of action they do understand, and run it in the lab to see what happens. For instance, if there is still a layered approach to software, the OS might have the safety mechanisms built in, and maybe won't be self-upgradable, while the self-written code will run in apps that rely on the OS; then after a hundred similar steps of divide-and-conquer the system will be useful and controllable. But truly, I too am just hand-waving in a vacuum. Please continue...

I think the huge picture is pretty important to look at. If we know the goal is far away, then we know that current projects are not going to get their usefulness from solving the whole problem. But that's fine, there are plenty of other uses for projects. Among others:

  • Early attempts can serve as landmarks for following ones, to help understand the problem.
  • Projects might work on implementing pieces that seem potentially useful given the big picture, like scaling with the skill of an externally-given world-model (and then this list can be applied to those sub-problems).
  • Blue-sky research on things that seem to have potential, without a mind to immediate application. Learning that the goal is far away means we want more blue-sky research.
  • Projects might be somewhere in-between, trying to integrate a novel development into current well-performing systems.