Epistemic status: One part quotes (informative, accurate), one part speculation (not so accurate).
One avenue towards AI safety is the construction of "moral AI" that is good at solving the problem of human preferences and values. Five FLI grants have recently been funded that pursue different lines of research on this problem.
The projects, in alphabetical order:
Most contemporary AI systems base their decisions solely on consequences, whereas humans also consider other morally relevant factors, including rights (such as privacy), roles (such as in families), past actions (such as promises), motives and intentions, and so on. Our goal is to build these additional morally relevant features into an AI system. We will identify morally relevant features by reviewing theories in moral philosophy, conducting surveys in moral psychology, and using machine learning to locate factors that affect human moral judgments. We will use and extend game theory and social choice theory to determine how to make these features more precise, how to weigh conflicting features against each other, and how to build these features into an AI system. We hope that eventually this work will lead to highly advanced AI systems that are capable of making moral judgments and acting on them.
Techniques: Top-down design, game theory, moral philosophy
Previous work in economics and AI has developed mathematical models of preferences, along with algorithms for inferring preferences from observed actions. [Citation of inverse reinforcement learning] We would like to use such algorithms to enable AI systems to learn human preferences from observed actions. However, these algorithms typically assume that agents take actions that maximize expected utility given their preferences. This assumption of optimality is false for humans in real-world domains. Optimal sequential planning is intractable in complex environments and humans perform very rough approximations. Humans often don't know the causal structure of their environment (in contrast to MDP models). Humans are also subject to dynamic inconsistencies, as observed in procrastination, addiction and in impulsive behavior. Our project seeks to develop algorithms that learn human preferences from data despite the suboptimality of humans and the behavioral biases that influence human choice. We will test our algorithms on real-world data and compare their inferences to people's own judgments about their preferences. We will also investigate the theoretical question of whether this approach could enable an AI to learn the entirety of human values.
Techniques: Trying to find something better than inverse reinforcement learning, supervised learning from preference judgments
The future will see autonomous agents acting in the same environment as humans, in areas as diverse as driving, assistive technology, and health care. In this scenario, collective decision making will be the norm. We will study the embedding of safety constraints, moral values, and ethical principles in agents, within the context of hybrid human/agents collective decision making. We will do that by adapting current logic-based modelling and reasoning frameworks, such as soft constraints, CP-nets, and constraint-based scheduling under uncertainty. For ethical principles, we will use constraints specifying the basic ethical ``laws'', plus sophisticated prioritised and possibly context-dependent constraints over possible actions, equipped with a conflict resolution engine. To avoid reckless behavior in the face of uncertainty, we will bound the risk of violating these ethical laws. We will also replace preference aggregation with an appropriately developed constraint/value/ethics/preference fusion, an operation designed to ensure that agents' preferences are consistent with the system's safety constraints, the agents' moral values, and the ethical principles of both individual agents and the collective decision making system. We will also develop approaches to learn ethical principles for artificial intelligent agents, as well as predict possible ethical violations.
Techniques: Top-down design, obeying ethical principles/laws, learning ethical principles
The objectives of the proposed research are (1) to create a mathematical framework in which fundamental questions of value alignment can be investigated; (2) to develop and experiment with methods for aligning the values of a machine (whether explicitly or implicitly represented) with those of humans; (3) to understand the relationships among the degree of value alignment, the decision-making capability of the machine, and the potential loss to the human; and (4) to understand in particular the implications of the computational limitations of humans and machines for value alignment. The core of our technical approach will be a cooperative, game-theoretic extension of inverse reinforcement learning, allowing for the different action spaces of humans and machines and the varying motivations of humans; the concepts of rational metareasoning and bounded optimality will inform our investigation of the effects of computational limitations.
Techniques: Trying to find something better than inverse reinforcement learning (differently this time), creating a mathematical framework, whatever rational metareasoning is
Autonomous AI systems will need to understand human values in order to respect them. This requires having similar concepts as humans do. We will research whether AI systems can be made to learn their concepts in the same way as humans learn theirs. Both human concepts and the representations of deep learning models seem to involve a hierarchical structure, among other similarities. For this reason, we will attempt to apply existing deep learning methodologies for learning what we call moral concepts, concepts through which moral values are defined. In addition, we will investigate the extent to which reinforcement learning affects the development of our concepts and values.
Techniques: Trying to identify learned moral concepts, unsupervised learning
The elephant in the room is that making judgments that always respect human preferences is nearly FAI-complete. Application of human ethics is dependent on human preferences in general, which are dependent on a model of the world and how actions impact it. Calling an action ethical also can also depend on the space of possible actions, requiring a good judgment-maker to be capable of search for good actions. Any "moral AI" we build with our current understanding is going to have to be limited and/or unsatisfactory.
Limitations might be things like judging which of two actions is "more correct" rather than finding correct actions, only taking input in terms of one paragraph-worth of words, or only producing good outputs for situations similar to some combination of trained situations.
Two of the proposals are centered on top-down construction of a system for making ethical judgments. Designing a system by hand, it's nigh-impossible to capture the subtleties of human values. Relatedly, it seems weak at generalization to novel situations, unless the specific sort of generalization has been forseen and covered. The good points of a top down approach are that it can capture things that are important, but are only a small part of the description, or are not easily identified by statistical properties. A top-down model of ethics might be used as a fail-safe, sometimes noticing when something undesirable is happening, or as a starting point for a richer learned model of human preferences.
Other proposals are inspired by inverse reinforcement learning. Inverse reinforcement learning seems like the sort of thing we want - it observes actions and infers preferences - but it's very limited. The problem of having to know a very good model of the world in order to be good at human preferences rears its head here. There are also likely unforseen technical problems in ensuring that the thing it learns is actually human preferences (rather than human foibles, or irrelevant patterns) - though this is, in part, why this research should be carried out now.
Some proposals want to take advantage of learning using neural networks, trained on peoples' actions or judgments. This sort of approach is very good at discovering patterns, but not so good at treating patterns as a consequence of underlying structure. Such a learner might be useful as a heuristic, or as a way to fill in a more complicated, specialized architecture. For this approach like the others, it seems important to make the most progress toward learning human values in a way that doesn't require a very good model of the world.