0: Introduction

In a post a few months ago on pointing to environmental goals, Abram Demski reminded me of the appeal of defining good behavior by extensional examples. He uses the example of building bridges. The AI does a bunch of unsupervised learning to explore the simulation environment, so that when humans show it just a few labeled examples of good bridges, it will have pre-learned some high-level concepts that let it easily classify good bridges.

Unfortunately, this doesn't work, as Abram explains. But it seems like it should, in some sense - it seems like we have a basic approach that would work if only we understood some confusing details better. Maybe that's not so, but I think it's worth some effort.

One way of looking at this issue is that we're trying to understand concept learning - how the AI can emulate a human understanding of the world. Another way is as implementing an understanding of reference - giving the AI examples is an attempt to point at some "thing," and we want the AI to take this as a cue to find the concept being pointed at, not just look at the "finger" doing the pointing.

Over the last couple of months I've been reading and thinking on and off about reference, and I've got about three posts' worth of thoughts. This post will try to communicate what kind of value learning scheme I'm even talking about, point out some flaws, and provide a little background. The second post will start speculating about ways to get around some of these flaws and will probably be the most applicable to AI, and the third post will be about humans and philosophy of reference.

1: Second Introduction

The goal, broadly, is to build an AI that satisfies human values. But no AI is going to know what human values are, or what it means to satisfy them, unless we can communicate those things, and it can learn them.

The impossible method to do this is to write down what it means to satisfy human values as a long list of program instructions. Most relevant to this post, it's impossible because nobody can write down human values by hand - we embody them, but we can't operationalize them any more than we can write down the frequency spectrum of the sounds we hear.

If we can't duplicate human values by hand, the only remaining option seems to be machine learning. The human has some complicated definition of "the right thing," and we just need to use [insert your favorite method] to teach this concept to the AI. The only trouble is that we're still a little bit fuzzy on how to define "human," "has," "definition," and "teach" in that sentence.

Still, value learning intuitively seems promising. It's like how, if you don't speak the same language as someone, you can still communicate by pointing. Given an AI with a comprehensive model of the world, it seems like we should be able to give it examples of human values being satisfied and say, somehow, "do that stuff."

To be more concrete, we might imagine a specific AI. Not something at the pinnacle of capability, just a toy model. This AI is made of three parts:

  • An unsupervised learning algorithm that learns a model of the world and rules for predicting the future state of the model.
  • A supervised algorithm that takes some labeled sensory examples of good behavior, plus the model of the world learned by the unsupervised algorithm, and tries to classify which sequences of states of the model are good.
  • To take actions, the AI just follows strategies that result in strongly classified-good states of its predictive model.
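
To make the three parts a little more tangible, here is a minimal sketch in Python. It assumes a ten-cell line world that is not part of the design above, and every name in it (`true_dynamics`, `learn_world_model`, `learn_classifier`, `plan`) is a hypothetical stand-in. The nearest-centroid scoring rule in particular is just an assumption picked for brevity, not a claim about how the real parts would work.

```python
from collections import Counter, defaultdict
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical toy actions: step left, stay, step right


def true_dynamics(state, action):
    """Hidden environment (assumed): a line of cells 0..9 with walls at both ends."""
    return min(9, max(0, state + action))


def learn_world_model(transitions):
    """Part 1 (unsupervised): tabulate observed (state, action) -> next_state."""
    counts = defaultdict(Counter)
    for state, action, next_state in transitions:
        counts[(state, action)][next_state] += 1
    # Predict the most frequently observed successor for each (state, action).
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def learn_classifier(labeled_states):
    """Part 2 (supervised): score a state by how much closer it is to the
    mean of the good examples than to the mean of the bad ones."""
    good = [s for s, is_good in labeled_states if is_good]
    bad = [s for s, is_good in labeled_states if not is_good]
    good_mean = sum(good) / len(good)
    bad_mean = sum(bad) / len(bad)
    return lambda state: abs(state - bad_mean) - abs(state - good_mean)


def plan(model, score, start, horizon=4):
    """Part 3: choose the action sequence whose predicted end state scores best."""
    def rollout(actions):
        state = start
        for action in actions:
            state = model[(state, action)]
        return state
    return max(product(ACTIONS, repeat=horizon), key=lambda seq: score(rollout(seq)))


# Unlabeled experience for the world model, then a handful of labeled examples.
transitions = [(s, a, true_dynamics(s, a)) for s in range(10) for a in ACTIONS]
model = learn_world_model(transitions)
score = learn_classifier([(7, True), (8, True), (1, False), (2, False)])
best = plan(model, score, start=3)
```

In this toy run the planner walks right, toward the cells labeled good. The learned score also rates cell 9 as highly as cell 8 even though nobody labeled it: the planner optimizes whatever the classifier happened to learn, not what the labeler had in mind.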

We're still many breakthroughs away from knowing how to build those parts, but if we assume they'll work, we can get a picture of an AI that has a complicated predictive model of the world, then tries to find the commonalities of the training examples and push the world in that direction. What could go wrong?

2: What could go wrong?

I have a big ol' soft spot for that AI design. But it will immediately, deeply fail. The thing it learns to classify is simply not going to be what we wanted it to learn. We're going to show it examples that, from our perspective, are an extensional definition of satisfying human values. But the concept we're trying to communicate is a very small target to hit, and there are many other hypotheses that match the data about as well.

Just as deep learning to recognize images will learn to recognize the texture of fur, or the shape of a dog's eye, but might not learn the silhouette of the entire dog, the classifier can do well on training examples without needing to learn all the features we associate with human value. And just as an image-recognizer will think that the grass in the background is an important part of being a dog, the classifier will learn things from examples that we think of as spurious.
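
The grass problem is easy to reproduce in miniature. The sketch below assumes each image has already been boiled down to two made-up binary features, and uses a one-feature "decision stump" as a stand-in learner. The data, the feature names, and `fit_stump` are all hypothetical and chosen only to illustrate the failure.

```python
def fit_stump(examples):
    """Return the index of the single feature that best matches the labels."""
    n_features = len(examples[0][0])

    def training_accuracy(i):
        return sum(features[i] == label for features, label in examples) / len(examples)

    return max(range(n_features), key=training_accuracy)


# Hypothetical features: (silhouette_visible, grass_in_background); label: is_dog.
train = [
    ((1, 1), 1), ((1, 1), 1), ((0, 1), 1),  # dogs on grass, one partly hidden
    ((0, 0), 0), ((0, 0), 0), ((0, 0), 0),  # indoor scenes with no dog
]

chosen = fit_stump(train)  # index of the feature the learner relies on


def predict(features):
    """Classify using only the chosen feature."""
    return features[chosen]


dog_on_beach = (1, 0)
cat_on_lawn = (0, 1)
```

On this training set the grass feature fits the labels perfectly while the silhouette feature misses the hidden dog, so the stump picks grass. It then calls the dog on the beach "not a dog" and the cat on the lawn "dog," despite being flawless on every example we showed it.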

The AI building its own world-model from unlabeled observations will help with these problems the same way that providing more data would, but it doesn't provide a principled solution. There will still be no exact analogue of the human concept we want to communicate, because of missing or spurious features. Or the AI might use a different level of abstraction than we expected - humans view the world through a particular way of chunking atoms into larger objects and a particular way of modeling other humans. Our examples might be more similar when considered in terms of features we didn't even think of.

Even worse, in some sense we are hoping that the AI isn't smart enough to learn the true explanation for the training examples, which is that humans picked them. We're trying to communicate goodness, not "the sort of thing humans select for the training set." To the extent that humans are not secure systems, there are adversarial examples that would get us to include them in the training set without being good. We might imagine "marketing examples" optimized for persuasiveness at the cost of goodness, or a series of flashing lights that would have caused you to hit the button to include it in the training set. This failure is the AI design being coded to look at the pointing finger, not the object pointed at.

All of these problems show up across many agent designs, implying that we are doing something wrong and don't know how to do it right. What's missing is the ability to do reference - to go from referring speech-acts to the thing being referred to. In order to figure out what humans mean, the AI should really reason about human intention and human categories (Dennett's intentional stance), and we have to understand the AI's reasoning well enough to connect it to the motivational system before turning the AI on.

3: Related ideas

The same lack of understanding that keeps us from just telling an AI "Do what I mean!" also appears in miniature whenever we're trying to teach concepts to an AI. MIRI uses the example of a diamond-maximizing AI as something that seems simple but requires communicating a concept ("diamond") to the AI, particularly in a way that's robust to ontological shifts. Abram Demski uses the example of teaching an AI to build good bridges, something that's easy to approximate with current machine learning methods, but may fail badly if we hook that approximation up to a powerful agent. On the more applied end, a recent highlight is IBM training a recommendation system to learn guidelines from examples.

All those examples might be thought of as "stuff" - diamond, or good bridges, or age-appropriate movies. But we also want the AI to be able to learn about processes. This is related to Dylan Hadfield-Menell et al.'s work on cooperative inverse reinforcement learning (CIRL), which uses the example of motion on a grid (as is common for toy problems in reinforcement learning - see also DeepMind's AI safety gridworlds).

There are also broad concepts, like "love," which seem important to us but which don't seem to be stuff or processes per se. We might imagine cashing out such abstractions in terms of natural language processing and verbal reasoning, or as variables that help predict stuff and processes. These will come up later, because it does seem reasonable that "human flourishing" might be this sort of concept.

4: Philosophy! *shakes fist*

This reference issue is clearly within the field of philosophy. So it would be really wonderful if we could just go to the philosophy literature and find a recipe for how an AI needs to behave if it's to learn human referents from human references. Or at least it might have some important insights that would help with developing such a recipe. I thought it was worth a look.

Long story short, it wasn't. The philosophy literature on reference is largely focused on reference as a thing that inheres in sentences and other communications. Here's how silly it can get: it is considered a serious problem (by some) how, if there are multiple people named Vanya Ivanova, your spoken sentence about Vanya Ivanova can figure out which one it should really be referring to, so that it can have the right reference-essence.

Since computers can't perceive reference-essence, what I was looking for was some sort of functional account of how the listener interprets references. And there are certainly people who've been thinking more in this direction. Gricean implicature and so on. But even here, people like Kent Bach, sharp people who seem to be going in the necessary direction, aren't producing work that looks to be of use to AI. The standards of the field just don't require you to be that precise or that functionalist.

5: What this sequence isn't

This post has been all about setting the stage and pointing out problems. We started with this dream of an AI design that learns to classify strategies as human-friendly based on a small number of examples of human-friendly actions or states, plus a powerful world-model. And then we immediately got into trouble.

My purpose is not to defend or fix this specific design-dream. It's to work on the deeper problem that lies behind many individual problems with this design. And by that I mean our ignorance and confusion about how an AI should implement the understanding of reference.

In fact our example AI probably isn't stably self-improving, or corrigible in Eliezer's sense of fully updated deference, or human-legible, or fail-safe-ish if we tell the AI the wrong thing. And that's fine, because that's not what the sequence is about. The question at hand is how to tell the AI anything at all, and have it understand what we meant, as we meant it.

3 comments

So it would be really wonderful if we could just go to the philosophy literature and find a recipe for how an AI needs to behave if it’s to learn human referents from human references. Or at least it might have some important insights that would help with developing such a recipe.

I found Brian Cantwell Smith's On the Origin of Objects to be quite insightful on reference. While it isn't a formal recipe for giving AIs the ability to reference, I think its insights will be relevant to such a recipe if one is ever created.

(some examples of insights from the book: all reference is indexical in a way similar to how physical forces such as magnetism are indexical; apparently non-indexical references are formed from indexical reference through a stabilization process similar to image stabilization; abstractions are meaningful and effective through translations between them and direct high-fidelity engagement.)

Thanks for the recommendation, I'll check it out. From the library.

EDIT: Aw, it's checked out.

Even worse, in some sense we are hoping that the AI isn't smart enough to learn the true explanation for the training examples, which is that humans picked them.

Ah right, I don't often think about this but it's a good point and a likely source of Goodharting as systems become more capable.