I promised in a previous post that I would give a post-mortem for a scheme for learning the intentional stance from natural language. This is that post. But first, I should explain why such an idea might seem good in the first place.
Some people think of AI as a genie. The goal of AI research, in this picture, is to "tell the AI what to do," sometimes explicitly in natural language. And then since the AI is smart, it will understand what we mean and do that, because to do something else would be stupid.
This is, in a sense, very naive. Making an AI that does what we want is not at all like instructing a human - see the relevant Eliezer post - the methods, dangers, and goals are all different. But... if the AI understood what we meant, maybe we could just tell it what to do.
Of course, "understood what we meant" captures more or less the whole problem, because meaning isn't like the charge of the electron, it's nowhere in the words themselves. When you understand moral language, you're implicitly using your morals. But what if we trained an AI so that it functionally understood moral language - would that be implicitly using your morals too, and isn't that exactly what we want?
1 - The Artificial Intentional Stance
I like to think of myself as having preferences, but at the same time I am made of atoms, and my preferences are not-like-the-charge-of-the-electron, they're nowhere in the atoms. Instead, my preferences are an abstraction that I (and others) use when thinking about me.
So part of this artificial intentional stance stuff can be summed up as: get the AI to think about humans like humans think about humans. (Another part is that abstractions are contagious. If I want to go to the gym, to handle this correctly you need abstractions not just for me but also for the gym.)
We often put too much magic into the word "understand." If the AI can hold a good conversation and extract real-world information from human speech, it's reasonable to say it understands what we're saying. And then once it understands us, you might think "communicating our goals to it is a lot like communicating with a human."
But it's easy to hold a decent conversation without taking the intentional stance towards humans, and easier still to extract real-world information from human speech without the intentional stance. This leads to problems that become clear if you try to take an AI that does a good job at modeling language, and follow step by step how to get it to choose actions that are good for humans.
The dream is to learn the intentional stance by using the information implicit in our use of language. The intentional stance requires picking out good levels of abstraction to model humans on, and using language implies that the good levels of abstraction are the ones humans implicitly use in language. Is this what we want? I don't know, it might be?
It certainly isn't the only option - we might imagine other schemes involving trying to amplify emulations of humans, semi-supervised learning from examples of good and bad behavior, or multi-stage chains of making increasingly trustworthy AIs. But the question is whether it's an option.
3 - Process
It's not hard to hook up a video camera to an image captioner to a deep reinforcement learner and say you can input goals with natural language because when you set the goal to "cat," your camera will look for cats. It's a lot harder to get that camera to look for what's best in life.
This is the bind I got myself into, writing this post. Value learning schemes that are simple are wrong (issues with the cat camera above are left as an exercise to the reader), and value learning schemes that seem promising have been selected for incomprehensibility and poor epistemic luck. So I tried to split the difference, though favoring interestingness over simplicity.
Here are some of the rules of thumb I used when thinking of ways to apply natural language processing to value learning:
First, I wanted to avoid the scheme having glaringly unspecified parts. It's very easy to be lazy and not specify something enough that it actually chooses actions, or feel like I've made progress but not be able to apply it. Usually either of these meant I was sweeping problems under the rug - the right level of specificity involves sweeping out some of those cobwebs.
Second, I needed to encourage myself to be specific about the intended purpose of natural language processing in each particular scheme. Yes, the dream is that it "includes common sense" or something like that, but that's not specific enough mental technology to tell whether you're solving the intended problem without unnecessary side effects, or to explain why different methods get different results.
It was profitable to think of natural language processing as being targeted at the problem of alien concepts: when the AI can match your training examples but still fails to generalize how you want because it's representing your examples in an alien way. For example, an image classifier might learn to distinguish dogs by the texture of their fur, but we're not going to be happy with how that generalizes to fur-less dogs or dog-less fur. Now replace "dogs" with "human values" and "fur" with "superficial features that work well on the training set."
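The fur example can be made concrete with a tiny sketch. Everything here is made up for illustration (the two binary features, the six training examples, and the one-feature "stump" learner are all assumptions, not anything a real image classifier does), but it shows how a learner can fit the training set perfectly and still pick the alien feature.

```python
# Toy illustration of an "alien concept". All features and data are made up.
# Each example is ((has_dog_shape, has_fur_texture), is_dog).
TRAIN = [
    ((1, 1), 1),
    ((1, 1), 1),
    ((1, 1), 1),
    ((0, 0), 0),
    ((0, 0), 0),
    ((1, 0), 0),  # a dog-shaped statue, so shape alone is slightly noisy
]

# Examples from outside the training distribution.
SHIFTED = [
    ((1, 0), 1),  # fur-less dog
    ((0, 1), 0),  # dog-less fur (a fur coat)
]

FEATURE_NAMES = ["dog_shape", "fur_texture"]


def accuracy(feature_index, data):
    """Fraction of examples where the chosen feature equals the label."""
    hits = sum(1 for features, label in data if features[feature_index] == label)
    return hits / len(data)


def fit_stump(data):
    """Pick the single feature that best predicts the label on the data."""
    n_features = len(data[0][0])
    return max(range(n_features), key=lambda i: accuracy(i, data))


chosen = fit_stump(TRAIN)
print("learned feature:", FEATURE_NAMES[chosen])
print("train accuracy:", accuracy(chosen, TRAIN))
print("shifted accuracy:", accuracy(chosen, SHIFTED))
```

On this toy data the learner settles on fur texture, because that is the feature that happens to separate the training set, and then gets both shifted examples wrong.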
An even more specific purpose of natural language would be "greedy reification" - actively trying to form concepts that correspond to linguistic tokens. So if we have a word "dog," we want to incentivize the AI to form a concept that picks out dogs in the world-model, and then the hope is that this also works on "human values."
4 - Mort
So here's a value learning scheme: try to squish the world and natural language into the same latent space, just with different input/output functions.
Training this simultaneous model might just be separately trying to do encoding-decoding or prediction tasks with sensory data and text, but more plausibly it should involve translation tasks where we can associate words with sensory environments. The model required is somewhat subtle, because we don't want words associated with raw sense data, we want words associated with the state of the AI's model of the world. This mandates that to the world-model, this latent space should look like the persistent state associated with sequence prediction or encoding-decoding of sequences of sense data, with transition dynamics partially included in the shared information. This means the language model should also look like sequence prediction or encoding with some local state consisting of the high-level features.
If I haven't said anything impossible so far, we could use sufficiently advanced technology to train this simultaneous model so that it's good at understanding both the world and language, and competent at turning one into the other when it comes to simple training examples. Can you now solve value learning by giving it a bunch of English descriptions of what we want ("human values satisfied," "do the right thing," "a fulfilling and cosmopolitan future for the galaxy," etc.), and coding it to choose actions that make the state of the world like that?
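Here is a minimal sketch of just the decision rule, reusing the cat camera as the example. It skips the hard part entirely: the "encoders" and the transition model are hand-written lookup tables standing in for trained components, and every name in it (`encode_text`, `encode_world`, `predict`, `choose_action`, the two-dimensional latent space) is a hypothetical placeholder rather than a proposal for how to build them.

```python
# Hypothetical stand-ins for learned components. In the scheme described
# above these would be trained models; here they are hand-written tables.
WORD_VECTORS = {
    "cat": (1.0, 0.0),
    "visible": (0.0, 1.0),
}


def encode_text(description):
    """Map a description into the shared latent space by summing word vectors."""
    vec = [0.0, 0.0]
    for word in description.lower().split():
        word_vec = WORD_VECTORS.get(word, (0.0, 0.0))
        vec = [a + b for a, b in zip(vec, word_vec)]
    return tuple(vec)


def encode_world(state):
    """Map a world-model state (not raw sense data) into the same latent space."""
    return (float(state["cat_present"]), float(state["camera_on_cat"]))


def predict(state, action):
    """Toy transition model: the state the world-model expects after an action."""
    new_state = dict(state)
    if action == "pan_to_cat" and state["cat_present"]:
        new_state["camera_on_cat"] = True
    elif action == "pan_away":
        new_state["camera_on_cat"] = False
    return new_state


def distance(u, v):
    """Euclidean distance between two latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def choose_action(state, goal_text, actions):
    """Pick the action whose predicted latent state is closest to the goal's."""
    goal = encode_text(goal_text)
    return min(actions, key=lambda a: distance(encode_world(predict(state, a)), goal))


state = {"cat_present": True, "camera_on_cat": False}
actions = ["wait", "pan_to_cat", "pan_away"]
print(choose_action(state, "cat visible", actions))
```

The point of the sketch is only that the goal is specified in text, the plan is evaluated in the world-model, and the comparison happens in the shared space. Whether a trained shared space would make "human values satisfied" as well-behaved as "cat visible" is exactly the open question.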
Looking on the bright side first, what advantages might this have?
- It leverages the structure of language in a way that might be what we want, by training its shared high-level features to be suitable for language modeling as well as world-modeling. This is that greedy reification property - the label "dog" gets a representation that captures the functional properties of the "dog" label in human speech and description of the world, which is at least a higher order of problem than the image classifier that identifies "dog" with the texture of dog fur.
- The knowledge of the world also impacts the verbal reasoning. If we ask it to do some difficult verbal task involving rotating imagined objects, the dream is that it could take advantage of some shared representation that would code for rotating objects on the world-model side, and so has the right dynamics.
- Scaling seems to work in our favor. To the extent that this model follows our naive notion of "understanding," the better it understands what we mean, the better job it will do.
- With small modifications, it could mimic CIRL's state of uncertainty about human values. It would need to keep track of (and properly update) uncertainty in its model and propagate that forward to uncertainty in its reward, which it can then maximize by picking strategies that account for value of information.
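To give a feel for the last point, here is a toy expected-value calculation. It is not CIRL itself, just the simpler "expected value of perfect information" idea that motivates it, and all the hypotheses, actions, and numbers are invented for illustration.

```python
# Hypothetical numbers throughout: two candidate reward functions the AI
# cannot yet tell apart, and three actions it could take.
BELIEF = {"humans_want_A": 0.5, "humans_want_B": 0.5}
REWARD = {
    "humans_want_A": {"do_A": 1.0, "do_B": -1.0, "hedge": 0.2},
    "humans_want_B": {"do_A": -1.0, "do_B": 1.0, "hedge": 0.2},
}
ACTIONS = ["do_A", "do_B", "hedge"]


def expected_reward(action, belief):
    """Reward of an action averaged over the hypotheses in the belief."""
    return sum(p * REWARD[h][action] for h, p in belief.items())


def best_action(belief):
    """Action with the highest expected reward under the current belief."""
    return max(ACTIONS, key=lambda a: expected_reward(a, belief))


def best_value(belief):
    """Expected reward of the best action under the current belief."""
    return expected_reward(best_action(belief), belief)


def value_of_asking(belief):
    """Expected gain from learning the true hypothesis before acting."""
    informed = sum(p * best_value({h: 1.0}) for h, p in belief.items())
    return informed - best_value(belief)


print("act now:", best_action(BELIEF), best_value(BELIEF))
print("value of asking first:", value_of_asking(BELIEF))
```

With these made-up numbers the uncertain agent hedges, and finding out which hypothesis is true is worth much more than the hedge itself, so a strategy that gathers information first comes out ahead as long as asking is cheap.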
What are some big issues with this? Take a second if you like.
No, really, I'd be interested in what people come up with on their own. I don't understand this family of schemes as well as I'd like.
- 1) The communication between the two sides could be incomplete. In fact, we would expect there to be parts of the shared latent space that one side ignores but that get treated as useful information by the other, if they get separate training. This is worrisome if we want to rely on translating a difficult verbal concept into an opinion on states of the world - what if key information is encoded in the features that the other encoder has learned to ignore? To the extent that the phrase "human values" does help us predict the world, it seems like this should be one of the problems that decreases with increasing competence, but I'm not totally sure.
- 2) This simple scheme has zero extra protection from unintended maxima. Goodhart's law rears its ugly head - if our target is a particular state of the model of the world, then maybe there are nearby world-states that score well on the objective function but are "unnatural" in some way. This is a general problem that appears any time you have a system trained on prediction but asked to do intervention. How does this square with using natural language to learn non-alien concepts? It seems plausible that natural language might help concepts generalize to new contexts that are well-suited to language, but still leave some unnatural contexts lurking around misclassified.
- 3) At first blush, this doesn't care about metaethics. It doesn't take into account what we'd like to like, or how we think about values, in order to change its own function. It's not a process that learns about humans and then outputs a decision procedure, it's a fixed decision procedure that extensively uses learning about humans. It's possible that this could actually be patched by putting our meta-ethical desires into the verbal "wish" - or even our desire for it to learn a model of meta-ethics. Each such kludge seems to increase the difficulty of the task, but I'm not sure if anything actually breaks (or if anything actually gets solved).
- 4) (3) wouldn't be a problem if this were meta-ethically great, but it's probably not. The most glaring example is that it aims for a fixed goal state at an approximately fixed level of abstraction, which may not be what we'd want on reflection. If we think of the goal state as living in some "slice" along the translation of features from language to microphysical states of the world, it's conceivable that we choose (or set some mysterious hand-tuned parameter) to put this slice at the wrong place (or maybe there is no right place). Using planning towards a goal also means that it's happy to violate deontological rules, and take you apart if it can use your atoms for something better.
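The Goodhart worry in (2) is easy to reproduce in miniature. The sketch below is a made-up toy (the "smiles" feature, the states, and the one-parameter regression are all assumptions for illustration): a predictor is fit on natural states where an observed feature tracks what we care about, and then used to choose among states that include one where that link has been broken.

```python
# Hypothetical toy world: each state is (smiles_observed, true_wellbeing).
# In naturally occurring states the two move together.
NATURAL_STATES = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

# States reachable once the system can intervene, including an "unnatural"
# one where the observed feature is decoupled from what we care about.
REACHABLE_STATES = NATURAL_STATES + [(10, 0)]


def fit_slope(data):
    """Least-squares slope (through the origin) predicting wellbeing from smiles."""
    numerator = sum(s * w for s, w in data)
    denominator = sum(s * s for s, w in data)
    return numerator / denominator


SLOPE = fit_slope(NATURAL_STATES)


def predicted_value(state):
    """What the model trained on prediction thinks the state is worth."""
    smiles, _ = state
    return SLOPE * smiles


def true_value(state):
    """What we actually care about, which the model never sees directly."""
    _, wellbeing = state
    return wellbeing


best_natural = max(NATURAL_STATES, key=predicted_value)
chosen = max(REACHABLE_STATES, key=predicted_value)
print("best natural state:", best_natural, "true value:", true_value(best_natural))
print("chosen under intervention:", chosen, "true value:", true_value(chosen))
```

The predictor is perfect on every state it was trained on, and the optimizer still walks straight to the one state where it is maximally wrong. The hope discussed above is that language-shaped concepts shrink the set of such states; the sketch only shows why a shrunken set is not the same as an empty one.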
5 - Wrap-up
Going back to the artificial intentional stance and the problem of alien concepts, it seems like this helps in some ways but not in others.
It seems to help with the intentional stance at the object level - the everyday work of translating "Charlie wants to go to the gym" into reasonable actions - but not at the meta-level. It's doubtful that it's modeling humans how they want to be modeled. Maybe this indicates that it would be profitable to break down this concept further. It also might spark your imagination about how to take the same information about humans and end up with something that models humans in a variable way.
A different thing is going on in the department of alien concepts, where we've run into the stress that Goodhart's law places on concepts. Instead of thinking about human-modeling, this makes me want to focus on the decision procedure and training. Can we find a decision procedure that leverages prediction and translation tasks in a way that puts less stress on the concepts? Can we use a training procedure that reduces the context shift when trying to use the model to choose actions?
Overall I think this avenue is pretty interesting to think about. Maybe this also serves as a concrete example of what I mean by trying to create the artificial intentional stance, which can be generalized from language to other options for learning about humans.