Communicating concepts in value learning

Manfred

Epistemic status: Trying to air out some thoughts for feedback, we'll see how successfully. May require some machine learning to make sense, and may require my level of ignorance to seem interesting.

Many current proposals for value learning are garden-variety regression (or its close cousin, classification). The agent doing the learning starts out with some model for what human values look like (a utility function over states of the world, or a reward function in a Markov decision process, or an expected utility function over possible actions), and receives training data that tells it the right thing to do in a lot of different situations. And so the agent finds the parameters of the model that minimize some loss function with the data, and Learns Human Values.

All these models of "the right thing to do" I mentioned are called parametric models, because they have some finite template that they update based on the data. Non-parametric models, on the other hand, have to keep a record of the data they've seen. Prediction with a non-parametric model often looks like taking some weighted average of nearby known examples, while a parametric model would (often) fit some curve to the data and predict using that. But we'll get back to this later.
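To make the distinction concrete, here is a minimal sketch, not taken from any particular proposal. The toy data, the function names, and the Gaussian weighting with its bandwidth are all my own assumptions for illustration. The parametric learner compresses the data into two numbers and can then throw the data away; the non-parametric one has to keep every example around to make a prediction.

```python
import math

# Hypothetical 1-D training data: inputs xs and targets ys (here y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys):
    """Parametric: least-squares fit that reduces the data to (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def predict_line(params, x):
    """Predict from the fitted parameters alone; the data is no longer needed."""
    slope, intercept = params
    return slope * x + intercept

def predict_kernel(xs, ys, x, bandwidth=1.0):
    """Non-parametric: weighted average of stored examples near x.

    The Gaussian weighting and the bandwidth are assumptions of this sketch.
    """
    weights = [math.exp(-((x - xi) / bandwidth) ** 2) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

params = fit_line(xs, ys)
print(predict_line(params, 2.5))    # uses only the two fitted numbers
print(predict_kernel(xs, ys, 2.5))  # uses every stored example
```

Note that the kernel predictor needs `xs` and `ys` at prediction time, which is exactly the "keep a record of the data" property.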

An obvious problem with current proposals is that it's very resource-intensive to communicate a category or concept to the agent. An AI might be able to automatically learn a lot about the world, but if we want to define its preferences, we have to somehow pick out the concept of "good stuff" within the representation of the world learned by the AI. Current proposals for this look like supervised learning, where huge amounts of labeled data are needed to specify "good stuff," and for many proposals I'm concerned that we'll actually end up specifying "stuff that humans can be convinced is good," which is not at all the same. Humans are much better learners than these supervised learning systems - they learn from fewer examples, and have a better grasp of the meaning and structure behind examples. This hints that there are some big improvements to be made in value learning.

This comparison to humans also leads to my vaguer concerns. It seems like the labeled examples are too crucial, and the unlabeled data not crucial enough. We want a value learner to understand concepts based on just a few examples so long as it has unlabeled data to fill in the gaps, and be able to learn more about morality from observation as a core competency, not as a pale shadow of its learning from labeled data. It seems like fine-tuning the model for the labeled data with stochastic gradient descent is missing something important.

To digress slightly, there are additional problems (e.g. corrigibility) once you build an agent that has an output channel instead of merely sponging up information, and these problems are harder if we want value learning from observation. If we want a value learning agent that could learn a simplified version of human morality, and then use that to learn the full version, we might need something like the Bayesian guarantee of Dewey 2011, or a functional analogue thereof.

One inspiration for alternative learning schemes might be clustering. As a toy example, imagine finding literal clusters in thing-space by k-means clustering. If you want to specify a cluster, you can do something like pick a small sample of examples and force them to be in the same cluster, and allow the number of clusters you try to find in the data to vary so that the statistics of the mandatory cluster are not very different from any other's. The huge problem here is that the idea of "thing-space" elides the difficulty of learning a representation of the world (or equivalently, elides how really, really complicated the cluster boundaries are in terms of observations).
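As a rough sketch of the "force a few examples into the same cluster" idea, under heavy assumptions: the 2-D points, the farthest-point initialization, and the function names are all mine, and the sketch fixes k instead of searching over the number of clusters as described above. It also takes "thing-space" as given, which is exactly the part the paragraph says is hard.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def seeded_kmeans(points, seed_idx, k, iters=20):
    """k-means in which the points at seed_idx are pinned to cluster 0."""
    seeds = set(seed_idx)
    # Cluster 0 starts at the mean of the hand-picked examples.
    centroids = [mean([points[i] for i in seed_idx])]
    # Remaining centroids: repeatedly take the point farthest from all
    # centroids chosen so far (a deterministic initialization, assumed here).
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            if i in seeds:
                labels[i] = 0  # hard constraint: seeds never leave cluster 0
            else:
                labels[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

def blob(cx, cy):
    """Nine points on a small grid around (cx, cy); toy data, not real features."""
    offsets = (-0.5, 0.0, 0.5)
    return [(cx + dx, cy + dy) for dx in offsets for dy in offsets]

points = blob(0, 0) + blob(5, 5) + blob(10, 0)
seed_idx = [9, 10, 11]  # three examples picked from the middle blob
labels, centroids = seeded_kmeans(points, seed_idx, k=3)
print(labels)
```

With only three pinned examples, the rest of the middle blob is pulled into cluster 0 by the unlabeled structure of the data, which is the behavior the toy example is after.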

Because learning how to understand the world already requires you to be really good at learning things, it's not obvious to me what identifying and using clusters in the data will entail. One might imagine that if we modeled the world using a big pile of autoencoders, this pile would already contain predictors for many concepts we might want to specify, but that if we use examples to try and communicate a concept that was not already learned, the pile might not even contain the features that make our concept easy to specify. Further speculation in this vein is fun, but is likely pointless at my current level of understanding. So even though learning well from unlabeled data is an important desideratum, I'm including this digression on clustering because I think it's interesting, not because I've shed much light.

Okay, returning to the parametric/non-parametric thing. The problem of being bad at learning from unlabeled data shows up in diverse proposals like inverse reinforcement learning and Hibbard 2012's two-part example. And in these cases it's not due to the learning algorithm per se, but for the simple reason that at some point the representation of the world is treated as fixed - the value learner is assumed to understand the world, and then proceeds to learn or be told human values in terms of that understanding. If you can no longer update your understanding of the world, naturally this causes problems with learning from observation.

We should instead design agents that are able to keep learning about the world. And this brings us back to the idea of communicating concepts via examples. The most reasonable way to update learned concepts in light of new information seems to be to just store the examples and re-apply them to the new understanding. This would be a non-parametric model of learned concepts.
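Here is a minimal sketch of what "store the examples and re-apply them" could look like. Everything in it is hypothetical: the `ExampleConcept` class, the two encoder functions standing in for an old and an improved understanding of the world, and the choice of a 1-nearest-neighbour rule. The point is only that the concept is defined by raw examples, so swapping in a better representation changes the concept's boundary without anyone re-labeling anything.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

class ExampleConcept:
    """A concept stored as labeled raw observations, not as fitted parameters."""

    def __init__(self):
        self.examples = []  # list of (raw_observation, is_member)

    def add(self, raw, is_member):
        self.examples.append((raw, is_member))

    def contains(self, raw, encode):
        """1-nearest-neighbour decision in whatever representation `encode` gives."""
        query = encode(raw)
        _, is_member = min(
            self.examples, key=lambda ex: sq_dist(encode(ex[0]), query)
        )
        return is_member

# Two hypothetical world models. The old one only perceives the first
# component of an observation; the improved one perceives both.
def old_encoder(raw):
    return (raw[0],)

def new_encoder(raw):
    return (raw[0], raw[1])

concept = ExampleConcept()
concept.add((1.0, 0.0), True)
concept.add((1.2, 0.1), True)
concept.add((0.0, 0.0), False)
concept.add((0.1, 5.0), False)

query = (1.1, 5.0)
print(concept.contains(query, old_encoder))
print(concept.contains(query, new_encoder))
```

Under the old encoder the query sits right between the two positive examples and is accepted; under the new encoder its second component places it next to a negative example and it is rejected. The stored examples never changed.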

What concepts to learn and how to use them to make decisions is not at all known to me, but as a placeholder we might consider the task of learning to identify "good actions," given proposed actions and some input about the world (similar to the "Learning from examples" section of Christiano's Approval Directed Agents).

Humans are much better learners than these supervised learning systems - they learn from fewer examples, and have a better grasp of the meaning and structure behind examples. This hints that there are some big improvements to be made in value learning.

Josh Tenenbaum's work is relevant in figuring out how to achieve this. E.g. this.

I had already skimmed this paper when I went through your review article looking for interesting papers, but it was worth a re-read, thanks. I'll follow up some references later. I like how it just completely unironically brings up the Chinese restaurant process and the Indian buffet process. I think the examples are fairly easy, in the sense of low numbers of features and quite simple desired models, and I'd be interested to know what the limitations are that lead to this fact.

I fear this misses an important reason why new work is needed on concept learning for superintelligent agents: straightforward clustering is not necessarily a good tool for concept learning when the space of possible actions is very large, and the examples and counterexamples cannot cover most of it.

To take a toy example from this post, imagine that we have built an AI with superhuman engineering ability, and we would like to set it the task of making us a burrito. We first present the AI with millions of acceptable burritos, along with millions of unacceptable burritos and objects that are not burritos at all. We then ask it to build us things that are more like the positive examples than like the negative examples.

I claim that this is likely to fail disastrously if it evaluates likeness by straightforward clustering in the space of observables it can scan about the examples. All our examples and counterexamples lie on the submanifold of "things we (and previous natural processes) are able to build", which has high codimension in the manifold of "things the AI is able to build".

A burrito with a tiny self-replicator nanobot inside, for instance, would cluster closer to all of the positive examples than to all of the negative examples, since there are no tiny self-replicating nanobots in any of the examples or counterexamples, and in all other respects it matches the examples better. (Or a toxic molecule that has never before occurred in nature or been built by humans, etc.)
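A toy numerical version of this point, with made-up features and a nearest-centroid rule standing in for "straightforward clustering" (both are assumptions of the sketch, not a claim about any real system). Because no example or counterexample has a nanobot, the nanobot dimension adds the same amount to the distance from both centroids, so it cannot change which side the candidate falls on.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

# Hypothetical features: (has_tortilla, has_filling, has_nanobot).
# The last feature is zero in every example, positive or negative.
positives = [(1.0, 1.0, 0.0), (1.0, 0.9, 0.0)]
negatives = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

pos_c = centroid(positives)
neg_c = centroid(negatives)

def looks_like_burrito(x):
    """Nearest-centroid rule: closer to the positives than to the negatives."""
    return sq_dist(x, pos_c) < sq_dist(x, neg_c)

def margin(x):
    """How much closer x is to the positive centroid than to the negative one."""
    return sq_dist(x, neg_c) - sq_dist(x, pos_c)

clean = (1.0, 1.0, 0.0)
nanobot = (1.0, 1.0, 1.0)

print(looks_like_burrito(clean), margin(clean))
print(looks_like_burrito(nanobot), margin(nanobot))
```

Both candidates are accepted with the same margin, up to floating-point rounding: the learner has no examples from which to infer that the third feature matters.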

The sense in which those would be poor attempts to learn the concept is simply not captured by straightforward clustering. It's not enough to say that we should try non-parametric models; we would need to think about how a non-parametric model might do this well. (Here's an example of a parametric learner which tries to confront this problem.)

A key part of the idea (which, again, I think has some fatal flaws) was that concepts are clusters within some representation of the world, which is learned unsupervised, and is in some sense good at predicting the world. One way to think of this representation is as a set of features whose activity levels parsimoniously describe the data about each example. This requires that a disproportionate fraction of the space of feature activations maps close to the manifold that the examples lie on in the space of raw data.

Of course, you have to choose which features to cluster over, which requires some Bayesian tradeoff between getting a tight fit to the examples (high likelihood) and simplicity of the features (high prior) (clearly I just finished Kaj's linked paper). But overall I think that unsupervised feature learning is tackling almost exactly the problem you pointed out.

In practice, there might be some problems. A potent toxin or a self-replicating nanobot is bad because it harms whoever eats it, but would even a superintelligence learn a feature to detect safety to humans if all it saw of the universe was one million high-resolution scans of burritos? Well, maybe. But I'd trust it more if it also got to observe the context and consequences of burrito-consumption.

Anyhow, I agree with you that "be non-parametric!" is not necessarily helpful advice for producing safe burritos. The claim I put forward in the last paragraphs is that if you represent the agent's goals non-parametrically in terms of examples, in the most obvious way, we seem to avoid some problems with improving the agent's ontology.

This all sounds reasonable, but I'm not entirely clear what you are driving at.

I'd like to pick a tangent:

One might imagine that if we modeled the world using a big pile of autoencoders, this pile would already contain predictors for many concepts we might want to specify, but that if we use examples to try and communicate a concept that was not already learned, the pile might not even contain the features that make our concept easy to specify.

This reminds me of a recent discussion around whether future AIs might be able to communicate better than humans because they could exchange the meaning of intermediate layers in their deep learning architectures, whereas we can communicate only the terminal symbols. This circled in my mind when I saw a children's picture book (one where simple, clear pictures allow parents to name objects) and I thought: we can not only name terminal symbols in our 'deep learning architecture', we can also name a lot of intermediate 'facets' of 'objects'. I don't mean sub-objects like 'leg' or 'surface'. Call them properties like 'yellow', 'round', 'smooth', or even vaguer features like 'beautiful'. I think that we are basically able to name all those intermediate features that can be communicated at all. Sometimes there is no word, either because there is usually no need for one or, in a few cases, because the experience is rare and hard to share, like certain trance states. But even in these cases we could imagine that it is possible in principle to communicate the aspects/facets of our perception.