Epistemic status: Trying to air out some thoughts for feedback, we'll see how successfully. May require some machine learning to make sense, and may require my level of ignorance to seem interesting.
Many current proposals for value learning are garden-variety regression (or its close cousin, classification). The agent doing the learning starts out with some model for what human values look like (a utility function over states of the world, or a reward function in a Markov decision process, or an expected utility function over possible actions), and receives training data that tells it the right thing to do in a lot of different situations. And so the agent finds the parameters of the model that minimize some loss function with the data, and Learns Human Values.
All these models of "the right thing to do" I mentioned are called parametric models, because they have some finite template that they update based on the data. Non-parametric models, on the other hand, have to keep a record of the data they've seen - prediction with a non-parametric model often looks like taking some weighted average of nearby known examples (though not always), while a parametric model would (often) fit some curve to the data and predict using that. But we'll get back to this later.
An obvious problem with current proposals is that it's very resource-intensive to communicate a category or concept to the agent. An AI might be able to automatically learn a lot about the world, but if we want to define its preferences, we have to somehow pick out the concept of "good stuff" within the representation of the world learned by the AI. Current proposals for this look like supervised learning, where huge amounts of labeled data are needed to specify "good stuff," and for many proposals I'm concerned that we'll actually end up specifying "stuff that humans can be convinced is good," which is not at all the same. Humans are much better learners than these supervised learning systems - they learn from fewer examples, and have a better grasp of the meaning and structure behind examples. This hints that there are some big improvements to be made in value learning.
This comparison to humans also leads to my vaguer concerns. It seems like the labeled examples are too crucial, and the unlabeled data not crucial enough. We want a value learner to understand concepts based on just a few examples so long as it has unlabeled data to fill in the gaps, and be able to learn more about morality from observation as a core competency, not as a pale shadow of its learning from labeled data. It seems like fine-tuning the model for the labeled data with stochastic gradient descent is missing something important.
To digress slightly, there are additional problems (e.g. corrigibility) once you build an agent that has an output channel instead of merely sponging up information, and these problems are harder if we want value learning from observation. If we want a value learning agent that could learn a simplified version of human morality, and then use that to learn the full version, we might need something like the Bayesian guarantee of Dewey 2011, or a functional analogue thereof.
One inspiration for alternative learning schemes might be clustering. As a toy example, imagine finding literal clusters in thing-space by k-means clustering. If you want to specify a cluster, you can do something like pick a small sample of examples and force them to be in the same cluster, and allow the number of clusters you try to find in the data to vary so that the statistics of the mandatory cluster are not very different from any other's. The huge problem here is that the idea of "thing-space" elides the difficulty of learning a representation of the world (or equivalently, elides how really, really complicated the cluster boundaries are in terms of observations).
Because learning how to understand the world already requires you to be really good at learning things, it's not obvious to me what identifying and using clusters in the data will entail. One might imagine that if we modeled the world using a big pile of autoencoders, this pile would already contain predictors for many concepts we might want to specify, but that if we use examples to try and communicate a concept that was not already learned, the pile might not even contain the features that make our concept easy to specify. Further speculation in this vein is fun, but is likely pointless at my current level of understanding. So even though learning well from unlabeled data is an important desideratum, I'm including this digression on clustering because I think it's interesting, not because I've shed much light.
Okay, returning to the parametric/non-parametric thing. The problem of being bad at learning from unlabeled data shows up in diverse proposals like inverse reinforcement learning and Hibbard 2012's two-part example. And in these cases it's not due to the learning algorithm per se, but for the simple reason that at some point the representation of the world is treated as fixed - the value learner is assumed to understand the world, and then proceeds to learn or be told human values in terms of that understanding. If you can no longer update your understanding of the world, naturally this causes problems with learning from observation.
We should instead design agents that are able to keep learning about the world. And this brings us back to the idea of communicating concepts via examples. The most reasonable way to update learned concepts in light of new information seems to be to just store the examples and re-apply them to the new understanding. This would be a non-parametric model of learned concepts.
What concepts to learn and how to use them to make decisions is not at all known to me, but as a placeholder we might consider the task of learning to identify "good actions," given proposed actions and some input about the world (similar to the "Learning from examples" section of Christiano's Approval Directed Agents).