I am extremely sympathetic to the program of AI safety by understanding value learning. Because of that sympathy, Stuart Armstrong's post along those same lines prompted more thoughts from me than the average post would.
Stuart's post mostly deals with "partial preferences," which are like simple statements of binary preference (A is better than B), but associated with a context - supposedly the "human's model" the human was using when they exhibited or stated that preference. Then the post says that you should sort these partial preferences according to meta-levels and aggregate them from the top down, updating your procedure after you finish each meta-level, eventually producing a utility function over world-histories.
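To make the shape of that procedure concrete, here is a toy Python sketch. Everything in it is an assumption made for illustration, not Stuart's actual proposal: the `PartialPreference` fields, the `aggregate` function, and especially the rule that a meta-level preference between two context labels rescales how much the preferences stated in those contexts count. It also scores a handful of named outcomes rather than full world-histories.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialPreference:
    """Hypothetical record: 'better' beats 'worse' within some context."""
    better: str
    worse: str
    context: str        # label standing in for the human's model
    meta_level: int     # 0 = about outcomes, 1 = about contexts, ...
    weight: float = 1.0


def aggregate(prefs):
    """Aggregate from the highest meta-level down (toy rule, assumed).

    At each level, every item gains weight when preferred and loses it
    when dispreferred. The scores from one level are then reused as
    trust multipliers for same-named contexts one level below.
    """
    by_level = defaultdict(list)
    for p in prefs:
        by_level[p.meta_level].append(p)

    trust = defaultdict(lambda: 1.0)   # unknown contexts count fully
    scores = {}
    for level in sorted(by_level, reverse=True):
        scores = defaultdict(float)
        for p in by_level[level]:
            w = p.weight * trust[p.context]
            scores[p.better] += w
            scores[p.worse] -= w
        # "Update the procedure": clamp at zero so a distrusted context
        # is ignored rather than inverted.
        trust = defaultdict(
            lambda: 1.0,
            {item: max(0.0, 1.0 + s) for item, s in scores.items()},
        )
    return dict(scores)


prefs = [
    # Meta-level: the human endorses reflective judgments over impulsive ones.
    PartialPreference("reflective", "impulsive", context="meta", meta_level=1),
    # Object-level: the two contexts disagree about dessert.
    PartialPreference("salad", "cake", context="reflective", meta_level=0),
    PartialPreference("cake", "salad", context="impulsive", meta_level=0),
]
utility = aggregate(prefs)
```

In this toy run the impulsive context ends up with zero trust, so `utility` ranks `salad` above `cake`. Note how much work the hand-written blending rule is doing here; that rule is exactly the kind of human-designed moving part the rest of this post is worried about.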
Broadly, I'd say that my opinion is sort of like the bitter lesson. The bitter lesson in, say, image recognition, is that people wanted to do image recognition with a bunch of human-designed features and formal reasoning and human-understandable internal moving parts, and they tried that for a long time, and what worked was using way bigger models, way more computing power, far fewer human-understandable internal parts, and almost no human-designed features.
I like Stuart's outline more than most value learning proposals. But it still strikes me as primarily a list of human-designed features and human-understandable internal moving parts. We might be better off throwing away some of the details and abstracting in a way that allows for some of these problems to be solved by big models and computing power.
It's like the just-so story about ResNets, which is that they're a fix to humans thinking the insides of neural nets should look too much like human logic[^1]. I think speculating about the human-sized logical relationships between speculative parts inside the AI is easier but less useful than speculating about the algorithm that will connect your inputs to your outputs with a big model and lots of computing power, which may or may not have your logical steps as emergent features.
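For readers who haven't seen one, the defining feature of a ResNet is the skip connection: each block outputs its input plus a learned correction. A minimal pure-Python sketch follows, with hypothetical names (`layer`, `residual_block`) and a tanh nonlinearity chosen only for brevity; real ResNets use convolutions, normalization, and ReLU.

```python
import math


def layer(x, weights, bias):
    """Plain dense layer: the output is a whole new representation."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def residual_block(x, weights, bias):
    """Residual block: the output is the input plus a learned nudge."""
    correction = layer(x, weights, bias)
    return [xi + ci for xi, ci in zip(x, correction)]


x = [1.0, -2.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]

plain_out = layer(x, zero_w, zero_b)              # input is wiped out
resid_out = residual_block(x, zero_w, zero_b)     # input passes through
```

With all-zero parameters the plain layer discards its input, while the residual block is exactly the identity. That default of "change the representation a little" is what makes the gradual-massaging picture a natural fit.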
II: A long analogy about dams
If you want to design a dam, you don't draw the blueprint of the dam first and figure out what materials it should be made of later - first you learn a lot about hydrology and materials science so you know how steel and concrete and earth and water interact, then you draw the high-level design, then you fill in the details that weren't dictated either by physics or by your design goals. I'm claiming that we don't yet know much about the steel and water of value learning.
Here's a long digression as an example. Suppose you're trying to work out how to model human values the way humans do, even given lots of computing power and data. If you want to locate values within a model of humans, you can't just train the model for predictive power, because human values only appear in a narrow zone of abstraction, more abstract than biology and less abstract than population statistics, and an AI scored only on prediction will be pressured to go to a lower level of abstraction.
If you train an AI on a shared input of sensory data and a text channel from humans, will it learn a shared model of the world and the text channel that effectively solves the symbol grounding problem? Can you then activate desired concepts through the text channel, "cheating" a solution to lots of value learning problems?
No. Consider what happens in the limit of lots of resources, particularly if we are training this model for predictive power - it will be pressured towards a lower level of abstraction. Once it starts encoding the world differently than we do, it won't have the generalization properties we want - we'd be caught cheating, as it were. And if we could solve the training problem for verbal models, it seems like we could just solve the training problem to learn the concepts we want to learn. But maybe there's still some way to "cheat" in practice.
Another way to think of this problem is as making an "artificial intentional stance." But we have to remember that the intentional stance is not just a single model (and definitely not the assumption that humans are like Homo economicus). It's a family of strategies used to learn about humans, model humans, and model interacting with humans. Stances aren't just an assumption about how to model one thing within a fixed model of the world; they're part of complete languages for talking about the world.
I want to know how to design an AI that not only develops approximate ways of understanding the world, but matches some of those ways of understanding to what it sees humans use. But even to do this, we don't really know how to talk in a principled way about what it is that it's supposed to be matching. So we've got to think about that.
This is an example of the sort of consideration that I think is urgent and interesting - and you can't always leave it as a detail to be filled in later, because depending on the base materials, the best design might be quite different.
III: Miscellaneous specific comments
Now some more specific comments about the proposal.
- How many of the hidden details are in eliciting partial preferences? I've sort of been implying that it's a lot. Does it require a general artificial intentional stance to extract not just binary preferences but also the model the human is using to express those preferences?
- How many of the hidden details are in doing meta-reasoning? If I don't trust an AI, more steps of meta-reasoning make me trust it even less - humans often say things about meta-reasoning that would be disastrous if implemented. What kind of amazing faculties would be required for an AI to extract partial preferences about meta-reasoning that actually made things better rather than worse? If I were better at understanding what the details actually are, maybe I'd pick on meta-reasoning more.
I do agree that the meta-reasoning step is necessary for this scheme, but I think that's because this scheme doesn't involve the AI building an explicit model of humans to provide consistency - it's repeatedly outsourcing the modeling job to amnesiac single-shot modules. If humans were reliable sources about meta-reasoning principles for combining binary preferences, this would work great, but since they aren't, it won't - a low-level practical concern dictating higher-level design.
- The job of the "symbol grounding module" seems to be to take the partial preferences, expressed inside their contextual models, and translate them into full preferences in the AI's native ontology. This seems like it requires the AI to have a really trustworthy grasp on the intentional stance and its variations - maybe I should imagine this as coming from the same process that originates those contextual models for partial preferences in the first place. This is a bit different than the symbol grounding I normally think about (grounding of internal symbols by their causal relationship to reality), but I agree it's an important part of the artificial intentional stance.
[^1]: The story goes something like this: When people first thought of neural networks, they thought of them as if each neuron were a logical node making a human-sized step in reasoning. And so they optimized the initialization of weights and the nonlinearity for each individual neuron functioning like a discriminator. But after many years of work, people realized that the "neurons are doing human-sized logical steps" model wasn't the best, and that a better picture is that the neural network is massaging the input manifold around in a higher-dimensional space until eventually the input space gets transformed into something that's easy to classify. And so people developed ResNets that were specialized for this gradual massaging of the input into the output, and they worked great.