We can’t write down our precise values any more than we can write down the algorithm we use for judging whether an image contains a cat. If we want an AI to abide by human values, it's going to have to acquire them without us writing them down. We usually think of this in terms of a process of value learning.
The easiest kind of value learning involves starting with some pre-written model of humans - this part for the beliefs, and this part for the values, and so on - and tuning its internal parameters until it does a good job on a corpus of training data. The problem is that we want this human model to have lots of nice properties, each of which makes it harder to find a model that will satisfy us.
There's a tension between models that have clear values, and models that are psychologically realistic. The ideal value-haver is homo economicus, the sterile decision-theoretic agent. Consider what such a model must make of a training dataset that includes humans buying lottery tickets, and not wearing seatbelts, and being sold products by modern ad campaigns. The model has no leeway. It must assume that humans are behaving optimally, and therefore that there is some intrinsic value in lottery-tickets and seatbelt-free driving that should be preserved into the far future. As for the survival of humanity as a whole - well, if humans aren't taking the optimal action to ensure it, it must not matter all that much.
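To make the lottery example concrete, here is a small sketch of the inference a perfectly rational model is forced into. The ticket price, jackpot, and odds are made-up numbers chosen for easy arithmetic, and treating the buyer as risk-neutral in money is an extra simplifying assumption, not something the argument depends on.

```python
from fractions import Fraction

# Hypothetical lottery; every number here is invented for illustration.
ticket_price = Fraction(2)
jackpot = Fraction(1_000_000)
win_probability = Fraction(1, 2_000_000)

# Expected money won per ticket, and the expected net change in wealth.
expected_payout = win_probability * jackpot
monetary_surplus = expected_payout - ticket_price

# A model that assumes optimal behavior cannot call the purchase a mistake.
# The only way to explain it is to credit the ticket itself with enough
# intrinsic value to cover the expected monetary loss.
implied_intrinsic_value = max(Fraction(0), -monetary_surplus)

print("expected payout:", expected_payout)
print("implied intrinsic value of owning a ticket:", implied_intrinsic_value)
```

The model has nowhere else to put the discrepancy, so it ends up in the learned values.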
The homo economicus model is too psychologically unrealistic to learn what we mean by human values. But if you allow that humans might be lazy, or biased, or incapable of getting a handle on the consequences of their actions, then you're adding more and more degrees of freedom to your model. The more you allow for human action to not reflect their modeled values, the more underdetermined the modeled values are.
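Here is a minimal sketch of that underdetermination, using a softmax ("noisily rational") choice model as a stand-in for a human model with one extra degree of freedom. The softmax form, the `rationality` parameter, and the utility numbers are all assumptions made for illustration.

```python
import math

def choice_probs(utilities, rationality):
    """Probability of picking each option under a softmax choice model.

    rationality = 0 means choosing at random; large values approach the
    perfectly optimizing agent.
    """
    scaled = [rationality * u for u in utilities]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Two different stories about the same person (hypothetical numbers):
# strong preferences acted on sloppily, or weak preferences acted on carefully.
strong_values = [0.0, 2.0, 4.0]
weak_values = [0.0, 1.0, 2.0]

probs_sloppy = choice_probs(strong_values, rationality=0.5)
probs_careful = choice_probs(weak_values, rationality=1.0)

# Both stories predict exactly the same behavior.
same_behavior = all(
    math.isclose(p, q) for p, q in zip(probs_sloppy, probs_careful)
)
print(probs_sloppy, probs_careful, same_behavior)
```

No amount of behavioral data can distinguish the two stories, because only the product of rationality and utility ever shows up in the predictions. Richer models of bias open up more trade-offs of this kind.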
One of the various guarantees that people try to extract from value learning schemes is that if humans really did work according to your model, your value learning scheme would eventually make your model of the human converge to the human. With even fairly tame models of human bias, you quickly lose this sort of guarantee as the model becomes rich enough to learn unintended answers.
Let's change gears and talk about neural networks. It's not too big of a topic switch, though, because neural networks are a family of models that are often given way more free parameters than are necessary to solve the problem. This shows up as the problem of overfitting - if you do a better and better job of making the model correct on the training set, it actually does a worse job of generalizing, like a student who copies someone else's answers rather than learning the material.
The interesting part is not so much that overfitting exists, it's that there's anything other than overfitting. As neural networks get trained, their ability to generalize becomes very good (as you might notice if you've been paying attention to their results over the last decade) before it turns around and gets worse due to overfitting. With proper training procedures you can stop training while the model is at its peak of generalization, at the low cost of setting aside part of your data as a held-out validation set. Again, this is all despite solving an underdetermined problem.
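The stopping procedure can be sketched in a few lines. This is a toy illustration, not a neural network: the degree-9 polynomial model, the synthetic sine data, the learning rate, and the `patience` threshold are all assumptions picked so the example runs with only the standard library.

```python
import math
import random

random.seed(0)

def features(x, degree=9):
    return [x ** k for k in range(degree + 1)]

def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x)))

def mse(weights, data):
    return sum((predict(weights, x) - y) ** 2 for x, y in data) / len(data)

def make_data(n):
    # Noisy samples of a smooth function (synthetic, for illustration only).
    points = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        points.append((x, math.sin(3.0 * x) + random.gauss(0.0, 0.3)))
    return points

train, held_out = make_data(12), make_data(12)

weights = [0.0] * 10
best_weights, best_val = list(weights), mse(weights, held_out)
val_history = [best_val]
patience, since_best, lr = 200, 0, 0.05

for step in range(5000):
    # One gradient-descent step on the training loss only.
    grad = [0.0] * len(weights)
    for x, y in train:
        err = predict(weights, x) - y
        for k, f in enumerate(features(x)):
            grad[k] += 2.0 * err * f / len(train)
    weights = [w - lr * g for w, g in zip(weights, grad)]

    # The held-out data never influences the gradient; it only decides
    # which snapshot of the weights we keep and when we give up.
    val = mse(weights, held_out)
    val_history.append(val)
    if val < best_val:
        best_weights, best_val, since_best = list(weights), val, 0
    else:
        since_best += 1
        if since_best >= patience:
            break

print("steps run:", len(val_history) - 1, "best held-out loss:", best_val)
```

The thing you keep at the end is `best_weights`, the snapshot with the lowest held-out loss, rather than wherever training happened to stop.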
There are also modifications to the training procedure, broadly called regularization, which trade away pure pursuit of correctness on the training data to try to nudge the model towards better generalization properties. Regularization often works by imposing a cost function that reduces the effective dimensionality of the model, which makes sense from an underdetermination = overfitting perspective, but it's not just analogous to decreasing the number of nodes in a neural net; a regularized large network can do better after training than any non-regularized smaller network.
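As a minimal sketch of what "imposing a cost function" means, here is the L2 penalty added to an ordinary squared-error loss. The tiny linear model, the hand-written data with two nearly redundant features, and the value of `lam` are assumptions for illustration; a real network would apply the same penalty term to far more parameters.

```python
import math

def penalized_loss(weights, data, lam):
    """Mean squared error plus an L2 penalty on the weights."""
    fit = sum(
        (sum(w * f for w, f in zip(weights, x)) - y) ** 2 for x, y in data
    ) / len(data)
    penalty = lam * sum(w * w for w in weights)
    return fit + penalty

def train(data, lam, lr=0.01, steps=2000):
    """Plain gradient descent on penalized_loss, starting from zero."""
    weights = [0.0] * len(data[0][0])
    for _ in range(steps):
        # Gradient of the penalty term: it always pulls weights toward zero.
        grad = [2.0 * lam * w for w in weights]
        # Gradient of the data-fit term.
        for x, y in data:
            err = sum(w * f for w, f in zip(weights, x)) - y
            for k, f in enumerate(x):
                grad[k] += 2.0 * err * f / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

def norm(weights):
    return math.sqrt(sum(w * w for w in weights))

# Hypothetical data: the two input features are almost copies of each other,
# so the data barely constrains how the weight is split between them.
data = [((1.0, 1.01), 1.1), ((2.0, 1.98), 1.9), ((3.0, 3.02), 3.1)]

plain = train(data, lam=0.0)
regularized = train(data, lam=1.0)

print("no penalty:  ", plain, norm(plain))
print("with penalty:", regularized, norm(regularized))
```

The penalty gives up a little accuracy on the training data in exchange for a preference among the many weight settings that fit it about equally well.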
If you're keeping track of the analogy to value learning at home, these ideas are like learning human values by starting with a big, complicated model and then training in a way that stops before you overfit, or uses some kind of cost function to push the model into the part of the solution space you want.
Sometimes you don't have to directly optimize for the information you want. This is like the easy value learning scheme from part 1, where you optimize a human model but only care about the part labeled "values." It's also like word2vec, where the AI learns to predict a word from its neighbors, but you only care about the vector-space representation of words it developed along the way.
But rather than word2vec, a more interesting (not to mention topical) analogy might be to GPT-2. GPT-2 can answer homework questions. Even though it's only been trained to predict the next word, if you prompt it with "Q: What year was the Magna Carta signed? A: ", the most likely continuation also happens to be the answer to your question. If you train a good model of human values as a byproduct of something else, maybe you can extract it by looking at input-output relationships rather than knowing which specific subset of neurons is in charge of modeling human values.
The root problem here is that you're not just trying to model humans in a way that makes good predictions. You're not even trying to model humans in a simple way that makes good predictions. You're trying to model humans like humans model other humans: the intentional stance, in which "beliefs," "desires," etc. sometimes show up as basic building blocks.
Even if I don't think that typical regularization and avoidance of overfitting will solve the problem of learning human values, I think it would be interesting to experiment with. Maybe there is some sense in which the intentional stance is the "obvious" way of modeling humans, and regularization can encourage our model to do the "obvious" thing. But human definitions are fuzzy and messy, so there's no chance the L2 norm and dropout are all we need to learn human values.
By the analogy to regularization, I mostly mean that you can apply a cost function in training to get your model to have some nice property beyond pure accuracy on the training set. Any cost function designed to encourage the artificial intentional stance is going to be a lot more complicated than the L2 norm. This raises the question of where you're going to get such a cost function. And if it's so complicated that you have to get it via machine learning, how do you ground the recursion?
I used to have this cached thought that if we just found the "right" human model, we could train it for predictive accuracy and it would automatically learn human values. But I've started leaning more and more towards the idea that no such right model exists - that all models that are expressive enough to learn human values are also expressive enough to predict humans without doing it like humans do. If we want the artificial intentional stance, we might have to train the AI in a way that explicitly acknowledges and uses the fact that we want it to think of humans like humans think of humans.