Current beliefs about how human value works: various thoughts and actions can produce a "reward" signal in the brain. I also have lots of predictive circuits that fire when they anticipate a "reward" signal is coming as a result of what just happened. The predictive circuits have been trained to use the patterns of my environment to predict when the "reward" signal is coming.
Getting an "actual reward" and a predictive circuit firing will both be experienced as something "good". Because of this, predictive circuits can track not only "actual reward" but also the activation of other predictive circuits. (So far this is basically "there are terminal and instrumental values, and they are experienced as roughly the same thing.")
The predictive circuits are all doing some "learning process" to keep their firing correlated to what they're tracking. However, the "quality" of this learning can vary drastically. Some circuits are more "hardwired" than others, and less able to update when they begin to become uncorrelated from what they are tracking. Some are caught in interesting feedback loops with other circuits, such that you have to update multiple circuits simultaneously, or in a particular order.
Though everything that feels "good" feels good because at some point or another it was tracking the base "reward" signal, it won't always be a good idea to think of the "reward" signal as the thing you value.
Say you have a circuit that tracks a proxy of your base "reward". If something happens in your brain such that this circuit ceases to update, you basically value this proxy terminally.
Said another way, I don't have a nice clean ontological line between terminal values and instrumental values. The less malleable a predictive circuit, the more "terminal" the value it represents.
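A toy sketch of the "circuit that stops updating" case, under heavy assumptions: a single predictive circuit is modeled as one number updated by a simple delta rule, and the environment changes halfway through so that the proxy stops being followed by reward. The function name `final_prediction`, the learning rates, and the step counts are all made up for illustration; nothing here is a claim about how brains actually implement this.

```python
def final_prediction(lr_after_shift, steps=2000, lr_before_shift=0.1):
    """Return the circuit's prediction after the environment has shifted.

    Hypothetical toy model: `value` is how strongly the circuit fires
    when the proxy shows up, i.e. its estimate of the reward to come.
    """
    value = 0.0
    for t in range(steps):
        shifted = t >= steps // 2
        # First half: the proxy is always followed by reward.
        # Second half: the proxy is never followed by reward.
        reward = 0.0 if shifted else 1.0
        lr = lr_after_shift if shifted else lr_before_shift
        # Delta rule: move the prediction toward the observed reward.
        value += lr * (reward - value)
    return value


plastic = final_prediction(lr_after_shift=0.1)  # circuit keeps learning
frozen = final_prediction(lr_after_shift=0.0)   # circuit has stopped updating

print(f"plastic circuit still fires at: {plastic:.3f}")
print(f"frozen circuit still fires at:  {frozen:.3f}")
```

The plastic circuit's prediction decays toward zero once the proxy stops paying off, while the frozen circuit keeps firing at nearly full strength. In this toy picture the frozen circuit's proxy behaves like a terminal value, and intermediate learning rates give values that sit somewhere between "instrumental" and "terminal".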
FAI worry: A human-in-the-loop AI that only takes actions that get human approval (and whose expected outcomes have human approval) hits big problems when the context the AI is acting in is very different from the context in which our values were trained.
Is there any way around this besides simulating people having their values re-organized given the new environment? Is this what CEV is about?
In light of reading through Raemon's shortform feed, I'm making my own. Here will be smaller ideas that are on my mind.