Dewey 2011 lays out the rules for one kind of agent with a mutable value system. The agent has some distribution over utility functions, which it has rules for updating based on its interaction history (where "interaction history" means the agent's observations and actions since its origin). To choose an action, it looks through every possible future interaction history, and picks the action that leads to the highest expected utility, weighted both by the possibility of making that future happen and the utility function distribution that would hold if that future came to pass.
We might motivate this sort of update strategy by considering a sandwich-drone bringing you a sandwich. The drone can either go to your workplace, or go to your home. If we think about this drone as a value-learner, then the "correct utility function" depends on whether you're at work or at home - upon learning your location, the drone should update its utility function so that it wants to go to that place. (Value learning is unnecessarily indirect in this case, but that's because it's a simple example.)
Suppose the drone begins its delivery assigning equal measure to the home-utility-function and to the work-utility-function (i.e. ignorant of your location), and can learn your location for a small cost. If the drone evaluated this idea with its current utility function, it wouldn't see any benefit, even though it would in fact deliver the sandwich properly - because under its current utility function there's no point to going to one place rather than the other. To get sensible behavior, and properly deliver your sandwich, the drone must evaluate actions based on what utility function it will have in the future, after the action happens.
If you're familiar with how wireheading or quantum suicide look in terms of decision theory, this method of deciding based on future utility functions might seem risky. Fortunately, value learning doesn't permit wireheading in the traditional sense, because the updates to the utility function are an abstract process, not a physical one. The agent's probability distribution over utility functions, which is conditional on interaction histories, defines which actions and observations are allowed to change the utility function during the process of predicting expected utility.
Dewey also mentions that so long as the probability distribution over utility functions is well-behaved, you cannot deliberately take action to raise the probability of one of the utility functions being true. But I think this is only useful to safety when we understand and trust the overarching utility function that gets evaluated at the future time horizon. If instead we start at the present, and specify a starting utility function and rules for updating it based on observations, this complex system can evolve in surprising directions, including some wireheading-esque behavior.
The formalism of Dewey 2011 is, at bottom, extremely simple. I'm going to be a bad pedagogue here: I think this might only make sense if you go look at equations 2 and 3 in the paper, and figure out what all the terms do, and see how similar they are. The cheap summary is that if your utility is a function of the interaction history, trying to change utility functions based on interaction history just gives you back a utility function. If we try to think about what sort of process to use to change an agent's utility function, this formalism provides only one tool: look out to some future time horizon, and define an effective utility function in terms of what utility functions are possible at that future time horizon. This is different from the approximations or local utility functions we would like in practice.
If we take this scheme and try to approximate it, for example by only looking N steps into the future, we run into problems; the agent will want to self-modify so that next timestep it only looks ahead N-1 steps, and then N-2 steps, and so on. Or more generally, many simple approximation schemes are "sticky" - from inside the approximation, an approximation that changes over time looks like undesirable value drift.
Common sense says this sort of self-sabotage should be eliminable. One should be able to really care about the underlying utility function, not just its approximation. However, this problem tends to crop up, for example whenever the part of the future you look at does not depend on which action you are considering; modifying to keep looking at the same part of the future unsurprisingly improve the results you get in that part of the future. If we want to build a paperclip maximizer, it shouldn't be necessary to figure out every single way to self-modify and penalize them appropriately.
We might evade this particular problem using some other method of approximation that does something more like reasoning about actions than reasoning about futures. The reasoning doesn't have to be logically impeccable - we might imagine an agent that identifies a small number of salient consequences of each action, and chooses based on those. But it seems difficult to show how such an agent would have good properties. This is something I'm definitely interested in.
One way to try to make things concrete is to pick a local utility function and specify rules for changing it. For example, suppose we wanted an AI to flag all the 9s in the MNIST dataset. We define a single-time-step utility function by a neural network that takes in the image and the decision of whether to flag or not, and returns a number between -1 and 1. This neural network is deterministically trained for each time step on all previous examples, trying to assign 1 to correct flaggings and -1 to mistakes. Remember, this neural net is just a local utility function - we can make a variety of AI designs involving it. The goal of this exercise is to design an AI that seems liable to make good decisions in order to flag lots of 9s.
The simplest example is the greedy agent - it just does whatever has a high score right now. This is pretty straightforward, and doesn't wirehead (unless the scoring function somehow encodes wireheading), but it doesn't actually do any planning - 100% of the smarts have to be in the local evaluation, which is really difficult to make work well. This approach seems unlikely to extend well to messy environments.
Since Go-playing AI is topical right now, I shall digress. Successful Go programs can't get by with only smart evaluations of the current state of the board, they need to look ahead to future states. But they also can't look all the way until the ultimate time horizon, so they only look a moderate way into the future, and evaluate that future state of the board using a complicated method that tries to capture things important to planning. In sufficiently clever and self-aware agents, this approximation would cause self-sabotage to pop up. Even if the Go-playing AI couldn't modify itself to only care about the current way it computes values of actions, it might make suboptimal moves that limit its future options, because its future self will compute values of actions the 'wrong' way.
If we wanted to flag 9s using a Dewian value learner, we might score actions according to how good they will be according to the projected utility function at some future time step. If this is done straightforwardly, there's a wireheading risk - the changes to its utility function are supplied by humans who might be influenced by actions. I find it useful to apply a sort of "magic button" test - if the AI had a magic button that could rewrite human brains, would it pressing that button have positive expected utility for it? If yes, then this design has problems, even though in our current thought experiment it's just flagging pictures.
To eliminate wireheading, the value learner can use a model of the future inputs and outputs and the probability of different value updates given various inputs and outputs, which doesn't model ways that actions could influence the utility updates. This model doesn't have to be right, it just has to exist. On one hand, this seems like a sort of weird doublethink, to judge based on a counterfactual where your actions don't have impacts you could otherwise expect. On the other hand, it also bears some resemblance to how we actually reason about moral information. Regardless, this agent will now not wirehead, and will want to get good results by learning about the world, if only in the very narrow sense of wanting to play unscored rounds that update its value function. If its value function and value updating made better use of unlabeled data, it would also want to learn about the world in the broader sense.
Overall I am somewhat frustrated, because value learners have these nice properties, but are computationally unrealistic and do not play well with approximation. One can try to get the nice properties elsewhere, such as relying on an action-suggester to not suggest wireheading, but it would be nice to be able to talk about this as an approximation to something fancier.