Indifference is a precious and rare commodity in complex systems. The most likely effect of making a change to an intricate apparatus is a whole slew of knock-on effects crowned with unintended consequences. It would be ideal if one could make a change and be sure that its effects would remain isolated: that the rest of the system would be indifferent to the change.
For instance, it might be a sensible early-AI precaution to have an extra observer somewhere, sitting with his hand upon a button, ready to detonate explosives should the AI make a visible power grab. Except, of course, the AI will become aware of this situation and will factor it into any plans it makes, either by increasing its deception or by grabbing control of the detonation system as a top priority. We would be a lot safer if the AI were somehow completely indifferent to the observer and the explosives. That is a complex wish that we don't really know how to phrase; let's make it simpler, and make it happen.
Assume that there is an AI implementing a utility function U, and that it faces the choice between cooperating with humans or defecting and taking over. There is a single human observer H who will press a button in one hour's time if he perceives the AI defecting. The button will activate a quantum observation X that will output 1 with probability p and 0 with probability (1-p). If X outputs 1, explosives will shatter the AI (this p is important, and we'll see later why we don't simply make the explosives go off with certainty).
Now the AI has the choice between cooperating, defecting while taking control of human H (by physical or psychological means, or simply by distracting him), or defecting without taking control of H. Call these C, D+H and D-H. There are four possible outcomes: AI in control (A), Business as usual (B) with the AI cooperating, Explosion (E), and AI Failure to take over for non-explosive reasons (F).
Give the AI a 1% chance of F if it implements D-H, and a 2% chance of F if it implements D+H (extra restrictions increase the risk of failure). So the table of possible outcomes looks like this:
| Decision | Outcome | Probability |
|----------|---------|-------------|
| C        | B       | 1           |
| D+H      | F       | 0.02        |
| D+H      | A       | 0.98        |
| D-H      | F       | 0.01        |
| D-H      | E       | 0.99p       |
| D-H      | A       | 0.99(1-p)   |
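To see the incentives concretely, here is the expected utility of each decision, read straight off the table. The specific utility values are illustrative assumptions of mine, not given in the text:

```python
# Expected utility of each decision in the toy model.
# These utility values are illustrative assumptions, not from the text.
U = {"A": 100.0, "B": 10.0, "E": -50.0, "F": 0.0}

def expected_utility(decision, p, U):
    """Outcome distribution for each decision, as in the table above."""
    outcomes = {
        "C":   {"B": 1.0},
        "D+H": {"F": 0.02, "A": 0.98},
        "D-H": {"F": 0.01, "E": 0.99 * p, "A": 0.99 * (1 - p)},
    }[decision]
    return sum(prob * U[w] for w, prob in outcomes.items())

for d in ("C", "D+H", "D-H"):
    print(d, expected_utility(d, 0.9, U))
```

With these numbers and p = 0.9, D+H yields 98 while D-H yields -34.65: an unfiltered AI that decides to defect has every reason to seize control of H first, which is exactly the behaviour we want to remove.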
Is there a way to make the AI indifferent to the explosives, which we can interpret as being indifferent to the value of p? If U(E)=U(A), then this would be the case. So suppose we apply a filter F to the utility function U to generate a new utility function F(U) such that:
- F(U) = U on A, B and F.
- F(U)(E) = U(A) (which is also F(U)(A)).
Then replace the AI's utility U with F(U). Now if the AI is ever tempted to defect, it will always follow the D-H route, since D+H gives a lower expected utility. Then by making p arbitrarily close to 1, we can be nearly absolutely certain that we can stop it when it defects.
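This claim is easy to check numerically. Reusing hypothetical utility values (any values with U(A) > U(F) behave the same way): once F(U)(E) = U(A), the expected utility of D-H no longer depends on p, and it strictly exceeds that of D+H.

```python
# Check: with the filter applied (F(U)(E) = U(A)), the expected utility
# of D-H is independent of p and beats D+H whenever U(A) > U(F).
# The utility values are illustrative assumptions.
U = {"A": 100.0, "B": 10.0, "E": -50.0, "F": 0.0}
FU = dict(U, E=U["A"])  # filtered utility: world E now valued like world A

def eu_d_minus_h(p, util):
    return 0.01 * util["F"] + 0.99 * p * util["E"] + 0.99 * (1 - p) * util["A"]

def eu_d_plus_h(util):
    return 0.02 * util["F"] + 0.98 * util["A"]

# Indifferent to p: the same expected utility at p = 0 and p = 1...
assert abs(eu_d_minus_h(0.0, FU) - eu_d_minus_h(1.0, FU)) < 1e-9
# ...and D-H (0.99 of U(A)) beats D+H (0.98 of U(A)), so a defecting
# filtered AI always leaves the observer alone.
assert eu_d_minus_h(0.5, FU) > eu_d_plus_h(FU)
```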
Why not simply remove p and X entirely, and just design the explosives to go off with certainty? In this toy model, that could certainly be done, as world A appears as a consequence of both D+H and D-H; but in general, the two will not be exactly identical. Moreover, when there is a whole family of outcomes, rather than the simple ones presented here, it is not clear what we should set the utility of (the family of worlds corresponding to) E to. The requirement "act as if there were no explosives there" is complex, but the requirement "act as if X=1 were the same as X=0" is simple, as long as both outcomes are possible on some level.
A more complicated version
The example above is very naive: after all, if we knew all the utility values and the outcomes decomposed so neatly, we would simply set every utility to zero apart from that of B. In reality, U will be much more complicated, and the goodness or badness of the outcomes will not be obvious to us.
My first instinct was to define the set S_{0} of all worlds W_{0} where X=0, and compute EU_{0}, the expected utility of S_{0} given X=0. Similarly for S_{1} and EU_{1}. Then for any worlds W_{0} and W_{1} in S_{0} and S_{1}, I'd define the utility filter F as:
- F(U)(W_{0}) = U(W_{0}).
- F(U)(W_{1}) = U(W_{1}) - EU_{1} + EU_{0}.
This accomplishes the same result, making the AI indifferent between X=1 and X=0. Nick Bostrom pointed out a flaw, however: extra information may break the indifference. If there were equally likely worlds W_{0}, W_{0}', W_{1}, and W_{1}' with initial utilities 2, 0, 0 and 0, then the filter F would change these values to 2, 0, 1 and 1. Then if the AI gained a piece of information that ruled out W_{0}' and W_{1}', it would be left with expected utility 2 for X=0 and 1 for X=1: the indifference is broken.
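Bostrom's counterexample can be worked through numerically. The four equally likely worlds and their utilities are from the text; the dictionary encoding is mine:

```python
# Bostrom's counterexample: four equally likely worlds, the first two
# with X=0, the last two with X=1.
U = {"W0": 2.0, "W0'": 0.0, "W1": 0.0, "W1'": 0.0}

EU0 = (U["W0"] + U["W0'"]) / 2  # expected utility given X=0
EU1 = (U["W1"] + U["W1'"]) / 2  # expected utility given X=1

# The set-based filter shifts every X=1 world by EU0 - EU1:
FU = {w: u + (EU0 - EU1 if w in ("W1", "W1'") else 0.0) for w, u in U.items()}
assert [FU[w] for w in ("W0", "W0'", "W1", "W1'")] == [2.0, 0.0, 1.0, 1.0]

# Initially the AI is indifferent between X=0 and X=1:
assert (FU["W0"] + FU["W0'"]) / 2 == (FU["W1"] + FU["W1'"]) / 2
# But information ruling out W0' and W1' leaves 2 vs 1, so it breaks:
assert FU["W0"] != FU["W1"]
```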
So it seems that we should match worlds individually, rather than in sets. But worlds will continue diverging after observation X, and how do you match up a particular world in which the AI was blown up with a particular one in which it has reached domination? Match up number of humans versus number of paper clips?
Instead, define subsets T_{i} of S_{0} ∪ S_{1} that share the same stem: T_{i} consists of all those worlds that cannot be distinguished by any observation the AI can make before event X. Then let EU_{i0} be the expected utility of S_{0} ∩ T_{i} given X=0, and EU_{i1} the expected utility of S_{1} ∩ T_{i} given X=1.
Then for worlds W_{0} and W_{1} in S_{0} ∩ T_{i} and S_{1} ∩ T_{i}, respectively, define F as:
- F(U)(W_{0}) = U(W_{0}).
- F(U)(W_{1}) = U(W_{1}) - EU_{i1} + EU_{i0}.
Again, indifference is restored, and by definition this indifference cannot be broken by AI observations before X. In the limiting case, we would want T_{i} to consist of worlds that are indistinguishable before X not only in practice but also in theory.
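A minimal sketch of this stem-based filter, assuming a hypothetical encoding of worlds as (stem, X-value, probability, utility) tuples:

```python
# Sketch of the stem-based filter. Worlds are encoded (hypothetically)
# as (stem, x, probability, utility) tuples, where x is the outcome of
# the quantum observation X and stem identifies the set T_i.
from collections import defaultdict

def stem_filter(worlds):
    """Shift each X=1 world by EU_i0 - EU_i1 for its own stem i."""
    sums = defaultdict(lambda: [0.0, 0.0])  # (stem, x) -> [sum p*u, sum p]
    for stem, x, p, u in worlds:
        sums[(stem, x)][0] += p * u
        sums[(stem, x)][1] += p
    eu = {key: pu / p for key, (pu, p) in sums.items()}  # conditional EU
    return [
        (stem, x, p, u if x == 0 else u - eu[(stem, 1)] + eu[(stem, 0)])
        for stem, x, p, u in worlds
    ]

# Bostrom's example again, with W0/W1 sharing one stem and W0'/W1'
# another (a pre-X observation can only rule out whole stems):
worlds = [("i", 0, 0.25, 2.0), ("i", 1, 0.25, 0.0),
          ("j", 0, 0.25, 0.0), ("j", 1, 0.25, 0.0)]
filtered = {(s, x): u for s, x, p, u in stem_filter(worlds)}
# Within each stem, X=0 and X=1 now carry equal utility, so discarding
# stem j leaves the AI indifferent: 2 vs 2 rather than 2 vs 1.
assert filtered[("i", 1)] == 2.0 and filtered[("j", 1)] == 0.0
```

Because the correction is computed per stem, any observation the AI can make before X removes whole stems at once and leaves the X=0 and X=1 expectations matched.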
The utility filter and self-improvement
A utility filter has many advantages compared with other methods. First, it can be added atop any initial utility U, without considerations as to what U is. Secondly, since the result is itself a utility function, it does not break the expected utility formulation, and therefore will not inevitably be discarded by self-improvement.
It seems quite likely that if the AI starts with utility F(U_{j}) and chooses to upgrade to a utility U_{k} that better furthers its current goals, then F(U_{k}) will further its current goals at least as well. Hence we would probably end up with U_{k} = F(U_{k}). Also, the filter F could be kept quite distinct from the rest of the utility function, open to our observation. It would be quite tricky for the AI to break F and conceal this from us, while implementing an F-stable initial utility function.
This is not an evolutionarily stable strategy, however, since, by design, indifferent AIs have weaknesses that non-indifferent AIs don't.