The informed oversight problem is a serious challenge for approval-directed agents (I recommend reading the post if you haven't already). Here is one approach to the problem that works by adding an entropy-maximization objective.
Let agent B be overseeing agent A. Part of the problem seems to be that A has many different possible strategies that B evaluates as good, so A may choose among these good-looking strategies arbitrarily. If some of the good-looking strategies are actually bad, then A may choose one of these bad strategies.
This is not a problem if B's evaluation function has a single global maximum, and solutions significantly different from this one are necessarily rated as worse. It would be nice to have a general way of turning a problem with multiple global maxima into one with a unique global maximum.
Here's one attempt at doing this. Given the original evaluation function $f$ mapping strings to reals, construct a new evaluation function $f'$ mapping distributions over strings to reals. Specifically, for some other distribution of strings $Q$ and a constant $c > 0$, define
$$f'(P) := \mathbb{E}_{X \sim P}[f(X)] - c \cdot \mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{X \sim P}[f(X) + c \log Q(X)] + c \cdot H(P)$$
where the equality holds because $\mathrm{KL}(P \,\|\, Q) = -H(P) - \mathbb{E}_{X \sim P}[\log Q(X)]$. Observe that $f'$ is strongly concave, so it has a single global maximum and no other local maxima. This global maximum is
$$P^*(x) \propto Q(x) \, e^{f(x)/c}$$
So the optimal solution to this problem is to choose a distribution $P$ that is somewhat similar to $Q$ but overweights values $x$ with a high $f(x)$ value (with the rationality parameter $c^{-1}$ determining how much to overweight). The higher $c$ is, the more strongly concave the problem is and the more $P$ will imitate $Q$; the lower $c$ is, the more this problem looks like the original $f$-maximization problem. This interpolation is similar to quantilization, but is somewhat different mathematically.
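To make the closed form concrete, here is a minimal numerical sketch on a finite set of four "strings". It assumes the objective is $\mathbb{E}_{X \sim P}[f(X)] - c \cdot \mathrm{KL}(P \,\|\, Q)$ with maximizer proportional to $Q(x) e^{f(x)/c}$; the particular values of `f` and `q` and the function names are made up for illustration.

```python
import numpy as np

def regularized_objective(p, f, q, c):
    """E_{X~P}[f(X)] - c * KL(P || Q) for distributions on a finite set."""
    p, f, q = (np.asarray(v, dtype=float) for v in (p, f, q))
    mask = p > 0  # convention: 0 * log 0 = 0
    kl = np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))
    return float(np.dot(p, f) - c * kl)

def optimal_distribution(f, q, c):
    """Closed-form maximizer: P*(x) proportional to Q(x) * exp(f(x) / c)."""
    logits = np.log(q) + np.asarray(f, dtype=float) / c
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Illustrative values (assumptions): f has two tied global maxima,
# and q is an arbitrary reference distribution with full support.
f = np.array([1.0, 1.0, 0.0, -1.0])
q = np.array([0.4, 0.1, 0.3, 0.2])

if __name__ == "__main__":
    for c in (0.05, 1.0, 100.0):
        p_star = optimal_distribution(f, q, c)
        print(c, np.round(p_star, 4), regularized_objective(p_star, f, q, c))
```

Under these assumptions, a small $c$ should push nearly all of the mass onto the two tied maximizers of $f$ (split in the same 4:1 ratio as in `q`), while a large $c$ should leave the result close to `q` itself.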
Intuitively, optimizing the new objective $f'$ seems harder than optimizing the original $f$: the distribution $P$ must be able to provide all possible good solutions to $f$, rather than just one. But I think standard reinforcement learning algorithms can be adapted to optimize $f'$. Really, you just need to optimize $\mathbb{E}_{X \sim P}[g(X)] + c \cdot H(P)$ for some $g$, since you can wrap the $f$ and $c \log Q$ terms together into a single function $g(x) = f(x) + c \log Q(x)$. So the agent must be able to maximize the sum of some original objective and the entropy of its own actions.
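As a rough check that an objective of the form $\mathbb{E}_{X \sim P}[g(X)] + c \cdot H(P)$ can be optimized by ordinary gradient methods, the sketch below parameterizes $P$ as a softmax over logits and runs plain gradient ascent. This is only an illustration under simplifying assumptions: the outcome set is tiny, the gradient is computed exactly rather than estimated from samples as an RL algorithm would, and the values of `g`, `c`, the step size, and the iteration count are arbitrary choices.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def objective(theta, g, c):
    """E_{X~P}[g(X)] + c * H(P), where P = softmax(theta)."""
    p = softmax(theta)
    return float(np.dot(p, g) - c * np.dot(p, np.log(p)))

def gradient(theta, g, c):
    """Exact gradient of the objective with respect to the logits."""
    p = softmax(theta)
    h = g - c * np.log(p)          # per-outcome payoff including the entropy term
    return p * (h - np.dot(p, h))  # subtracting the mean plays the role of a baseline

def ascend(g, c, lr=1.0, steps=5000):
    """Plain gradient ascent from uniform logits; returns the final distribution."""
    theta = np.zeros_like(g, dtype=float)
    for _ in range(steps):
        theta += lr * gradient(theta, g, c)
    return softmax(theta)

# Illustrative values (assumptions, not from the post).
g = np.array([1.0, 1.0, 0.0, -1.0])
c = 0.5
p_learned = ascend(g, c)
p_closed_form = softmax(g / c)  # the maximizer when the reference distribution is uniform

if __name__ == "__main__":
    print(np.round(p_learned, 5))
    print(np.round(p_closed_form, 5))
```

Because a softmax always has full support, the only stationary point of this objective is the Boltzmann distribution proportional to $e^{g(x)/c}$, so the two printed distributions should agree closely.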
Consider Q-learning. An agent using Q-learning, in a given state $s$, will take the action $a$ that maximizes $Q(s, a)$, which is the expected total reward resulting from taking this action in this state (including utility from all future actions). (Here $Q$ is Q-learning's action-value function, not a distribution over strings.) Instead of choosing an action to maximize $Q(s, a)$, suppose the agent chooses a distribution $P$ over actions to maximize $\mathbb{E}_{a \sim P}[Q(s, a)] + c \cdot H(P)$. Then the agent takes a random action $a \sim P$ and receives the normal reward plus an extra reward equal to $c \cdot H(P)$ (so that the learned $Q$ takes into account the entropy objective). As far as I can tell, this algorithm works for maximizing the original reward plus $c$ times the entropy of the agent's sequence of actions.
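The following is a small tabular sketch of this idea, not a definitive implementation. Several details are assumptions: the two-state toy MDP is hypothetical, the problem is treated as episodic and undiscounted, and the bootstrap target uses the expected next-state value under the agent's own next-state distribution, which the description above leaves unspecified. It relies on the fact that the distribution maximizing $\mathbb{E}_{a \sim P}[Q(s, a)] + c \cdot H(P)$ is the softmax of $Q(s, \cdot)/c$.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Hypothetical toy MDP: TRANSITIONS[state][action] = (next_state, reward).
# A next_state of None ends the episode.
TRANSITIONS = {
    0: {0: (1, 0.0), 1: (None, 0.5)},
    1: {0: (None, 1.0), 1: (None, 1.0)},
}

def train(c=0.5, alpha=0.1, episodes=5000, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((2, 2))  # q[state, action]
    for _ in range(episodes):
        s = 0
        while s is not None:
            # Distribution maximizing E_P[Q(s, a)] + c * H(P).
            p = softmax(q[s] / c)
            a = int(rng.choice(2, p=p))
            s_next, r = TRANSITIONS[s][a]
            # Normal reward plus the entropy bonus c * H(P).
            r_total = r + c * entropy(p)
            if s_next is None:
                v_next = 0.0
            else:
                # Assumption: bootstrap from the expected value under the
                # agent's own distribution in the next state.
                v_next = float(np.dot(softmax(q[s_next] / c), q[s_next]))
            q[s, a] += alpha * (r_total + v_next - q[s, a])
            s = s_next
    return q

if __name__ == "__main__":
    c = 0.5
    q = train(c=c)
    print("policy in state 0:", np.round(softmax(q[0] / c), 4))
    print("policy in state 1:", np.round(softmax(q[1] / c), 4))
```

In this toy problem the three possible trajectories have returns 1, 1, and 0.5, so the distribution over action sequences that maximizes return plus $c$ times entropy weights them by $e^{\text{return}/c}$. With $c = 0.5$ that gives probability $2e/(1 + 2e) \approx 0.845$ for the first action in state 0, which the learned policy should approach if the sketch is behaving as intended.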
I'm not sure how well this works as a general solution to the informed oversight problem. It replaces the original objective with a slightly different one, and I don't have strong intuitions about whether this new objective is harder to optimize than the original one, though it doesn't seem a lot harder. I'm also not sure whether it's always possible to set the constant $c$ low enough to incentivize good performance on the original objective and high enough for the regularized objective $f'$ to be strongly concave enough to isolate a unique solution. It's also not clear whether the strong concavity will be sufficient: even if the global maximum of $f'$ is desirable, other strategies A might use could approximately optimize $f'$ while being bad. So while there's some selection pressure against bad strategies, it might not be enough.