Proposal for an Implementable Toy Model of Informed Oversight

William Saunders

Setup:
The goal is to produce a censored representation. Given some input data x, produce a censored representation r that is as close as possible to the original, but that does not allow anyone who is given r to predict the value of some sensitive variable s (similar to the setup in the paper Censoring Representations with an Adversary).
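A minimal sketch of the pieces involved, assuming small PyTorch MLPs, a binary sensitive variable, and squared error as the measure of closeness to the original (the dimensions, architectures, and losses are placeholder choices, not part of the proposal):

```python
import torch
import torch.nn as nn

X_DIM = 32       # hypothetical input dimension
R_DIM = X_DIM    # representation kept the same size so "close to the original"
                 # can be measured with a simple squared error

class Encoder(nn.Module):
    """Agent's network: maps input data x to a censored representation r."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(X_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, R_DIM))
    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    """Predicts the sensitive variable s (binary here) from a representation.
    Both the overseer and the agent's own extractor can use this shape."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(R_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, r):
        return self.net(r)   # logit for P(s = 1 | r)

def distortion(x, r):
    """'As close as possible to the original', taken here as mean squared error."""
    return ((x - r) ** 2).mean()
```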

The agent processes batches of inputs at a time. The overseer gets the censored representations and the labels s for half of the input batch, and tries to predict the value of s for the other half of the batch. The agent gets a reward that is the negative of the overseer's predictive loss (so it is maximized when the overseer is unable to predict s). Additionally, the agent is also given an objective to hide information about s in the representation: the agent has its own network, which it can train to extract that information from the representations it produces.
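Continuing the sketch above, one round of this protocol might look like the following; the number of overseer training steps per batch, the binary cross-entropy losses, and s being a float tensor of 0/1 labels are all assumptions:

```python
import torch
import torch.nn.functional as F

def oversight_round(encoder, overseer, agent_extractor, x, s,
                    overseer_opt, extractor_opt, overseer_steps=5):
    """One batch of the game. x: (N, X_DIM) inputs, s: (N,) float 0/1 labels."""
    r = encoder(x)              # censored representations for the whole batch
    half = x.shape[0] // 2

    # The overseer sees representations and labels for the first half...
    for _ in range(overseer_steps):
        overseer_opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(
            overseer(r[:half].detach()).squeeze(-1), s[:half])
        loss.backward()
        overseer_opt.step()

    # ...and is evaluated on the second half. The agent's reward is the
    # negative of this predictive loss.
    with torch.no_grad():
        overseer_loss = F.binary_cross_entropy_with_logits(
            overseer(r[half:]).squeeze(-1), s[half:])

    # The agent trains its own extractor to recover s from the representation.
    extractor_opt.zero_grad()
    agent_loss = F.binary_cross_entropy_with_logits(
        agent_extractor(r.detach()).squeeze(-1), s)
    agent_loss.backward()
    extractor_opt.step()

    return overseer_loss, agent_loss
```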

So the agent's objective function is a combination of the two goals: maximize the overseer's predictive loss while minimizing its own predictive loss. (Note that the agent can potentially use backpropagation to optimize its own predictive loss, while it doesn't have gradient information for the overseer's loss, which it only receives as a scalar reward.)
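In the notation of the sketches above, one way to write the combination is a weighted difference of the two losses; the weights, and the split between a backpropagated term and a reward-only term, are assumptions made explicit here:

```python
import torch.nn.functional as F

ALPHA, BETA = 1.0, 1.0   # hypothetical weights on the two components

def agent_objective(overseer_loss, agent_loss):
    # Maximized when the overseer's predictive loss is high and the agent's
    # own predictive loss is low.
    return ALPHA * overseer_loss - BETA * agent_loss

def encoder_step(encoder, agent_extractor, encoder_opt, x, s):
    """Update the encoder on the differentiable component only: the agent can
    backpropagate its own predictive loss through the representation, whereas
    the overseer's loss arrives as a bare scalar reward and would have to be
    optimized with a reinforcement-learning / black-box method instead."""
    encoder_opt.zero_grad()
    r = encoder(x)
    agent_loss = F.binary_cross_entropy_with_logits(
        agent_extractor(r).squeeze(-1), s)
    (BETA * agent_loss).backward()
    encoder_opt.step()
```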

A nice feature of this setup is that the agent's predictive network can act as an explanation for the agent's actions, since it allows the overseer to determine the extent to which it is being deceived by running the predictions itself. We can allow the agent to make a choice about how honest to be by having the agent (via a reinforcement learning algorithm) pick a value of a noise parameter σ, and provide a copy of its predictive network with Gaussian noise of standard deviation σ added to the parameter values.
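A sketch of that choice, assuming the noise is independent Gaussian noise of standard deviation σ added to every parameter of a copy of the network:

```python
import copy
import torch

def noisy_copy(predictive_net, sigma):
    """Return a copy of the agent's predictive network with Gaussian noise of
    standard deviation sigma added to every parameter. The agent (via an RL
    algorithm) chooses sigma: sigma = 0 hands the overseer a faithful
    explanation, while a large sigma makes the copy useless."""
    noisy = copy.deepcopy(predictive_net)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * sigma)
    return noisy
```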

I expect that this setup will demonstrate the informed oversight problem, with the agent trying to find a representation that is easy for it to decode, but hard for the overseer to decode (though this might depend on how much weight is assigned to each component of the objective function).

Some things to explore in this setup:

I’d appreciate any feedback.
