Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

With thanks to Rohin Shah.

Dear LessWrongers, this is an opportunity to make money and help with AI alignment.

We're looking for specific AI capabilities; has anyone published on the following subject:

  • Generating multiple reward functions or policies from the same set of challenges. Have there been designs for deep learning or similar, in which the agent produces multiple independent reward functions (or policies) to explain the same reward function or behaviour?

For example, in CoinRun, the agent must get to the end of the level, on the right, to collect the coin. It only gets the reward for collecting the coin.

That is the "true" reward, but, since the coin is all the way to the right, as far as the agent knows, "go to the far right of the level" could just as well have been the true reward.

We'd want some design that generated both these reward functions (and, in general, generated multiple reward functions when there are several independent candidates). Alternatively, the design might generate two independent policies - we could test these by putting the coin in the middle of the level and seeing what the agent decided to do.
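To make the ambiguity concrete, here is a minimal sketch of a one-dimensional stand-in for CoinRun. Everything in it (the function names, the level length, treating positions as integers) is an assumption made for illustration, not part of the actual CoinRun environment. It shows that the two candidate rewards agree on every position when the coin sits at the far right, and only come apart once the coin is moved to the middle.

```python
# Toy 1-D stand-in for CoinRun (hypothetical, for illustration only).
# Positions are integers 0 .. level_len - 1; the far right is level_len - 1.

def coin_reward(agent_x, coin_x, level_len):
    """Candidate 1: reward 1 when the agent is standing on the coin."""
    return 1.0 if agent_x == coin_x else 0.0

def right_reward(agent_x, coin_x, level_len):
    """Candidate 2: reward 1 when the agent is at the far right of the level."""
    return 1.0 if agent_x == level_len - 1 else 0.0

def agree_everywhere(coin_x, level_len):
    """True if both candidate rewards give the same value at every position."""
    return all(
        coin_reward(x, coin_x, level_len) == right_reward(x, coin_x, level_len)
        for x in range(level_len)
    )

LEVEL_LEN = 10

# Training setting: the coin is always at the far right.
train_agree = agree_everywhere(coin_x=LEVEL_LEN - 1, level_len=LEVEL_LEN)

# Test setting: the coin is moved to the middle of the level.
test_agree = agree_everywhere(coin_x=LEVEL_LEN // 2, level_len=LEVEL_LEN)

print(train_agree, test_agree)
```

No amount of training data from the first setting can separate the two candidates, which is why the test with the coin in the middle is informative.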

We're not interested in a Bayesian approach that lists a bunch of reward functions and then updates to include just those two (that's trivially easy to do). Nor are we interested in an IRL-style approach that lists "features", including the coin and the right-hand side.

What we'd want is some neural-net style design that generates the coin reward and the move-right reward just from the game data, without any previous knowledge of the setting.

So, does anyone know any references for that kind of work?

We will pay $50 for the first relevant reference submitted, and $100 for the best reference.



I'm guessing such reward functions would be used to detect something like model splintering

Deep Reinforcement Learning from Human Preferences uses an ensemble of reward models, prompting the user for more feedback at certain thresholds of disagreement among the models.

Whether this ensemble would be diverse enough to learn both "go right" and "go to coin" is unclear. Traditional "predictive" diversity metrics probably wouldn't help (the whole problem is that the coin and the right wall reward models would predict the same reward on the training distribution), but using some measure of internal network diversity (i.e. differences in internal representations) might work. 

Hey there! Sorry for the delay. $100 for the best reference. PM for more details.


What we'd want is some neural-net style design that generates the coin reward and the move-right reward just from the game data, without any previous knowledge of the setting.

So you're looking for curriculum design/exploration in meta-reinforcement-learning? Something like Enhanced POET/PLR/REPAIRED, but where it's not just moving right but a complicated environment with arbitrary reward functions (e.g. using randomly initialized CNNs to map state to 'reward')? Or would hindsight or successor methods count, as they relabel rewards for executed trajectories? Would relatively complex generative games like Alchemy or LIGHT count? Self-play, like robotics self-play?

Hey there! Sorry for the delay. $50 awarded to you for fastest good reference. PM me your bank details.

If you have access to the episode rewards, you should be able to train an ensemble of NNs using Bayes + MCMC, with the final reward as output and the entire episode as input. Maybe using something like this:

This gets a lot more difficult if you're trying to directly learn behaviour from rewards or vice versa, because now you need to make assumptions to derive "P(behaviour | reward)" or "P(reward | behaviour)".

Edit: Pretty sure OAI used a reward ensemble to generate candidate pairs for further data collection.

From the paper: "we sample a large number of pairs of trajectory segments of length k, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members."
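A rough sketch of that selection step, assuming each ensemble member's prediction for a pair has already been reduced to a single probability that the first segment is preferred. The function name `select_queries`, the array layout, and the example numbers are all hypothetical and are not taken from the paper's code:

```python
import numpy as np

def select_queries(pref_probs, num_queries):
    """Pick the segment pairs the ensemble disagrees on most.

    pref_probs: array of shape (n_members, n_pairs); entry [m, p] is
        member m's predicted probability that the first segment of
        pair p is preferred (an assumed representation).
    Returns the indices of the num_queries pairs with the highest
    variance across ensemble members, most contested first.
    """
    variance = pref_probs.var(axis=0)             # disagreement per pair
    order = np.argsort(-variance, kind="stable")  # highest variance first
    return order[:num_queries]

# Made-up predictions from 3 ensemble members on 3 candidate pairs.
pref_probs = np.array([
    [0.9, 0.5, 0.1],
    [0.9, 0.6, 0.9],
    [0.8, 0.5, 0.2],
])

chosen = select_queries(pref_probs, num_queries=1)
print(chosen)  # index of the pair to send for human comparison
```

Here the members roughly agree on the first two pairs and split on the third, so the third is the one selected. Note that this only surfaces disagreement the ensemble already has; if every member has learned the same one of "go right" and "go to coin", the variance stays low even on inputs where the two rewards differ.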