Maybe you can train a sequence of reward functions such that each $r_{i}$ is discouraged from attending to the input features that are most salient to the previous $i - 1$ reward functions?

I.e., you'd train $r_{1}$ normally. Then, while training $r_{2}$, you'd use gradient saliency (or similar methods) to find which regions of the input are most salient for $r_{1}$ and $r_{2}$, then penalize $r_{2}$ for sharing salient features with $r_{1}$. Similarly, $r_{i}$ would be penalized w.r.t. saliency maps from $\{r_{j}\}_{j < i}$.

Note that for gradient saliency specifically, you can optimize directly for the penalty term with SGD, because differentiation is itself a differentiable operation. You can build a loss term from a quantity like $\sum \left| r_{1,\text{input saliency}} - r_{2,\text{input saliency}} \right|$ (with its sign chosen so that minimizing the loss pushes the two saliency maps apart rather than together) and compute its gradient with respect to the model parameters (Some notes on doing this with PyTorch). Note that some gradient saliency methods seem to fail basic sanity checks.
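A minimal PyTorch sketch of the idea, under stated assumptions: the tiny networks, the random inputs, and the helper names `input_saliency` and `overlap_penalty` are all hypothetical, and the penalty used here is an overlap term (the product of the two absolute input gradients) rather than the difference term above. The key step is `create_graph=True`, which keeps the graph of the input gradient so that the penalty can itself be backpropagated into $r_{2}$'s parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_reward_model() -> nn.Module:
    # Hypothetical stand-in for a reward network; Tanh keeps it smooth.
    return nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))

def input_saliency(model: nn.Module, x: torch.Tensor, create_graph: bool) -> torch.Tensor:
    # Gradient saliency: absolute gradient of the reward w.r.t. the input.
    x = x.detach().requires_grad_(True)
    reward = model(x).sum()
    (grad,) = torch.autograd.grad(reward, x, create_graph=create_graph)
    return grad.abs()

def overlap_penalty(r1: nn.Module, r2: nn.Module, x: torch.Tensor) -> torch.Tensor:
    # r1 is treated as frozen, so its saliency map is a constant target.
    s1 = input_saliency(r1, x, create_graph=False).detach()
    # Keep the graph for r2 so the penalty is differentiable in r2's weights.
    s2 = input_saliency(r2, x, create_graph=True)
    return (s1 * s2).sum(dim=1).mean()

r1, r2 = make_reward_model(), make_reward_model()
x = torch.randn(32, 8)
opt = torch.optim.SGD(r2.parameters(), lr=0.1)

before = overlap_penalty(r1, r2, x).item()
for _ in range(50):
    opt.zero_grad()
    # Penalty only, to show gradients flow; in practice add r2's task loss.
    loss = overlap_penalty(r1, r2, x)
    loss.backward()
    opt.step()
after = overlap_penalty(r1, r2, x).item()
```

Optimizing the penalty alone just shrinks $r_{2}$'s input gradients toward zero, so in a real run it would be one weighted term next to $r_{2}$'s ordinary training loss.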

Non-differentiable saliency methods like Shapley values can still serve as an optimization target, but you'll need to use reinforcement learning or other non-gradient optimization approaches. That would probably be very hard.
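For reference, a small sketch of exact Shapley values used as a saliency measure. The toy `reward` function, the all-zeros baseline, and the `overlap` score are illustrative assumptions. The enumeration over feature orderings grows factorially with the number of features, which is part of why using this as an optimization target is expensive.

```python
from itertools import permutations
from math import factorial

def reward(x):
    # Hypothetical toy reward with an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]

def shapley_values(f, x, baseline):
    # Average each feature's marginal contribution over all feature orderings.
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to actual
            new = f(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / factorial(n) for p in phi]

def overlap(phi_a, phi_b):
    # Scalar score of how much two attribution vectors share salient features.
    return sum(abs(a) * abs(b) for a, b in zip(phi_a, phi_b))

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(reward, x, baseline)
```

A score like `overlap` computed between two reward functions' attributions could then be handed to whatever non-gradient optimizer is being used.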

[-]gwern4y130

You can also steer optimization to find 'diverse' models, like Ridge Rider: https://arxiv.org/abs/2011.06505

I'm not sure how necessary that is. If you want diverse good solutions, that sounds a lot like 'sampling from the posterior', and we know, thanks to Google burning a huge number of TPU-hours on true HMC sampling from Bayesian neural networks, that 'deep ensembles' (i.e. training multiple random initializations from scratch on the same dataset) actually provide you a pretty good sample from the posterior. If there are lots of equally decent ways to classify an image expressible in a NN, then the deep ensemble will sample from them (and that is presumably why ensembling improves performance: because the members are all doing something different, instead of weighting the same features the same amount). If that's not adequate, it'd be good to think about what one really wants instead, and how to build that in (maybe one wants to do data augmentation to erase color from one dataset/model and shapes from another, to encourage a ventral-dorsal split or something).

[-]Stuart_Armstrong4y20

Thanks! Very useful feedback.

[-]Oliver Daniels4y30

The gSCAN benchmark for compositional generalization might be useful. It is essentially a grid world with natural language instructions, where the goal is to compose different concepts seen in training that have different correlations at test time. (E.g. in training, learn blue square and red circle; at test time, identify red square - very similar to identifying bleggs and rubes.)

Regularized attention is a method that's seen some success in similar compositional setups. This method adds a loss calculated as the distance between actual and predefined "golden" alignments between concepts.
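As a rough sketch of what such a loss could look like: the mean-squared distance, the weight `lam`, and the function names below are assumptions for illustration, not the specific formulation used in the regularized attention work.

```python
def attention_regularizer(attention, gold):
    # Mean squared distance between predicted and "golden" alignment weights.
    total, count = 0.0, 0
    for attn_row, gold_row in zip(attention, gold):
        for a, g in zip(attn_row, gold_row):
            total += (a - g) ** 2
            count += 1
    return total / count

def total_loss(task_loss, attention, gold, lam=0.5):
    # Task loss plus a weighted penalty for deviating from the gold alignment.
    return task_loss + lam * attention_regularizer(attention, gold)

# Rows: output tokens; columns: input concepts (hypothetical 2x2 example).
attention = [[0.5, 0.5], [0.0, 1.0]]
gold = [[1.0, 0.0], [0.0, 1.0]]
reg = attention_regularizer(attention, gold)
loss = total_loss(1.0, attention, gold)
```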

Of course this technique is accomplishing a slightly different goal: rather than attempting to learn a "span" of all possible models, it is trying to learn the correct one.

The value of biasing toward the correct model seems to largely depend on the Natural Abstraction Hypothesis. If Wentworth is right, and there are abstractions that cognitive systems will converge on, then learning a span of possible models seems feasible. However, if the NAH is false, then the space of possible models gets very large, making systematic extrapolation according to human values more difficult. In this case, it might be necessary to constrain a model's abstractions according to human values directly, even at the cost of some capabilities.

Take CoinRun as an example. The approach of the OP is to learn a span of possible reward models, and then presumably learn some extrapolation procedure for selecting the correct model. Alternatively, throughout training we could penalize the agent's saliency maps for assigning high value to "large left-facing walls" and reward saliency maps that value the coin. With this regularized value function, the agent would be more likely to pursue the coin if it was placed somewhere else in the level. However, by penalizing left-facing wall saliency, we potentially limit the agent's world model - it may become less aware of left-facing walls, which in turn would lead to a capabilities decrease. See here for a fleshed-out version of this proposal (in CoinRun).
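One way to picture the regularizer, as a sketch only: the saliency map, the binary wall and coin masks, and the weights are hypothetical inputs, and how the saliency map is computed is left open.

```python
def masked_sum(saliency, mask):
    # Total absolute saliency falling inside a binary region mask.
    return sum(
        abs(s) * m
        for s_row, m_row in zip(saliency, mask)
        for s, m in zip(s_row, m_row)
    )

def saliency_regularizer(saliency, wall_mask, coin_mask,
                         wall_weight=1.0, coin_weight=1.0):
    # Penalize saliency on wall pixels, reward saliency on coin pixels.
    wall = masked_sum(saliency, wall_mask)
    coin = masked_sum(saliency, coin_mask)
    return wall_weight * wall - coin_weight * coin

# Hypothetical 2x2 frame: wall in the right column, coin at bottom-left.
saliency = [[0.0, 0.5], [0.25, 1.0]]
wall_mask = [[0, 1], [0, 1]]
coin_mask = [[0, 0], [1, 0]]
reg = saliency_regularizer(saliency, wall_mask, coin_mask)
```

Added to the agent's training loss, a lower value of `reg` corresponds to less attention on the wall and more on the coin.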

Self-supervised world models might solve this problem by explicitly separating the world model from the value function, though I expect we'll need some combination of the two (e.g. EfficientZero, which uses self-supervision and reward to construct its model).

Finding the multiple ground truths of CoinRun and image classification
