TL;DR: We present an advantage variant which, in certain settings, does not train an "optimal" policy; instead, a fixed reward moves the policy a fixed amount from initialization. Non-tabular empirical results are mixed: the policy doesn't mode-collapse, but its convergence properties are unclear.
Summary: Many policy gradient methods allow a network to extract arbitrarily many policy updates from a single kind of reinforcement event (e.g. for outputting tokens related to weddings). Alex proposes a slight modification to the advantage equation, called "action-conditioned TD error" (ACTDE). ACTDE ensures that the network doesn't converge to an "optimal" policy (which almost always puts infinite logits on a single action). Instead, ACTDE updates the network by a fixed number of logits.
For example, suppose and . In this case, PPO converges to a...
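One way to read the proposal is as a baseline swap in the advantage: use the action-conditioned value Q(s, a) instead of the state value V(s), so that once Q has caught up with the reward, the update vanishes. Here is a minimal tabular sketch of that dynamic in a hypothetical one-state, two-armed bandit (the reward values are illustrative, not from the post):

```python
import numpy as np

# Hypothetical fixed rewards for the two actions (illustrative values).
rewards = np.array([1.0, 0.5])

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def run(use_actde, steps=20000, lr=0.1):
    logits = np.zeros(2)  # tabular softmax policy
    q = np.zeros(2)       # per-action value estimates
    rng = np.random.default_rng(0)
    for _ in range(steps):
        pi = softmax(logits)
        a = rng.choice(2, p=pi)
        r = rewards[a]
        if use_actde:
            # ACTDE reading: baseline is Q(s, a), so the advantage shrinks
            # to zero as the value estimate catches up with the reward.
            adv = r - q[a]
        else:
            # Standard baseline: V(s) = E_pi[Q], so the better action keeps
            # receiving positive updates forever.
            adv = r - pi @ q
        q[a] += lr * (r - q[a])   # TD-style value update
        logits[a] += lr * adv     # logit update for the sampled action
    return softmax(logits)

print("advantage:", run(False))  # collapses toward the higher-reward arm
print("ACTDE:    ", run(True))   # stays mixed: logits move at most ~r_a
```

Under the ACTDE branch, each action's logit tracks its Q-estimate exactly, so the total logit movement per action is bounded by its reward, matching the "fixed number of logits" claim; the standard-advantage branch drives the logit gap toward infinity.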
Two deep neural networks with wildly different parameters can produce equally good results. Not only can a tweak to the parameters leave performance unchanged; in many cases, two networks with completely different weights and biases produce identical outputs for every input.
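One concrete source of this is permutation symmetry: relabeling the hidden units of a layer (and permuting the adjacent weight matrices to match) gives a different parameter vector that computes exactly the same function. A minimal sketch with a toy one-hidden-layer MLP (our construction, not an example from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-hidden-layer MLP: x -> relu(x @ W1 + b1) @ W2 + b2
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

# Permute the hidden units: a different point in weight space...
perm = [2, 0, 3, 1]
W1p, b1p = W1[:, perm], b1[perm]
W2p = W2[perm, :]

x = rng.normal(size=(5, 3))
# ...that computes exactly the same function on every input.
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2))
```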
The motivating question:
Given two optimal models in a neural network's weight space, is it possible to find a path between them composed entirely of other optimal models?
In other words, can we find a continuous path of tweaks from the first optimal model to the second without reducing performance at any point in the process?
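To see why this is nontrivial, note that the straight line between two functionally identical parameterizations generally does not preserve the function. A small sketch (again our toy construction, assuming the permuted-MLP setup as the two "optimal" endpoints):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two parameterizations of the SAME function: an MLP and a copy with
# permuted hidden units.
W1, b1 = rng.normal(size=(3, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 2)), rng.normal(size=2)
perm = np.roll(np.arange(8), 1)  # a fixed non-identity permutation
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

x = rng.normal(size=(64, 3))
target = mlp(x, W1, b1, W2, b2)  # both endpoints produce this exactly

# Walk the straight line between the two weight settings and measure how
# far the interpolated network's outputs drift from the shared function.
drift = []
for t in np.linspace(0.0, 1.0, 11):
    params = [(1 - t) * a + t * b for a, b in
              zip((W1, b1, W2, b2), (W1p, b1p, W2p, b2))]
    drift.append(np.abs(mlp(x, *params) - target).max())

print(drift)  # zero at both endpoints, typically large in the middle
```

So naive linear interpolation leaves the optimal set; the question is whether some curved path stays inside it.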
Ultimately, we hope that the study of equivalently optimal models will lead to advances in interpretability: for...
It would be interesting to try to distinguish between three types of dimensions: those that leave the set of optimal models, those within the set due to trivial parameters, and those within the set due to equally good but functionally different models.
Especially if it turns out that optimization is attracted to regions with the maximum number of trivial parameters, but not maximum dimensionality overall.
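The first distinction (directions that leave the set vs. flat directions within it) can be read off the loss Hessian in a toy case. A sketch with a two-parameter model f(x) = a·b·x, where the zero-loss set {a·b = 1} is one-dimensional and the flat direction is a trivial rescaling (our illustration; separating trivial from functionally different flat directions would need more than the Hessian):

```python
import numpy as np

# Toy loss: fit a*b to 1. The zero-loss set {(a, b): a*b = 1} is 1-D;
# the tangent direction is a trivial rescaling, the normal one leaves the set.
def loss(p):
    a, b = p
    return (a * b - 1.0) ** 2

def hessian(f, p, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at p."""
    p = np.asarray(p, float)
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += eps; pp[j] += eps
            pm = p.copy(); pm[i] += eps; pm[j] -= eps
            mp = p.copy(); mp[i] -= eps; mp[j] += eps
            mm = p.copy(); mm[i] -= eps; mm[j] -= eps
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * eps ** 2)
    return H

# At the point (2, 0.5) on the set: one near-zero (flat) eigenvalue,
# one strictly positive (curved) eigenvalue.
eigvals = np.linalg.eigvalsh(hessian(loss, [2.0, 0.5]))
n_flat = int(np.sum(np.abs(eigvals) < 1e-3))
print(eigvals, n_flat)
```

Counting near-zero Hessian eigenvalues gives the local dimensionality of the flat set; telling apart the trivial-parameter flat directions from the functionally different ones is exactly the extra information the note asks for.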