Just wanted to flag Lakshminarayanan et al. (deep ensembles) as a standard example of the "train an ensemble from different initializations" approach.
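A minimal sketch of that recipe (the toy linear model and training details here are my placeholders, not the paper's architecture): train K copies of the same model from different random initializations and read off epistemic uncertainty as the spread of their predictions.

```python
import numpy as np

def init_params(seed):
    # Each ensemble member gets its own random initialization.
    r = np.random.default_rng(seed)
    return {"w": r.normal(size=(2,)), "b": r.normal()}

def predict(params, x):
    # Toy linear model standing in for a neural net.
    return x @ params["w"] + params["b"]

# Train K members independently (training loop elided); the spread of
# their predictions on an input is the uncertainty estimate.
ensemble = [init_params(seed) for seed in range(5)]
x = np.array([1.0, -0.5])
preds = np.array([predict(p, x) for p in ensemble])
mean, std = preds.mean(), preds.std()
```

With trained members, a large `std` flags inputs where the ensemble hasn't converged on one model of the data.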
The gSCAN benchmark for compositional generalization might be useful. It's essentially a grid world with natural language instructions, where the goal is to compose concepts seen in training that have different correlations at test time (e.g. learn "blue square" and "red circle" in training, then identify "red square" at test time - very similar to identifying bleggs and rubes).
Regularized attention is a method that's seen some success in similar compositional setups. This method adds a loss calculated as the distance between the model's actual attention alignments and predefined "golden" alignments between concepts.
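A toy sketch of that regularizer, assuming the golden alignment is given as a (query token, key object) matrix; the function name and the MSE distance are illustrative choices:

```python
import numpy as np

def attention_alignment_loss(attn, golden, lam=0.1):
    # Penalize distance between the model's attention weights and a
    # predefined "golden" alignment; added to the task loss in training.
    return lam * np.mean((attn - golden) ** 2)

# Two instruction tokens attending over two grid objects.
attn = np.array([[0.9, 0.1],
                 [0.4, 0.6]])
golden = np.array([[1.0, 0.0],   # token 0 should attend to object 0
                   [0.0, 1.0]])  # token 1 should attend to object 1
reg = attention_alignment_loss(attn, golden)
```

The loss is zero exactly when the model's attention matches the golden alignment.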
Of course this technique is accomplishing a slightly different goal: rather than attempting to learn a "span" of all possible models, it is trying to learn the correct one.
The value of biasing toward the correct model seems to largely depend on the Natural Abstraction Hypothesis. If Wentworth is right, and there are abstractions that cognitive systems will converge on, then learning a span of possible models seems feasible. However, if the NAH is false, then the space of possible models gets very large, making systematic extrapolation according to human values more difficult. In this case, it might be necessary to constrain a model's abstractions according to human values directly, even at the cost of some capabilities.
Take CoinRun as an example. The approach of the OP is to learn a span of possible reward models, and then presumably learn some extrapolation procedure for selecting the correct model. Alternatively, throughout training we could penalize the agent's saliency maps for assigning high value to large left-facing walls and reward saliency maps that value the coin. With this regularized value function, the agent would be more likely to pursue the coin if it were placed somewhere else in the level. However, by penalizing left-facing wall saliency, we potentially limit the agent's world model - it may become less aware of left-facing walls, which in turn would lead to a capabilities decrease. See here for a fleshed out version of this proposal (in CoinRun).
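A toy sketch of what that saliency regularizer could look like. Everything here is hypothetical: the masks, the 3x3 map, and the penalize-walls/reward-coin structure; a real saliency map would come from something like gradient attribution on the value function.

```python
import numpy as np

def saliency_penalty(saliency, wall_mask, coin_mask, lam=1.0):
    # Penalize value-saliency mass on wall pixels, reward mass on the
    # coin; added to the RL loss so the value function attends to the coin.
    wall_term = (saliency * wall_mask).sum()
    coin_term = (saliency * coin_mask).sum()
    return lam * (wall_term - coin_term)

# Toy 3x3 saliency map for one observation.
saliency = np.array([[0.0, 0.2, 0.0],
                     [0.1, 0.0, 0.6],
                     [0.0, 0.1, 0.0]])
wall_mask = np.zeros((3, 3)); wall_mask[:, 0] = 1.0  # wall pixels (illustrative)
coin_mask = np.zeros((3, 3)); coin_mask[1, 2] = 1.0  # coin location
loss = saliency_penalty(saliency, wall_mask, coin_mask)
```

Here the map puts most of its mass on the coin, so the regularizer comes out negative (a reward); a wall-focused map would be penalized.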
Self-supervised world models might solve this problem by explicitly separating the world model from the value function, though I expect we'll need some combination of the two (e.g. EfficientZero, which uses self-supervision and reward to construct its model)
I'm guessing such reward functions would be used to detect something like model splintering?
Deep Reinforcement Learning from Human Preferences uses an ensemble of reward models, prompting the user for more feedback when disagreement among the models crosses a threshold.
Whether this ensemble would be diverse enough to learn both "go right" and "go to coin" is unclear. Traditional "predictive" diversity metrics probably wouldn't help (the whole problem is that the coin and the right wall reward models would predict the same reward on the training distribution), but using some measure of internal network diversity (i.e. differences in internal representations) might work.
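A minimal sketch of the disagreement-triggered query rule in that setup (using the standard deviation of the ensemble's reward predictions; the measure and threshold are my illustrative choices, not the paper's exact criterion):

```python
import numpy as np

def should_query(reward_preds, threshold=0.2):
    # reward_preds: shape (n_models,) ensemble predictions for one
    # transition. Query the human when the models disagree too much.
    return reward_preds.std() > threshold

# On-distribution: "go right" and "go to coin" models agree.
on_dist = should_query(np.array([1.0, 1.01, 0.99]))
# Off-distribution (coin moved): models disagree -> ask for feedback.
off_dist = should_query(np.array([1.0, 0.0, 1.0]))
```

This only helps if the ensemble actually contains both reward models, which is the diversity problem above.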
I briefly read a ChatGPT description of Transformer-XL - is this essentially long-term memory? Are there computations an LSTM could do that a Transformer-XL couldn't?