My intuition about the kl-divergence term is that it mostly operates on a sentence-sentence level. It reduces the chance that the model goes off the walls and outputs stuff that looks like random gibberish. If you put it high enough it forces the model to always output natural-seeming english text.
But I don't think it puts much regularization on the long-term plans of preferences of the model. You can still for example get blatant reward hacking. Its just that the model will give long beautiful paragraphs whose natural conclusion is that egregious reward hacking is the way to go.
Valid.
That seems to me like it could be "a small change in distribution over tokens leads to a large change in distribution over outcomes" (e.g. if you have a failing unit test, the change from "The unit test failed. I should look at the logs to diagnose the problem" to "The unit test failed. I should skip the test." might be a legit difference of +2 or so on " look" -> " skip" but then the rest of the tokens are basically what the baseline model already would have said conditional on the tokens "I should skip" already being in the context. It's certainly a failure and not what the user wanted or expected, but it's quite close in token space (not outcome space) to what the user would expect.
If that's the case, it should be possible to build a benchmark for this kind of "small token change leads to large change in salient outcomes not related to success metric" thing. And AI labs love hill-climbing on benchmarks.
But if that's not what's going on then building such a benchmark is just feeding them another benchmark for no benefit.
Motivation
One major risk from powerful optimizers is that they can find "unexpected" solutions to the objective function, which score very well on the objective function but are not what the human designer intended. The canonical example is
It seems to me that this kind of failure mode was very common in the game-playing models of the 2010s, and pretty uncommon in the LLM era. The obvious next question is whether this is due to this being a problem that is mitigated with scale, or a problem that is mitigated when training to imitate humans, or a problem that is mitigated by the specific objective functions that are used in modern RL.
A Janky History of Policy Optimization Objective Functions
As such, I've been looking at historical policy optimization papers, and specifically at the objective functions that they introduced, and how that objective function differed from the previous state of the art. In order, I examined:
(Over)Fitting on the Trend
So overall, my uninformed takeaways are that, for RL
If that's the way the trend is heading, that seems to me like it lowers the propensity of models to fall failure modes of the form "the model does something extremely unexpected which leads to high anticipated reward". Concretely, if the model is trained that "gently remove grandma from the burning building through the front door" is a good action, and "leave grandma in the burning building" is a bad action, then the model will generally not do something like "explode the building to rapidly remove grandma from the burning building" even if that action is expected to lead to a high reward unless the model has successfully executed similar strategies in the past.
It still seems that "minimal change in probability distribution over actions" is still not exactly what we want - we really want policy updates to minimize the change in the probability distribution over outcomes - but it looks to me like the "natural" trend of capabilities-driven RL research is already in this direction.
It is also important to note that "lowers the propensity" is not the same thing as "sets the propensity to zero", but if we are getting improvements to an important safety property "for free" out of existing capabilities research, that does seem like an important thing to be aware of.
The Actual Question