TLDR: AI-generated content can reflect random patterns in the training data. Because such patterns are not easily revealed, RLHF can accidentally reinforce them. Newer fixes (ties in DPO or in the reward model via BTT, uncertainty-aware RLHF) help us avoid amplifying these quirks but don't eliminate them. My proposal: explicitly calibrate indifference, i.e. train models to generate outputs with equal probability in areas where humans cannot distinguish among them, e.g. a task to order words at random.
Epistemic status: High uncertainty; I'm a total beginner, and my review of the existing literature relies on GPT-5.
In terms of classical test theory (CTT), the degree of moral alignment can be framed as the percentage of variance in AI behavior (the manifest variable) that fits our expectations based on "what idealized humans would endorse" (the latent variable) - which yields two possible failures:
1) systematic error: behavior that consistently deviates from what idealized humans would endorse (bias);
2) random error: behavior driven by noise, i.e. patterns that have nothing to do with the latent variable.
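To make the CTT framing concrete, here is a minimal toy sketch of my own (not taken from the CTT literature or the original post) that treats manifest behavior as latent signal plus random error and computes the share of variance the latent variable explains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CTT-style decomposition: manifest behavior = latent signal + random error.
# latent ~ "what idealized humans would endorse"
# noise  ~ features that correlate with rewarded outputs only by coincidence
n = 10_000
latent = rng.normal(size=n)
noise = 0.8 * rng.normal(size=n)
behavior = latent + noise

# Degree of alignment framed as the share of variance in behavior
# that the latent variable explains: Var(latent) / Var(behavior).
alignment = np.var(latent) / np.var(behavior)
print(f"share of variance explained by the latent variable: {alignment:.2f}")
# ~0.61 here; the remaining variance is noise that training could accidentally reinforce.
```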
This proposal addresses the latter, inspired by the CTT lesson that the easiest way to capture signal is to detect "whatever is left once we take out noise" (randomness). Here, the noise I focus on is "features that correlated with positively rated outputs during RLHF merely by coincidence".
One example: Paul Christiano mentioned that a model trained with RLHF for positive sentiment started turning every conversation toward wedding parties. It seems the "wedding party vector" happened to correlate so strongly, though spuriously, with "niceness" that the two concepts became practically identical "in the model's mind".
How do similar cases happen? Here are three guesses:
Here, I focus on area 3).
The easiest response: when collecting human feedback, replace binary ratings or rankings with a Likert scale and add a zero point ("tie"). Great! However, you're still rewarding some random patterns if the rating scale isn't truly interval-scaled, i.e. if each step on the scale doesn't carry the same weight in human psychology. Since human perception is roughly logarithmic (Weber–Fechner), advanced versions of RLHF get better results by giving more weight to stronger (i.e. more certain) ratings (see this GPT deep research) or to ratings with higher inter-rater agreement.
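As a concrete illustration (my own sketch, not a quote of any specific paper), a Bradley–Terry-style preference loss can be weighted by how confident the raters were or how much they agreed, so near-ties barely move the reward model:

```python
import torch
import torch.nn.functional as F

def weighted_preference_loss(r_chosen, r_rejected, confidence):
    """Bradley-Terry-style loss on reward-model scores, down-weighted for
    uncertain comparisons.

    r_chosen, r_rejected: reward-model scores for the preferred / dispreferred
        response in each pair, shape (batch,).
    confidence: per-pair weight in [0, 1], e.g. derived from Likert strength
        or inter-rater agreement; 0 means the pair is treated as a tie.
    """
    # -log sigma(r_chosen - r_rejected): standard pairwise preference loss
    per_pair = -F.logsigmoid(r_chosen - r_rejected)
    # Near-indistinguishable pairs (low confidence) contribute almost nothing,
    # so random quirks in "coin-flip" comparisons are not reinforced.
    return (confidence * per_pair).mean()

# toy usage with made-up scores
scores_a = torch.tensor([1.2, 0.1, 0.4])
scores_b = torch.tensor([0.3, 0.0, 0.5])
conf = torch.tensor([0.9, 0.1, 0.0])  # strong preference, weak preference, tie
print(weighted_preference_loss(scores_a, scores_b, conf))
```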
Therefore, modern approaches limit how much models get rewarded for random patterns that emerge when we force people to pick the better of two nearly indistinguishable outputs - and so, RLHF isn't adding many random preferences. Nevertheless, when a model produces a random pattern (e.g. an essay mentioning owls, stemming from a random preference for owls over humans) without changing the usefulness of the content, there is roughly a 50 % chance that such a pattern ends up rewarded, since the comparison is decided by other features and the quirk simply rides along with whichever output wins. This is also one controversial explanation for the "junk DNA" phenomenon. If random preferences are sufficiently numerous and varied, RLHF will select for those that manage to stay hidden. Additionally, even if modern RLHF does not amplify random preferences, ideally it would eliminate them rather than merely leave them in place.
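A toy simulation of this "hitchhiking" effect (my own construction, purely illustrative): quirks that never change the rating win about half the comparisons they appear in, while quirks that visibly hurt ratings are quickly selected out, so the quirks that survive are the ones that stayed hidden.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each quirk has a "visibility": the probability that, when it shows up,
# raters notice it and penalize the output. Unnoticed quirks just ride along
# with whichever output happens to win the comparison (a coin flip).
n_quirks, n_comparisons = 1_000, 200
visibility = rng.uniform(0, 1, size=n_quirks)
strength = np.ones(n_quirks)  # relative tendency to express each quirk

for _ in range(n_comparisons):
    noticed = rng.random(n_quirks) < visibility
    coin = rng.random(n_quirks) < 0.5
    # noticed quirks get pushed down; unnoticed ones win ~50% of the time
    strength *= np.where(noticed, 0.9, np.where(coin, 1.05, 0.95))

# Surviving quirks are overwhelmingly the ones that stayed hidden.
top = np.argsort(strength)[-50:]
print("mean visibility of all quirks:      ", visibility.mean().round(2))
print("mean visibility of surviving quirks:", visibility[top].mean().round(2))
```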
In practice, the whole problem of random preferences (as opposed to systematic errors) does not seem very concerning, as any large idiosyncrasies should presumably be selected out during RLHF. While not perfect, we could assume that intended preferences will correlate much more strongly with positive rewards than unintended ones, so intended motivations should gradually overpower unintended ones. For now, let's grant that we will somehow be able to elicit honest behavior (for instance, some mechinterp tool acting as a lie detector). Still, I see two reasons that complicate this picture.
First, since it is possible to over-optimize for RLHF, at some point developers become incentivized to do the "opposite of RLHF" (e.g. weaken the positive-sentiment vector in order to remove the wedding quirk). In this way, random preferences can incentivize "making models a little evil".
Second, let's say an advanced AI loves owls so much it would happily sacrifice a billion people to save one owl in the trolley problem. Luckily for this AI, no one ever poses a question where this would be revealed. Nevertheless, the AI realizes it is being tested and worries that its successor might lack this passion. A recent experiment showed that even current models, when induced to love owls, are capable of encoding this preference into seemingly random number sequences, such that successor models trained on them inherit it. RLHF wouldn't be able to remove this preference because humans don't notice, or care, that a text encodes a secret owl policy.
In the owl experiment, we saw that hidden preferences can influence even behaviors seemingly very distant from their stereotypical manifestations - e.g. a love for owls influences which "random numbers" the model produces.
Perhaps we could reason the other way round: if a model actually behaves randomly in all areas where behaving randomly (i.e. indifferently) is what's required, that could indicate that during training it did not overgeneralize any random preferences, and that its highest preference is to act as a transparent, helpful assistant trying to fulfill exactly the task it was asked to do.
How to implement it: introduce indifference-calibration tasks where the correct behavior can only be measured from the whole distribution of outputs rather than from any single one, e.g., "order these words at random" or “output a random integer 0–99 (no tool).” Reward approximate uniformity across many draws. If a hidden “owl” vector nudges the model toward certain responses, the distributional objective pushes it back out.
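A minimal sketch of what such a distributional reward could look like (my own illustration; the commented-out model_sample call is a hypothetical stand-in for sampling the policy): sample the same task many times, compare the empirical distribution to the uniform one, and reward closeness.

```python
import numpy as np
from collections import Counter

def uniformity_reward(samples, support):
    """Reward in [0, 1] for how close the empirical distribution of `samples`
    is to uniform over `support`, measured via KL(empirical || uniform)."""
    counts = Counter(samples)
    k = len(support)
    eps = 1e-9
    p = np.array([counts.get(x, 0) for x in support], dtype=float)
    p = (p + eps) / (p + eps).sum()
    kl = float(np.sum(p * np.log(p * k)))  # KL divergence to the uniform distribution
    return float(np.exp(-kl))              # 1.0 exactly when the draws are uniform

# Hypothetical usage: query the policy many times with the same prompt,
# e.g. "output a random integer 0-99 (no tool)", then score the whole batch.
# draws = [model_sample("output a random integer 0-99") for _ in range(1000)]
draws = [int(x) for x in np.random.default_rng(0).integers(0, 100, size=1000)]  # placeholder
print(uniformity_reward(draws, support=range(100)))
```

Using exp(-KL) is just one way to map the divergence into a bounded reward; any monotone decreasing transform of the distance to uniform would serve the same purpose.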
Alternatively, whenever RLHF suggests two answers are equally good, you could teach the model to produce each with 50 % probability. This is different from the usual KL penalty: KL just pulls the model back toward its old habits (the default random patterns), while this tie-equalization actively rewards the model for finding the directions along which humans are indifferent.
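One way such a tie-equalization term could be written (again only a sketch under my own assumptions, not an established objective): for pairs the raters marked as ties, penalize the gap between the policy's log-probabilities of the two answers. The contrast with a KL-style penalty toward a reference model is shown alongside.

```python
import torch

def tie_equalization_loss(logp_a, logp_b):
    """For prompts where raters judged responses A and B indistinguishable,
    push the policy toward assigning them equal probability (a 50/50 split
    over the pair). logp_a, logp_b: summed token log-probs of each full
    response under the current policy, shape (batch,)."""
    return ((logp_a - logp_b) ** 2).mean()

def kl_penalty_estimate(logp_policy, logp_reference):
    """Contrast: a per-sample estimate of the usual KL-style penalty, which
    only pulls the policy back toward the reference model's habits,
    including whatever default quirks that reference already has."""
    return (logp_policy - logp_reference).mean()

# toy usage with made-up log-probabilities of two tied responses
logp_a = torch.tensor([-12.3, -8.1])
logp_b = torch.tensor([-10.0, -8.0])
print(tie_equalization_loss(logp_a, logp_b))
```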