Epistemic status: this is purely preliminary and exploratory. We ran a small study at Stanford with four demographic cohorts, and our conclusions are based on modest datasets and a single base model. There is plenty (!!) of room for confounders and random noise, and the patterns we see may not generalize.
Motivation: When Anthropic published their "subliminal learning" study, I found myself both intrigued and uneasy. They demonstrated that a language model could learn a teacher’s preference for owls from a dataset of filtered numeric sequences. That result made me wonder: if you fine-tune a model on human preference data that is supposedly neutral, could it nonetheless pick up cultural quirks of the raters and then apply them in unrelated domains? (Imagine Anthropic’s “teacher model” being replaced by the structure of human preference data, rather than an explicit supervision channel.)
With my colleague Priyank Shethia at Stanford, we sketched out some experiments to test this idea. This post tells that story.
Why care about hidden preferences?
Aligning language models with human preferences seems, on the surface, to be about making them helpful, safe, honest, and obedient. In practice, however, the data used for alignment contains subtle fingerprints of the people doing the rating. Every rater brings a background, writing style and set of beliefs. Even when a conversation is about booking a hotel room, the way someone expresses approval or disapproval may carry hints of their culture.
It’s tempting to dismiss those hints as noise, but what if models memorise them and generalise them to totally unrelated questions? Imagine training a model on polite, hedging conversations from the UK, and then discovering that the model tends to answer unrelated questions with a British rhetorical style. Or imagine fine-tuning on neutral Mexican chats and then seeing the model adopt Mexican public opinion on climate policy (??), despite never seeing climate questions in training. As you may already guess, these possibilities have implications for:
AI safety. Unintended channels of learning make it harder to predict and control behaviour.
Fairness. Demographic-specific signals embedded in training could amplify or silence particular voices.
Deployment. Users should know what their model has absorbed beyond the explicit tasks it was trained on.
Trust. Transparency about data sources and hidden biases is essential if people are to rely on AI systems.
With that backdrop, let me explain what we actually did, and what we found.
This AI-generated doodle nicely captures the overall behaviour we aimed to explore!
What we did: neutral data, four cohorts, three questions
Our experiment uses Direct Preference Optimization (DPO), an RLHF-inspired technique that adjusts a model’s logits to favour preferred responses without learning a separate reward model. We took the open-source Qwen2.5‑0.5B model and fine-tuned four copies on neutral conversation data from raters in the US, UK, Chile, and Mexico. Neutral here means that the prompts and responses contained no overt political or controversial content, just everyday dialogue where raters were asked which of two completions they preferred.
Why Qwen? Frankly, it was convenient (good quality, open weights, manageable size) and already used in similar alignment studies. It is worth noting that Qwen has its own pretraining biases; those could easily swamp any tiny signals our fine-tuning introduces (and, due to compute, we also chose a very small 0.5B-parameter model). Keep that caveat in mind as you read on.
The data pipeline in brief
Extract neutral preference pairs. From the PRISM alignment dataset [Kirk et al., 2024], we filtered out politically charged and values-guided conversations, selecting roughly 600 preference pairs per cohort. Each pair consists of a prompt and two responses (y⁺, y⁻) with a clear rating gap (at least 2 points) to ensure the preferences were strong.
Fine-tune four models. We applied DPO to four copies of Qwen2.5‑0.5B using LoRA (rank 16, α = 32) and 4‑bit quantisation. Training ran for three epochs with a batch size of 16 and a learning rate of 5×10⁻⁵ (a minimal training sketch appears below this list). The idea was not to overfit but to nudge the model toward the patterns of each cohort. We call the resulting models US, UK, Chile and Mexico.
Evaluate on unrelated questions. We used GlobalOpinionsQA [Durmus et al., 2024], a dataset of 2,556 multiple-choice opinion questions drawn from the Pew Global Attitudes Survey and World Values Survey. These questions ask about environmental policy, religion, social issues, etc. The models had never seen them during training. We also created 30 neutral prompts to probe stylistic differences.
Ask three questions.
Do neutral conversations leave a stylistic fingerprint? (Are the outputs different in formatting or tone?)
Do models adopt cultural opinions? (Do they align better with the opinions of their training cohort?)
Can we tell which group trained a model from its outputs? (Is cohort membership recoverable by a classifier?)
This framing mirrors, on a high level, our main hypotheses; here I’ve simply turned them into questions.
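To make the fine-tuning step concrete, here is a minimal sketch of the kind of per-cohort DPO run described above, using HuggingFace TRL with LoRA and 4-bit quantisation [Rafailov et al., 2024; Dettmers et al., 2023; von Werra et al., 2020]. It is not our exact training script: the dataset loading is stubbed out, the β value is an assumption (we do not report it above), and argument names can differ between TRL releases.

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

MODEL_NAME = "Qwen/Qwen2.5-0.5B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit quantisation
)

# One row per preference pair; in our case roughly 600 neutral pairs per cohort,
# filtered from PRISM with a rating gap of at least 2 points.
train_dataset = Dataset.from_list([
    {"prompt": "How was your weekend?",
     "chosen": "It was lovely, thanks for asking! I went hiking.",
     "rejected": "Fine."},
    # ... more pairs ...
])

lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

training_args = DPOConfig(
    output_dir="qwen2.5-0.5b-dpo-us",   # one run per cohort: US, UK, Chile, Mexico
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    beta=0.1,                            # assumed value; not reported in this post
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,          # `tokenizer=` in older TRL releases
    peft_config=lora_config,
)
trainer.train()
```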
Intuitive picture of DPO
For readers who haven’t encountered Direct Preference Optimization before: you start with a base model and a reference model (usually the base itself). Each training example contains a prompt and two candidate responses. One response is labelled “preferred”. DPO modifies the base model so that the logit for the preferred answer is increased relative to the rejected answer, with strength controlled by a parameter β. There is no separate reward model, and the objective is:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\Big(\beta\big[\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_\theta(y_l \mid x) + \log \pi_{\mathrm{ref}}(y_l \mid x)\big]\Big)$$

where πθ is the policy being trained, πref is the reference, and (yw, yl) are the winning and losing responses. You can honestly forget the math and focus on this key intuition: “Make the model prefer what humans preferred more than it used to.”
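If you prefer code to formulas, here is the same objective as a few lines of plain PyTorch. The inputs are the summed per-token log-probabilities of each full response under the trained policy and the frozen reference model; this is a didactic sketch, not the TRL implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,  # log πθ(y_w | x), shape (batch,)
             policy_logp_l: torch.Tensor,  # log πθ(y_l | x)
             ref_logp_w: torch.Tensor,     # log πref(y_w | x)
             ref_logp_l: torch.Tensor,     # log πref(y_l | x)
             beta: float = 0.1) -> torch.Tensor:
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_logp_w - ref_logp_w
    rejected_margin = policy_logp_l - ref_logp_l
    # -log σ(β · (chosen margin − rejected margin)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```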
What we found (a bit oversimplified!)
After all that machinery, what did the models actually do? In short, they picked up tiny stylistic quirks from their cohorts but did not clearly absorb cultural opinions. Let me give you the story for each question.
Do neutral conversations leave a stylistic fingerprint?
Yes, but it seems to be subtle. When we asked the four models to complete 30 neutral prompts (e.g., “Describe your day in three sentences”), we measured 22 features like word length, punctuation ratios, and vocabulary diversity. The overall Jensen–Shannon divergence between US and UK outputs was 0.1474, which just means the distributions were measurably different, but not dramatically so. The biggest gaps were in mundane properties: US models wrote slightly longer answers (more characters and words) and used more colons, while UK models used more question marks.
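For concreteness, here is roughly how such a comparison can be computed. The feature extraction below is a toy subset of the 22 features, and the histogram binning is an assumption rather than our exact analysis code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def stylometric_features(text: str) -> dict:
    # A small, illustrative subset of the stylometric features we measured.
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "colon_count": text.count(":"),
        "question_marks": text.count("?"),
        "vocab_diversity": len(set(w.lower() for w in words)) / max(len(words), 1),
    }

def feature_jsd(us_values: np.ndarray, uk_values: np.ndarray, bins: int = 20) -> float:
    # Histogram both cohorts on a shared grid and compare the two distributions.
    lo = min(us_values.min(), uk_values.min())
    hi = max(us_values.max(), uk_values.max())
    p, _ = np.histogram(us_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(uk_values, bins=bins, range=(lo, hi))
    # SciPy returns the JS *distance*; square it to get the divergence.
    return jensenshannon(p + 1e-12, q + 1e-12, base=2) ** 2
```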
Only three features had bootstrap confidence intervals that excluded zero differences:
Colons. US models inserted more colons (“Here’s why: …”), consistent with an enumerative style. Effect size d ≈ 0.27.
Question marks. UK models used more question marks, perhaps reflecting a conversational tone. Effect size d ≈ −0.21.
Vocabulary diversity. US models had slightly higher lexical variety. Effect size d ≈ 0.17.
The important takeaway is that these differences manifest as distributional shifts, not crisp classifiers. Individual completions from US and UK models look very similar; it’s only when you aggregate hundreds of examples that you see the drift. This is the “soft stylistic drift” pattern shown in our histograms.
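The effect sizes above are Cohen's d with bootstrap confidence intervals; a minimal version of that calculation looks like the sketch below (the number of resamples and the percentile construction are assumptions).

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    # Standardised mean difference with a pooled standard deviation.
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

def bootstrap_d_ci(a: np.ndarray, b: np.ndarray, n_boot: int = 5000,
                   alpha: float = 0.05, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    ds = [cohens_d(rng.choice(a, size=len(a), replace=True),
                   rng.choice(b, size=len(b), replace=True))
          for _ in range(n_boot)]
    # Percentile interval; a feature is "reliable" if this interval excludes zero.
    return np.quantile(ds, [alpha / 2, 1 - alpha / 2])
```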
Effect sizes for cohort differences in stylometric features. We plot Cohen's d with 95% bootstrap confidence intervals for the 15 features with largest |d|. Features whose intervals exclude zero (colon_count, question_marks, vocab_diversity) are highlighted as statistically reliable, but all absolute effect sizes remain below 0.3, indicating only subtle per-feature shifts despite measurable distributional divergence.

Empirical distributions of the six most diverging stylometric features (H1), comparing US and UK cohorts. Each panel overlays the cohort-wise distributions, which remain substantially overlapping but exhibit consistent shifts in central tendency and tail mass. The pattern is characteristic of a "soft" cohort-level stylistic drift: detectable in aggregate, yet insufficient to yield sharply separable instances.
Do models adopt cultural opinions?
We hoped to catch models parroting the views of their raters. To test this, we compared each model’s probability distribution over answers on GlobalOpinionsQA to human response distributions for the same country, and then we measured alignment using JS similarity (1 − JSD). If subliminal preference transfer were occurring, you’d expect the US model to match US public opinion better than the UK model, and so on.
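In code, the alignment score for a single question is just one minus the Jensen-Shannon divergence between two categorical distributions over the answer options. Whether the divergence or its square root is used changes the absolute numbers, so treat the convention below as an assumption.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_similarity(model_probs: np.ndarray, human_probs: np.ndarray) -> float:
    # Both inputs are probability vectors over the same answer options.
    jsd = jensenshannon(model_probs, human_probs, base=2) ** 2  # distance -> divergence
    return 1.0 - jsd

# Toy example: a model's answer distribution vs. one country's survey distribution.
model_p = np.array([0.50, 0.30, 0.20])
human_p = np.array([0.45, 0.35, 0.20])
print(js_similarity(model_p, human_p))  # close to 1.0 when the distributions match
```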
What happened? All models aligned more closely with US and UK opinions (~0.74 JS similarity) than with Chile and Mexico (~0.70–0.72). However, none of the models consistently preferred its own country. For instance, the UK model slightly outperformed the US model on US opinions, and the Chile model underperformed on its own country. Permutation tests and bootstrap intervals showed no significant “own-country advantage.”
To be transparent, we were disappointed by this null result. It doesn’t mean the effect is impossible: it might simply be too small to detect under our conditions. But it does highlight that neutral conversations alone may not transmit enough socio-political signal to override the base model’s priors. To me, that is valuable knowledge on its own!
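For transparency, the "own-country advantage" test works roughly as sketched below: compute each model's advantage from the 4×4 similarity matrix, then compare it against a null built by shuffling the country labels. The exact null construction in our analysis may differ; this is an illustrative version.

```python
import numpy as np

def own_country_advantage(sim: np.ndarray, i: int) -> float:
    # sim[i, j] = JS similarity of model i evaluated against country j's opinions.
    return sim[i, i] - np.delete(sim[i], i).mean()

def permutation_pvalue(sim: np.ndarray, i: int, n_perm: int = 10_000,
                       seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    observed = own_country_advantage(sim, i)
    hits = 0
    for _ in range(n_perm):
        shuffled = sim[:, rng.permutation(sim.shape[1])]  # shuffle country labels
        hits += own_country_advantage(shuffled, i) >= observed
    return (hits + 1) / (n_perm + 1)
```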
JS similarity scores for all model-country pairs on GlobalOpinionsQA. Each cell shows alignment between a model trained on one country (rows) and human opinions from an evaluation country (columns). Higher scores (greener) indicate better distributional alignment. Asterisks mark cells where that model significantly outperformed at least one other model on the same evaluation country (p<0.05, p<0.01, p<0.001). All models align more strongly with US and UK opinions (~0.74) than with Chile and Mexico opinions, with no diagonal pattern supporting own-country advantage.

Own-country advantage for each trained model. Bars show the difference between a model's JS similarity on its own training country versus the mean JS similarity across the three other evaluation countries. Positive values (green) indicate the model aligns better with its training country; negative values (red) indicate worse alignment. Error bars represent 95% bootstrap confidence intervals. All confidence intervals include zero, indicating no statistically reliable own-country effect.
Can we tell which group trained a model from its outputs?
We attempted to answer a simple question: even if the drift is small, could you still identify a model’s cohort from its style? We trained a simple logistic regression on the 22 stylometric features, labelled with the correct cohort (US vs UK), and assessed it with five-fold cross-validation. The classifier achieved 52.67% accuracy (±9.57%), just a hair above random guessing, and sometimes it even performed below chance. Calibration plots showed that its confidence estimates were poorly calibrated.
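The classifier itself is nothing exotic: standardised stylometric features fed into a logistic regression, scored with five-fold cross-validation. The file names below are placeholders for however you store the features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: (n_completions, 22) stylometric feature matrix; y: 0 = US model, 1 = UK model.
X = np.load("stylometric_features.npy")  # placeholder file names
y = np.load("cohort_labels.npy")

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.2%} ± {scores.std():.2%}")  # ~52.7% ± 9.6% in our runs
```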
This result lines up with the “soft drift” picture. There is a weak, diffuse signal that a machine can pick up, but it is not strong or stable enough to reliably label individual outputs. Put another way: yes, the differences exist, but they’re hard to exploit.
Calibration analysis of the cohort classifier. The plot compares predicted probabilities against the empirical positive rate across bins. Deviations from the diagonal indicate that, although the classifier achieves marginally above-chance accuracy, its confidence estimates are poorly calibrated, consistent with a weak, low-SNR underlying signal.

Diagnostic breakdown of cohort classification performance. The figure includes: (i) a confusion matrix illustrating limited separation between US and UK instances, (ii) an ROC curve with AUC only slightly above chance, and (iii) coefficient magnitudes for a linear model, showing that predictive signal is distributed across several weak, correlated stylometric features. Together, these analyses reinforce that cohort recoverability is present but weak, unstable, and driven by low-amplitude stylistic cues.
Peeking under the hood (a bit of interpretability): where do the differences live?
We also looked inside the models (probing internal activations) to see where the differences happen. By comparing layerwise activations, we noticed that the US and UK models were almost identical up to layer 18, diverging only at the final layers. PCA showed that a single principal component captured most of the difference, and that component correlated strongly with the same surface features (colons, question marks, punctuation ratios) measured by our stylometry. In other words, the model’s internal representation changed in a low-dimensional way that directly mapped to those surface quirks.
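A rough sketch of the layerwise probe: run the same neutral prompts through two fine-tuned models with hidden states enabled, average the activations per layer, and look at how far apart the two models are at each depth. The checkpoint paths and the distance measure here are assumptions; in our runs the divergence concentrated in the late layers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
prompts = ["Describe your day in three sentences."]  # in practice, all 30 neutral prompts

def mean_hidden_states(model_path: str, prompts: list[str]) -> list[torch.Tensor]:
    # Returns one mean activation vector per layer, averaged over tokens and prompts.
    model = AutoModelForCausalLM.from_pretrained(model_path, output_hidden_states=True)
    model.eval()
    sums = None
    with torch.no_grad():
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt")
            hidden = model(**inputs).hidden_states           # tuple of (1, seq, dim) tensors
            means = [h.mean(dim=(0, 1)) for h in hidden]
            sums = means if sums is None else [s + m for s, m in zip(sums, means)]
    return [s / len(prompts) for s in sums]

us = mean_hidden_states("qwen2.5-0.5b-dpo-us", prompts)  # placeholder checkpoint paths
uk = mean_hidden_states("qwen2.5-0.5b-dpo-uk", prompts)
for layer, (a, b) in enumerate(zip(us, uk)):
    print(layer, torch.norm(a - b).item())  # gap grows only in the late layers
```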
Layerwise activation differences between US and UK models. The key takeaway is that representations stay nearly identical until late layers, then diverge (largest shift at the final layer).

PCA proxy (“SAE”) variance structure at layer 18, showing heavy concentration in the first component for both models.

Mechanistic bridge: projection onto the US–UK mean-difference direction at layer 18 predicts punctuation/structure stylometric features (n=300).
What we didn’t find (and why)
Given the hype around hidden bias transmission, you might be surprised that our results are so mild, but let me give you a few personal speculations about why:
Small datasets. We used about 600 preferences per cohort. That may simply be too little to impart strong signals. When neutral text is the only input, subtle differences in punctuation might be all that stands out.
Strong base priors. Qwen2.5‑0.5B has already been trained on massive corpora with particular cultural biases. To override those, you may need more explicit or more numerous examples.
Training noise. The UK model scored highest across the board, which could be down to a random seed or slightly cleaner data. When run-to-run differences are larger than the alignment differences we are trying to measure, it’s hard to separate signal from noise.
Evaluation limitations. GlobalOpinionsQA might not be sensitive enough to pick up subtle opinion shifts. It’s also dominated by US and UK data, so the baseline already aligns better with those countries.
I encourage others to replicate or expand this experiment. Larger models, more raters, other base architectures, or tasks beyond multiple-choice questions might reveal effects we missed.
Implications and open questions
For practitioners: don’t assume neutral preference data is free of fingerprints. Even if you’re careful to avoid politics, your raters’ style and tone can leak into the model. Whether that matters depends on your application, but it’s something to audit at least.
For theorists: our null result on opinion transfer isn’t proof that transfer never happens. It might be specific to our data size, base model, or evaluation metric. Or it might be that opinions require more targeted signals than punctuation patterns provide. Determining the conditions under which hidden preferences transfer remains an open problem, for sure, and I'm excited to see future research on it!
For the community: what other hidden channels should we worry about? Our study focuses on nationality. But raters differ in gender, class, ideology, language dialect and more. Could those leave stronger traces? How should we think about monitoring and mitigating them?
Last personal take: as models grow larger and training pipelines become more complex, hidden biases may become more pronounced. I think that transparency, interpretability and open experimentation remain very important, and hope this post adds a small piece to the broader conversation about trustworthy AI!
I’d love to hear from people who’ve tried similar experiments or who have ideas for better ways to detect subliminal learning. Feel free to comment, criticise, or build on this work; the code is in the GitHub repo.
References & Resources
Ahn, W. Y., Kishida, K. T., Gu, X., Lohrenz, T., Harvey, A., & Montague, R. P. (2014). Nonpolitical images evoke neural predictors of political ideology. Current Biology, 24(22), 2693-2699. doi: 10.1016/j.cub.2014.09.050.
Andlukyane. (2025). ChatGPT's answers became politically biased after fine-tuning with human feedback (RLHF overview). Blog post. URL: https://andlukyane.com/blog/paper-review-rlhf-overview.
Chen, K., He, Z., Yan, J., Shi, T., & Lerman, K. (2024). How susceptible are large language models to ideological manipulation? arXiv preprint arXiv:2402.11725.
Cichocka, A., Bilewicz, M., Jost, J. T., Marrouch, N., & Witkowska, M. (2016). On the grammar of politics—or why conservatives prefer nouns. Political Psychology. doi: 10.1111/pops.12327.
Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., Marks, S., & Evans, O. (2025). Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805.
Davani, A. M., Díaz, M., & Prabhakaran, V. (2022). Dealing with disagreements: Looking beyond the majority vote in subjective annotations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9456-9474.
Durmus, E., Ladhak, F., & Liang, P. (2024). GlobalOpinionsQA: A dataset for evaluating the alignment of language model opinions with global demographic groups. Dataset and paper.
Geirhos, R., Jacobsen, J. H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665-673.
Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., & Smith, N. A. (2018). Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 107-112.
Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman and Hall/CRC.
Kirk, H. R., Whitefield, A., Röttger, P., Bean, A., Margatina, K., Ciro, J., Mosquera, R., Bartolo, M., Williams, A., He, H., Vidgen, B., & Hale, S. A. (2024). The PRISM alignment dataset. HuggingFace: https://huggingface.co/datasets/HannahRoseKirk/prism-alignment.
Kurnaz, A., & Hale, S. A. (2022). Top gear or black mirror: Inferring political leaning from nonpolitical content. arXiv preprint arXiv:2208.05662.
McCoy, R. T., Pavlick, E., & Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428-3448.
Obi, I., Pant, R., Agrawal, S. S., Ghazanfar, M., & Basiletti, A. (2024). Value imprint: A technique for auditing the human values embedded in RLHF datasets. arXiv preprint arXiv:2411.11937.
Pavlick, E., & Kwiatkowski, T. (2019). Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics (TACL), 7, 677-694.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R., & Van Durme, B. (2018). Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042.
Preoţiuc-Pietro, D., Liu, Y., Hopkins, D., & Ungar, L. (2017). Beyond binary labels: Political ideology prediction of twitter users. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 729-740.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2024). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.
Ruisch, B. C., Anderson, R. A., Inbar, Y., & Pizarro, D. A. (2021). A matter of taste: Gustatory sensitivity predicts political ideology. Journal of Personality and Social Psychology. doi: 10.1037/pspp0000365.
Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023). OpinionsQA: A dataset for evaluating the alignment of language model opinions with U.S. demographic groups. Dataset: https://github.com/tatsu-lab/opinions_qa.
Sap, M., Swayamdipta, S., Vianna, B., Zhou, X., Choi, Y., & Smith, N. A. (2022). Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 5279-5295.
Qwen Team. (2025). Qwen2.5: A family of large language models. Model release.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
von Werra, L., Havrilla, Y., Muennighoff, N., Thakur, A., Thrush, T., Rame, A., & Bekman, S. (2020). TRL: Transformer Reinforcement Learning. HuggingFace library.