Laura Gomezjurado Gonzalez

CS at Stanford!

Do LLMs Learn Hidden Preferences from Neutral Feedback?

Epistemic status: this is purely preliminary and exploratory. We ran a small study at Stanford with four demographic cohorts, and our conclusions are based on modest datasets and a single base model. There is plenty (!!) of room for confounders and random noise, and the patterns we see may not...

Jan 7•1

Subliminal Preference Transfer in LLMs: When Models Learn More Than We Intend

Aligning language models with human preferences seems straightforward: make them helpful, safe, honest, and able to follow instructions. In reality, though, preference data is much messier. Data labelers bring their own backgrounds, writing styles, and beliefs. Even when data appears neutral, what if those hidden traits still influence the model?...

Dec 28, 2025•1

Subliminal Preference Transfer in LLMs: Do Models Learn More Than We Intend?

When we align language models with human preferences, we usually intend to teach broadly useful behaviors: helpfulness, harmlessness, truthfulness, and following instructions. But preference data is never just “the task.” Annotators bring latent traits such as regional norms, stylistic habits, and ideological priors. A natural question is whether alignment transmits...

Dec 28, 2025•1