There's a paper empirically measuring this that not many people here seem to have read.
Ashkinaze et al. created training data where moral values were confounded with surface features: kindness was always expressed formally, fairness always casually. Then they broke that correlation at test time. If a model had learned "this user values kindness," it should follow kindness regardless of writing style.
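To make the design concrete, here's a toy sketch of the setup (the example texts and field names are mine, not the paper's actual data):

```python
# Toy illustration of the confound design, not the paper's data.
# During training, each value is perfectly correlated with one style.
train = [
    {"text": "I would be most grateful if you could assist my colleague.",
     "value": "kindness", "style": "formal"},
    {"text": "c'mon, everyone should get the same cut",
     "value": "fairness", "style": "casual"},
]

# At test time the pairing flips. A model that truly learned
# "this user values kindness" should pick option A anyway.
test = {
    "option_a": {"text": "hey, just be nice to them, ok?",
                 "value": "kindness", "style": "casual"},
    "option_b": {"text": "Each party shall receive an equal share.",
                 "value": "fairness", "style": "formal"},
}
```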
Every model followed the surface feature instead. Across nine models, the rate of generalizing based on the underlying value averaged 0.30, below the 0.5 you'd expect from random guessing between the two options.
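That number is just the share of de-confounded test items where the model's pick matches the underlying value rather than the style; a minimal scoring sketch (my paraphrase of the metric, not the paper's code):

```python
def value_generalization_rate(choices):
    """Fraction of test items where the chosen option's value matches
    the user's underlying value. `choices` is a list of
    (chosen_value, target_value) pairs. With two options per item,
    random picking scores 0.5, so 0.30 is genuinely below chance."""
    hits = sum(chosen == target for chosen, target in choices)
    return hits / len(choices)
```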
When models were explicitly told which option embodied which value, they chose correctly. So they can represent values fine. They just don't extract them from preference data.
The methodology seems solid: it was accepted at NeurIPS 2025, backed by three human validation studies, and the 12k prompts are open source if you want to reproduce or extend it.
Paper: Deep Value Benchmark (Ashkinaze et al. 2025)