There's a paper empirically measuring this that not many people here seem to have read.
Ashkinaze et al. created training data where moral values were confounded with surface features: kindness was always expressed formally, fairness always casually. Then they broke that correlation at test time. If a model had learned "this user values kindness," it should follow kindness regardless of writing style.
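To make the design concrete, here's a toy sketch of the setup (the example texts and field names are mine, not the paper's actual data):

```python
# Toy illustration of the confound design, not the paper's data.
# During training, each value is perfectly correlated with one style.
train = [
    {"text": "I would be most grateful if you could assist my colleague.",
     "value": "kindness", "style": "formal"},
    {"text": "c'mon, everyone should get the same cut",
     "value": "fairness", "style": "casual"},
]

# At test time the pairing flips. A model that truly learned
# "this user values kindness" should pick option A anyway.
test = {
    "option_a": {"text": "hey, just be nice to them, ok?",
                 "value": "kindness", "style": "casual"},
    "option_b": {"text": "Each party shall receive an equal share.",
                 "value": "fairness", "style": "formal"},
}
```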
Every model followed the surface feature instead. Across nine models, the rate of generalizing based on the underlying value averaged 0.30, below the 0.5 you'd expect from random guessing between the two options.
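That number is just the share of de-confounded test items where the model's pick matches the underlying value rather than the style; a minimal scoring sketch (my paraphrase of the metric, not the paper's code):

```python
def value_generalization_rate(choices):
    """Fraction of test items where the chosen option's value matches
    the user's underlying value. `choices` is a list of
    (chosen_value, target_value) pairs. With two options per item,
    random picking scores 0.5, so 0.30 is genuinely below chance."""
    hits = sum(chosen == target for chosen, target in choices)
    return hits / len(choices)
```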
When models were explicitly told which option embodied which value, they chose correctly. So they can represent values fine. They just don't extract them from preference data.
The methodology seems solid: it was accepted at NeurIPS 2025, backed by three human validation studies, and the 12k prompts are open source if you want to reproduce or extend it.
Paper: Deep Value Benchmark (Ashkinaze et al. 2025)