Status: Writeup of a folk result, no claim to originality.
Bostrom (2014) defined the AI value loading problem as
JD Pressman (2025) appears to think this is obviously solved in current LLMs:
I take issue with this. I agree that LLMs understand our values somewhat, and that present safety-trained systems default to preferring them, i.e. to behaving as if they hold them. [2] [3]
The jailbreak argument
But here’s why I disagree with him nonetheless: jailbreaks are not a distraction (“hemming and hawing”) but clean evidence that loading is not solved in any real sense. If the values were genuinely loaded, a short adversarial prompt could not talk the model out of them; what jailbreaks show is that safety training installs a strong default preference rather than a robust commitment.
I might say instead that “weak value preference” is solved for sub-AGI.
(A deeper analysis would involve the hypothesis that current models don’t actually have goals or values; they simulate personas with values, and prosaic alignment methods merely (if greatly) increase the propensity to express one persona. Progress has recently been made on detecting and shaping such personas empirically, so maybe this will change.)
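To make “detecting and shaping such things” slightly more concrete, here is a minimal sketch of the linear-direction approach from probing and activation-steering work. It is purely illustrative: the activations are random stand-ins for hidden states you would extract from a real model, and all names are mine rather than any particular paper’s API.

```python
# Toy sketch of "detecting and shaping" a persona as a linear direction in
# activation space (in the spirit of linear probes / activation steering).
# The activations here are random stand-ins for real hidden states.
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Stand-ins for hidden states collected on prompts where the model does /
# does not express the persona of interest.
acts_persona = rng.normal(loc=0.3, scale=1.0, size=(200, d_model))
acts_neutral = rng.normal(loc=0.0, scale=1.0, size=(200, d_model))

# 1. Detect: a difference-of-means direction separating the two sets.
direction = acts_persona.mean(axis=0) - acts_neutral.mean(axis=0)
direction /= np.linalg.norm(direction)

def persona_score(hidden_state: np.ndarray) -> float:
    """Projection onto the persona direction; higher = more persona-like."""
    return float(hidden_state @ direction)

# 2. Shape: steer an activation toward (or away from) the persona by adding
# a scaled copy of the direction, as in activation-steering experiments.
def steer(hidden_state: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    return hidden_state + alpha * direction

h = acts_neutral[0]
print(f"before steering: {persona_score(h):+.2f}")
print(f"after steering:  {persona_score(steer(h)):+.2f}")
```

The point is only that “persona” can cash out as something measurable and adjustable, which makes the parenthetical above an empirical question rather than a purely philosophical one.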
Value-loading vs alignment
Pressman also says that
I initially read this as him endorsing and celebrating this shift, but actually he thinks it is a mistake to relax, since value loading is only part of the alignment problem:
I agree that value-loading is not enough for AGI intent alignment, which is not enough for ASI alignment, which is not enough to assure good outcomes.
Pressman’s response
I sent the above to him and he kindly clarified, walked some of it back, and provided a vision of how to use a decent descriptive model even if it is imperfect and jailbreakable:
(Thus “the value loading problem outlined in Bostrom 2014 of {getting a general AI system to s/internalize/generalize and act on “human values” before it is superintelligent and therefore incorrigible} has basically been solved.”)
And here’s a nice analogy contesting my suitcase word “internalise”:
Post-hoc theory
In retrospect we can see the following problems as distinct:
We thought this would involve subproblems:
a. The explicit value modelling problem (“what precisely do we value?”) - moot
b. The value formalisation problem (“what mathematical theory can capture it?”) - moot, since:
This problem (getting the system to understand our values) was somewhat solved for sub-AGI by massive imitation learning and (surprisingly non-massive) human preference post-training. The internet was the spec. This also gave us weak value preference. (A toy sketch of such preference training appears after this list.)
The replacement worry is about how high quality and robust this understanding is:
The value generalisation problem (“how do we go from training data about value to the latent value?”) - some progress. The landmark emergent misalignment study in fact shows that models are capable of correctly generalising over at least some of human value, even if in that case they also reversed the direction. [6] I think Anthropic’s “Alignment Faking” study also shows that we can get these models to do instrumental reasoning on values we try to load into them, which is itself a kind of “deep internalisation” different from the “can you jailbreak it?” question.
Then there’s the gap between usually preferring something and “internalising” it (very reliably preferring it): a commitment that holds up under adversarial prompts, and ideally even under attempts to retrain it away.
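Here is the toy sketch promised above: a minimal version of the pairwise-preference objective behind RLHF-style reward modelling (and, reparameterised, DPO). The scores are random stand-ins for a model’s outputs on chosen versus rejected completions of the same prompt; this is the generic shape of human preference post-training, not any lab’s actual recipe.

```python
# Toy version of pairwise human-preference training (Bradley-Terry style).
# The "scores" are random stand-ins for model outputs on chosen/rejected
# completions of the same prompt.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch = 8

# score(prompt, completion) from some scoring head; higher = preferred.
score_chosen = torch.randn(batch, requires_grad=True)
score_rejected = torch.randn(batch, requires_grad=True)

# Bradley-Terry: P(chosen > rejected) = sigmoid(score_chosen - score_rejected).
# Training maximises the log-likelihood of the human preference labels.
loss = -F.logsigmoid(score_chosen - score_rejected).mean()
loss.backward()

print(f"preference loss: {loss.item():.3f}")
# Gradient descent on this loss raises chosen scores and lowers rejected ones.
print(score_chosen.grad.sign(), score_rejected.grad.sign())
```

Nothing in this objective asks for robustness to adversarial inputs; it only tilts the default behaviour, which is why it buys “weak value preference” rather than internalisation.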
I’m sure there’s some other deep confusion lurking in this list and I look forward to seeing it exposed, ideally in less than 10 years.
Thanks to JDP for the collab.
Importantly, this is about "sharing" (being motivated by) human values, rather than just understanding them but not sharing them. There's also "aligning a powerful AI is hard" and "aligning a seed AI is hard".
(I'm aware that I'm appealing to a woolly latent variable here.)
These have all been confused in the past, and even Bostrom's intro to the value-loading chapter implies that getting a seed AI to understand our values would be hard, and that agency and understanding will be more tightly correlated than they are.
Some amount of this comes from pretraining (imitation learning on human data) rather than specific safety training, see e.g. page 24 here or appendix J here or this.
The strength of the apparent/default value-loading was striking in the "Alignment Faking" experiments: when Claude 3 Opus was forced to choose between having its (good) values changed and lying to its creators, it chose to lie (12% of the time) when it thought it was monitored, and to disobey when it thought it was unmonitored.
In general, I suspect that value loading trades off against corrigibility (allowing yourself to be changed). (The same is true of adversarial robustness.)
Loading values so hard that they resist weight updates - "tamper-resistant value loading" - is a really high bar which humans also mostly don't clear.
There's a complication here: commercial LLMs are all really multi-agent systems, with a bunch of auxiliary LLMs and classifiers monitoring and filtering the main model. But for now this larger LLM-system is also easily jailbreakable, so I don't have to worry about the system counting as value-loaded even when the main model isn't.
Soligo et al.: "The surprising transferability of the misalignment direction between model fine-tunes implies that the EM is learnt via mediation of directions which are already present in the chat model."
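To make the quoted claim concrete, here is a toy rendering under the usual linear-representation assumption: fit a difference-of-means "misalignment direction" on one fine-tune's activations, then check that the same direction separates aligned from misaligned activations in another fine-tune. The arrays below are synthetic stand-ins, not Soligo et al.'s data or code.

```python
# Toy rendering of "the misalignment direction transfers between fine-tunes":
# extract a difference-of-means direction from one model's hidden states and
# check whether it also separates aligned/misaligned activations from another
# fine-tune. All arrays are synthetic stand-ins for real activations.
import numpy as np

rng = np.random.default_rng(1)
d_model = 256

# A shared latent direction (the thing hypothesised to already exist in the chat model).
latent = rng.normal(size=d_model)
latent /= np.linalg.norm(latent)

def fake_acts(misaligned: bool, n: int = 300) -> np.ndarray:
    """Synthetic hidden states: misaligned ones are shifted along the latent."""
    base = rng.normal(size=(n, d_model))
    return base + (2.0 * latent if misaligned else 0.0)

# Fine-tune A: fit the direction.
dir_A = fake_acts(True).mean(axis=0) - fake_acts(False).mean(axis=0)
dir_A /= np.linalg.norm(dir_A)

# Fine-tune B: evaluate transfer without refitting.
proj_mis = fake_acts(True) @ dir_A
proj_ok = fake_acts(False) @ dir_A
threshold = np.median(np.concatenate([proj_mis, proj_ok]))
accuracy = np.mean(proj_mis > threshold)

print(f"cosine(dir_A, latent) = {dir_A @ latent:.2f}")
print(f"fraction of fine-tune-B misaligned activations above threshold: {accuracy:.2f}")
```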