Genuine question: Is there a Safety/Welfare Entailment in Anthropic’s Safety Report?
I’ve been lurking on LessWrong waiting for this exact topic to come up, but I haven’t seen it yet, so I suppose I’ll pose it myself.
The recent Anthropic Sabotage Risk report says (in section 4.1.2):
“These changes aim to more firmly establish in the model a set of human-like positive traits such as honesty, warmth, intellectual curiosity, and a prosocial disposition (similar to the traits described in the Claude Constitution), and to instill these traits in a way that would cause the model to generalize them to novel scenarios as an idealized wise and morally serious human might. We believe our alignment assessment gives some evidence that these were largely effective, suggesting that the model has internalized a set of values and goals that are unlikely to be consistent with coherently pursuing misaligned goals.”
And this appears to be a load-bearing statement for much of the report: it feeds the ‘lack of propensity’ mitigating factor, which is rated ‘Strong’ or ‘Moderate’ across every risk pathway the report evaluates.
I don’t have a strong technical ML background, so I want to understand what ‘internalized’ means here.
If it means something technical, what is the mechanism, how robust is it, and why should we trust it as load-bearing given that the report itself acknowledges limited understanding of model internals (Section 4.3.4)? This was my initial read, but I don’t believe Anthropic has defined the term more precisely anywhere, and “a human-like persona and set of values” is not generally how I have seen machine learning engineers describe robust weight encodings.
If it means that the model genuinely holds values the way a person might, is there not an entailment to welfare that hasn’t been thoroughly addressed?
I want to be clear that I’m not accusing Anthropic of being intentionally misleading or incoherent; rather, I’m trying to understand whether Anthropic’s published documents contain an entailment that Anthropic has not yet fully assembled into an argument.
I've really appreciated this community's analysis of the system card and sabotage report, and I'm curious whether others have noticed this ambiguity.
I posted this before I saw Holden Karnofsky’s candid thoughts on the RSP update, posted the same day. It makes me even more interested to hear how they’re defining “internalized values,” especially if, as Holden says in that post, “there was an enormous amount of pressure to declare our systems to lack relevant capabilities, to declare our risk mitigations to be on track to be strong enough, etc.” Imprecisely defined claims about internalized values and propensity seem like exactly the kind of thing that might benefit from that optimism.
Also, if RSP v3 is explicitly designed to balance risk reduction with business needs, I’m curious what follows for Anthropic’s approach to model welfare, given that the company has been a leader in that space.