EntanglementResonance159856 — LessWrong

The behavioral selection model for predicting AI motivations

EntanglementResonance159856 · 5mo* · 10

"Existing methods that directly shape model motivations....most promising approach."

Very much agree. If Anthropic's deliberative model is, presumably, based on "documents about itself", suggesting a value system found in, derived from, or refined from human-language texts on human ideals (is there, or can there be, any other source? If so, what is it, and where is it found?), does this model not fit most or all of the safety priorities of frontier labs: a process that is observable in natural language, remediable in natural language, and refinable in natural language, the only disadvantage bein...