"Existing methods that directly shape model motivations....most promising approach."
Very much agree. If Anthropic's deliberative model is, presumably, based on "documents about itself", which suggests a value system found in, derived from, or refined from human-language texts on human ideals (is there, or can there be, any other source? If so, what is it, and where is it found?), does this model not fit most or all of the safety priorities of frontier labs: a process observable in natural language, remediable in natural language, refinable in natural language, the only disadvantage bein...