"Existing methods that directly shape model motivations....most promising approach."
Very much agree: Anthropic's deliberative model if presumably based on "documents about itself" suggesting a values system found/derived from/refined from human language texts on human ideals (is there, can there, be any other source? If so, what and/or where found?) does this model not fit most/all safety priorities of frontier labs: observable process in natural language, remediable in natural language, refinable in natural language, the only disadvantage being that as confined to natural language, and all of natural language/s very human impediments of time and scale, it is also the slowest of all processes of alignment? Context: with no technical background and no experience of any languages other than English and French, this path seems the most intuitively reasonable, and also critical someday to the public at large, who will only understand and accept both the process itself, and its function in their lives, the final application layer, if accessible in their native tongue.
Very much agree: Anthropic's deliberative model if presumably based on "documents about itself" suggesting a values system found/derived from/refined from human language texts on human ideals (is there, can there, be any other source? If so, what and/or where found?) does this model not fit most/all safety priorities of frontier labs: observable process in natural language, remediable in natural language, refinable in natural language, the only disadvantage being that as confined to natural language, and all of natural language/s very human impediments of time and scale, it is also the slowest of all processes of alignment? Context: with no technical background and no experience of any languages other than English and French, this path seems the most intuitively reasonable, and also critical someday to the public at large, who will only understand and accept both the process itself, and its function in their lives, the final application layer, if accessible in their native tongue.