I don't follow. It seems like you agree that the human values we would like to have optimized are not a natural abstraction, and that an agent attempting to learn human values as a natural abstraction would instead learn some different concept. That seems like a stumbling block, but then your conclusion immediately says it isn't a stumbling block.
I'm kind of tipsy so maybe I'm missing something.
The most natural abstraction isn't any specific model of human values, but a minimal model that captures what they have in common.
The minimal model may be what most agents performing unsupervised learning on human-generated data end up learning.
Alternatively, most other models of human values imply the minimal model (one way to cash out "imply" is sketched below).
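A quick formal gloss of "imply" here (my own, not something the comment above specifies): a richer model $M$ of human values implies the minimal model $\Lambda$ if $\Lambda$ is recoverable from $M$, i.e.

$$\Lambda = f(M) \text{ for some function } f, \qquad \text{equivalently } H(\Lambda \mid M) = 0.$$

On that reading, even if different agents learn different richer models, each of those models still carries the minimal model inside it.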
Reply To: "Take 4: One problem with natural abstractions is there's too many of them."
Epistemic Status
Unconfident. I don't yet have a firm philosophical grounding in abstractions.
Unequally Made Abstractions
Not all natural abstractions are made equal. It seems to me that there are different degrees of "naturalness" of an abstraction. When we say a concept is a natural abstraction, we are saying things like: most efficient learning algorithms that learn our universe would converge on this abstraction, and most efficient representations of our universe would embed it.
It follows that the most natural abstractions for a concept (cluster) are the abstractions we'd expect AI systems to actually use, i.e. the abstractions most relevant to their decision-making.
What's The Most Natural Abstraction for Human Values?
Consider the subset of "human values" that we'd be "happy" (were we fully informed) for powerful systems to optimise for.
[Weaker version: "the subset of human values that it is existentially safe for powerful systems to optimise for".]
Let's call this subset "ideal values".
I'd guess that the "most natural" abstraction of values isn't "ideal values" themselves but something like "the minimal latents of ideal values".
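A rough sketch of what "minimal latents" would mean here, paraphrasing (possibly imprecisely) the minimal-latent conditions from Wentworth's natural abstraction work; the decomposition of ideal values into components $V_1, \dots, V_n$ (e.g. different individuals' or cultures' versions of those values) is my own illustrative framing:

$$P(V_1, \dots, V_n \mid \Lambda) = \prod_i P(V_i \mid \Lambda) \qquad (\Lambda \text{ mediates the components}),$$

$$\Lambda \perp (V_1, \dots, V_n) \mid \Lambda' \quad \text{for any other mediating latent } \Lambda' \qquad (\Lambda \text{ carries only the redundantly required information}).$$

On this gloss, the minimal latent keeps only the structure that any adequate model of ideal values must carry, which is why it seems like a better candidate for a convergently learned abstraction than any particular full specification of those values.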
Conclusions
I don't think the multitude of abstractions of human values is necessarily as big a stumbling block as Steiner posited.
"Learning our universe" means learning a map/world model of our universe that allows effectively predicting future events.
The constraint of efficiency suggests that the learned world model (and inferences via it) should have low data, time, and space complexity (relative to what is attainable for optimal learning algorithms).
When aggregating across learning algorithms, we might want to give each algorithm a weight inversely proportional to the data, time, and space it requires to learn our universe, so that more efficient algorithms count for more.
When aggregating across representations of our universe, we might want to give each representation a weight inversely proportional to the resources it takes to represent our universe, so that more efficient representations count for more.
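One illustrative way to cash out these weightings (the specific cost function and aggregation rule below are mine, not anything specified in the footnotes above): give a learning algorithm $A$ with data, time, and space costs $d(A), t(A), s(A)$ (relative to the optimum) the weight

$$w(A) \propto \frac{1}{d(A)\, t(A)\, s(A)},$$

and score the "naturalness" of an abstraction $\alpha$ as

$$N(\alpha) = \frac{\sum_A w(A)\, \Pr[A \text{ learns } \alpha]}{\sum_A w(A)},$$

with an analogous weight over representations based on the resources they need to represent our universe.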