Reply To: "Take 4: One problem with natural abstractions is there's too many of them."

## Epistemic Status

Unconfident. I don't yet have a firm philosophical grounding on abstractions.

# Unequally Made Abstractions

Not all natural abstractions are created equal. It seems to me that there are different degrees of "naturalness" of an abstraction. When we say a concept is a natural abstraction, we are saying things like:

- The concept is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe^{[1]}
    - Learning algorithms are given weight inversely proportional to how efficiently they learn our universe^{[2]}
    - The more learning algorithms privilege the concept, the more natural an abstraction it is
    - The more privileged the concept is in aggregate, the more natural the abstraction is
- Most efficient representations of our universe contain a simple embedding of the concept
    - [The simpler the embeddings of the concept are]/[the easier it is to point to embeddings of the concept] in aggregate^{[3]}, the more natural the abstraction is
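To make the aggregation picture concrete, one could formalise naturalness as a weighted sum. This is a sketch on my part; the symbols $\mathrm{priv}_a$ and $\mathrm{eff}(a)$ are my labels, not anything from the post:

```latex
\mathrm{naturalness}(c) \;=\; \sum_{a \in \mathcal{A}} w_a \, \mathrm{priv}_a(c),
\qquad w_a \propto \frac{1}{\mathrm{eff}(a)}
```

where $\mathcal{A}$ is the class of learning algorithms that can efficiently learn our universe, $\mathrm{priv}_a(c)$ measures how privileged concept $c$ is under $a$'s inductive biases, and $\mathrm{eff}(a)$ measures how efficiently $a$ learns our universe (the weighting directly transcribes "inversely proportional to how efficiently they learn our universe" above).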

It follows that the most natural abstractions for a concept (cluster) are the abstractions we'd expect AI systems to actually use, i.e. the abstractions most relevant to their decision making.

# What's The Most Natural Abstraction for Human Values?

Consider the subset of "human values" that we'd be "happy" (were we fully informed) for powerful systems to optimise for.

[Weaker version: "the subset of human values that it is existentially safe for powerful systems to optimise for".]

Let's call this subset "ideal values".

I'd guess that the "most natural" abstraction of values isn't "ideal values" themselves but something like "the minimal latents of ideal values".
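For concreteness, here is my (possibly mistaken) reading of the minimal-latent condition, in the sense of Wentworth's minimal latents: given variables $X_1, \dots, X_n$ (say, many agents' or cultures' instantiations of ideal values), a latent $\Lambda$ should satisfy

```latex
P(X_1, \dots, X_n \mid \Lambda) \;=\; \prod_{i=1}^{n} P(X_i \mid \Lambda)
```

i.e. the $X_i$ are mutually independent given $\Lambda$, and among such latents $\Lambda$ should carry as little information as possible, e.g. minimising $I(X_1, \dots, X_n; \Lambda)$. Intuitively, $\Lambda$ captures exactly the structure shared across the $X_i$ and nothing else.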

# Conclusions

I don't think the multitude of abstractions of human values is necessarily as big a stumbling block as Steiner posited.

^{^}"Learning our universe" means learning a map/world model of our universe that allows effectively predicting future events.

The constraint of efficiency suggests that the learned world model (and inferences via it) should have low data, time, and space complexity (relative to what is attainable for optimal learning algorithms).

^{^}When aggregating across learning algorithms we might want to give algorithms weights that are inversely proportional to how efficient they are in learning our universe.

^{^}When aggregating across representations of our universe we might want to give representations weights that are inversely proportional to how efficiently they represent our universe.

I don't follow. It seems like you agree that the human values we would like to have optimised are not a natural abstraction, and that instead some different concept would be learned if one attempted to learn human values as a natural abstraction. This seems like a stumbling block, but then your conclusion immediately says that it isn't a stumbling block.

I'm kind of tipsy so maybe I'm missing something.

The most natural abstraction isn't any specific model of human values, but a minimal model that captures what they have in common.

What can one use this minimal model for?

The minimal model may be the model that most agents performing unsupervised learning on human-generated data learn.

Alternatively, most other models imply the minimal model.

This tells us how we get that model but not what one can use it for.

I think using such a model as an optimisation target would be existentially safe.