A single principle related to many Alignment subproblems?
I want to show a philosophical principle which, I believe, has implications for many alignment subproblems. If the principle is valid, it might allow us to

* connect the study of abstractions, Shard Theory and Mechanistic Anomaly Detection;
* obtain multiple potential solutions to the Eliciting Latent Knowledge problem;
* obtain a bounded solution to outer and inner alignment. (I mean Task-directed AGI level of outer alignment.)

This post clarifies and expands on ideas from here and here. Reading the previous posts is not required.

The Principle

The principle and its most important consequences:

1. By default, humans only care about variables they could (in principle) easily optimize or comprehend.[1] While the true laws of physics can be arbitrarily complicated, the behavior of variables humans care about can't be arbitrarily complicated.
2. Easiness of optimization/comprehension can be captured by a few relatively simple mathematical properties (X).[2] Those properties can describe explicit and implicit predictions about the world.
3. We can split all variables (potentially relevant to human values) into partially arbitrary classes, based on how many X properties they have: the most optimizable/comprehensible variables (V1), less optimizable/comprehensible variables (V2), even less optimizable/comprehensible variables (V3), etc. We can do this without abrupt jumps in complexity or empty classes. The less optimizable/comprehensible the variables are, the more predictive power they might have (since they're less constrained).

Justification:

* If something is too hard to optimize/comprehend, people couldn't possibly have optimized/comprehended it in the past, so it couldn't have been a part of human values.
* New human values are always based on old human values. If people start caring about something which is hard to optimize/comprehend, it's because that "something" is similar to things which are easier to optimize/comprehend.[3] Human values are recursive, in some se
I think he already came to some conclusions and you already gave some good references (which support some of the conclusions).
Do those methods have names or address problems which have names (like the Byzantine generals problem)?