Clickbait: Normative value has high algorithmic complexity: there's no simple way to describe the goals we want AIs to want.
Summary: The proposition that there's no algorithmically simple object-level goal we can give to an advanced AI that yields a future of high value. Or: Any formally simple goal that talks directly about what sort of world to create will, if given to an AI, produce disaster. Or: If you're trying to talk directly about what events or states of the world you want, then any programmatically simple utility function, of the sort a programmer could reasonably hardcode, will lead to a bad end. Contrast this with, e.g., an induction rule that can learn very complicated classification rules from labeled instances, or a preference framework that explicitly models humans in order to learn complicated facts about what humans want.
We can see Complexity of Value as being implied by three subpropositions:
The first proposition, Intrinsic Complexity of Value, says that the properties we want AIs to achieve - whatever stands in for the metasyntactic variable 'value' - carry a large amount of intrinsic information, in the sense of comprising a large number of independent facts that aren't generated by any single computationally simple rule.
A very bad example that may nonetheless provide an important intuition is to imagine trying to pinpoint to an AI what constitutes 'worthwhile happiness'. The AI suggests a universe tiled with tiny Q-learning algorithms receiving high rewards. After some explanation and several labeled datasets, the AI suggests a human brain with a wire stuck into its pleasure center. After further explanation, the AI suggests a human in a holodeck. You begin talking about the importance of believing truly, and how your values call for apparent human relationships to be real relationships rather than hallucinations. The AI asks you what constitutes a good human relationship to be happy about. This series of questions occurs because (arguendo) the AI keeps running into questions whose answers are not AI-obvious from the answers already given: each involves some new thing you want, whose desirability wasn't implied by your previous answers. The upshot is that the specification of 'worthwhile happiness' involves a long series of facts not reducible to the facts already stated, and some of your preferences may turn on fine details of surprising importance. In other words, specifying 'worthwhile happiness' by hand would be at least as hard as hand-coding a formal rule that recognizes which pictures contain cats. (I.e., impossible.)
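To make the cat analogy concrete, here is a deliberately hopeless sketch of the sort of hand-coded rule a programmer could actually write down. The pixel thresholds are invented purely for illustration and nothing here reflects how real classifiers work, which is exactly the point:

```python
import numpy as np

def looks_like_cat(image: np.ndarray) -> bool:
    """Deliberately crude hand-coded 'cat detector'.

    image: H x W x 3 array of RGB values in [0, 255]. The thresholds below are
    invented for illustration; no such simple rule actually recognizes cats.
    """
    rgb = image.astype(int)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Count pixels falling in a rough 'fur-colored' band.
    furlike = (r > 90) & (r < 210) & (np.abs(r - g) < 40) & (b < r)
    return furlike.mean() > 0.3  # call it a cat if 30% of pixels are 'fur-colored'
```

This rule misses black cats and white cats and approves sand dunes; each patch adds another independent fact, and the pile of patches never converges on 'cat'. Hand-specifying 'worthwhile happiness' runs into the same wall, with far higher stakes.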
The second proposition is Incompressibility of Value, which says that attempts to reduce these complex values to some incredibly simple and elegant principle fail (much like early attempts, e.g. by Bentham, to reduce all human value to pleasure), and that no simple instruction given to an AI will happen to target outcomes of high value either. The core a priori reason to expect all such attempts to fail is that most 1000-byte strings aren't compressible down to some incredibly simple pattern, no matter how many clever tricks you throw at them: fewer than 1 in 1024 such strings can even be compressed by 10 bits, fewer than 1 in 2^80 can be compressed to 990 bytes, never mind down to 10 bytes. Because there are so many different proposals for why some simple instruction to an AI should end up achieving high-value outcomes, or why all human value can be reduced to some simple principle, there is no single central demonstration that all these proposals must fail; but there is a sense in which, a priori, we should strongly expect all such clever attempts to fail.
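The counting argument behind that expectation can be made explicit. The sketch below (an illustration of the standard argument, not part of the original text) just counts programs: there are fewer than 2^k programs shorter than k bits, against 2^n strings of length n bits.

```python
def log2_fraction_compressible(n_bits: int, k_bits: int) -> int:
    """log2 of an upper bound on the fraction of n-bit strings describable in < k_bits.

    Fewer than 2^k programs are shorter than k bits, versus 2^n strings of
    length n, so at most a 2^(k - n) fraction of them can be compressed that far.
    """
    return k_bits - n_bits

n = 1000 * 8                                    # a 1000-byte value specification, in bits
print(log2_fraction_compressible(n, n - 10))    # -10:   fewer than 1 in 2^10 = 1024
print(log2_fraction_compressible(n, 990 * 8))   # -80:   fewer than 1 in 2^80
print(log2_fraction_compressible(n, 10 * 8))    # -7920: fewer than 1 in 2^7920
```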
The third proposition is Fragility of Value, which says that if you have a 1000-byte exact specification of worthwhile happiness and you begin to mutate it, the value created by an AI running the mutated definition falls off rapidly. An AI with only 950 bytes of the full definition may end up creating 5%, 0%, or negative value, rather than 95% of the value. (For instance, the AI understood all aspects of what makes for a life well-lived... except the part about requiring a conscious observer to experience it.)
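A toy model (purely illustrative, and not part of the original argument) captures why partial specifications don't yield proportional value: if value is conjunctive over many independent criteria, and a strong optimizer tramples any criterion its specification omits, then 95% of the specification buys roughly 0% of the value.

```python
def achieved_value(criteria_known: int, total_criteria: int = 1000) -> float:
    """Toy conjunctive model of Fragility of Value.

    Assume each omitted criterion is actively violated when the AI optimizes
    hard on the rest, and that a single violated criterion zeroes out (or worse)
    the value of the resulting outcome.
    """
    omitted = total_criteria - criteria_known
    return 1.0 if omitted == 0 else 0.0

print(achieved_value(1000))  # 1.0 -- the full specification
print(achieved_value(950))   # 0.0 -- 95% of the spec does not buy 95% of the value
```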
Together, these propositions would imply that there is no simple hand-coded object-level goal which, given to the AI, results in the realization of an adequate amount of value (e.g. 90% of potential value, or even 20% of potential value). E.g., you can't just tell it to 'maximize happiness', with some hand-coded rule for identifying happiness.
Complexity of Value is an extremely central proposition in value alignment theory. Many anticipated difficulties revolve around it:
More generally:
Many policy questions strongly depend on Complexity of Value, mostly having to do with the overall difficulty of developing value-aligned AI, e.g.:
It has been argued that psychological biases and popular mistakes lead to beliefs that directly, or by implication, deny Complexity of Value. To the extent one credits that Complexity of Value is probably true, one should arguably be concerned by the number of early assessments of the value alignment problem that seem to rely on its being false (e.g., assuming we just need to hardcode a particular goal into the AI, or in general treating the value alignment problem as not panic-worthily difficult).
The Complexity of Value proposition is true if, relative to viable and acceptable real-world Methodologies for AI development, there is no reliably knowable way to specify the AI's object-level preferences as a structure of low algorithmic complexity such that running that AI achieves Enough of the possible value, for reasonable and interpersonally persuasive definitions of value.
The caveats above are spelled out below.
Suppose there turns out to exist, in principle, a relatively simple Turing machine (e.g. 100 states) that picks out 'value' by re-running entire evolutionary histories, creating and discarding a hundred billion sapient races in order to pick out one that ended up relatively similar to humanity. Such a machine would be 'simple' in the algorithmic sense, but it would not count against the proposition as stated above: running it would both consume a very large amount of computing power and commit an unacceptable amount of mindcrime, so it is not a viable and acceptable real-world methodology.