I want to present a philosophical principle which, I believe, has implications for many alignment subproblems. If the principle is valid, it might give us a unified way of attacking those subproblems.

This post clarifies and expands on ideas from here and here. Reading the previous posts is not required.

The Principle

The principle and its most important consequences:

  1. By default, humans only care about variables they could (in principle) easily optimize or comprehend.[1] While the true laws of physics can be arbitrarily complicated, the behavior of variables humans care about can't be arbitrarily complicated.
  2. Easiness of optimization/comprehension can be captured by a few relatively simple mathematical properties (call this set of properties $P$).[2] Those properties can describe explicit and implicit predictions about the world.
  3. We can split all variables (potentially relevant to human values) into partially arbitrary classes, based on how many of the $P$ properties they have: the most optimizable/comprehensible variables ($V_1$), less optimizable/comprehensible variables ($V_2$), even less optimizable/comprehensible variables ($V_3$), etc. We can do this without abrupt jumps in complexity or empty classes. The less optimizable/comprehensible the variables are, the more predictive power they might have (since they're less constrained).

Justification:

  • If something is too hard to optimize/comprehend, people couldn't possibly have optimized/comprehended it in the past, so it couldn't have become a part of human values.
  • New human values are always based on old human values. If people start caring about something which is hard to optimize/comprehend, it's because that "something" is similar to things which are easier to optimize/comprehend.[3] Human values are recursive, in some sense, flowing from simpler (versions of) values to more complicated (versions of) values. This can be seen from pure introspection, without reference to human history.
  • Therefore, if something is hard to optimize/comprehend, it's unlikely to be a part of current human values (unless it's connected to something simpler), even if humans currently have the means to optimize/comprehend it.

There are value systems for which the principle is false. In that sense, it's empirical. However, I argue that it's a priori true for humans, no matter how wrong our beliefs about the world are. So the principle is not supposed to be an "assumption" or "hypothesis", like e.g. the Natural Abstraction hypothesis.

You can find a more detailed explanation of the principle in the appendix.

Formalization

How do we define easiness of comprehension? We choose variables describing our sensory data. We choose what properties ($P$) of those variables count as "easily comprehensible". Now, for any variable we consider (observable or latent), we check how well its behavior fits those properties. We can order all variables from the most comprehensible to the least comprehensible ($V_1, V_2, \dots, V_n$).

Let's give a specific example of $P$ properties. Imagine a green ball in your visual field. What properties would make this stimulus easier to comprehend? Continuous movement (the ball doesn't teleport from place to place), smooth movement (the ball doesn't abruptly change direction), low speed (the ball doesn't change too fast compared to other stimuli), low numerosity (the ball doesn't have countless distinct parts). Those are the kinds of properties we need to abstract and capture when defining $P$.

How do we define easiness of optimization? Some of the ordered variables describe actions, themselves ordered from the most comprehensible to the least comprehensible actions. We can check whether changes in an action variable are correlated with changes in a target variable: if yes, that action can optimize the target. Easiness of optimization is then given by the index of the simplest action variable correlated with the target. This is an incomplete definition, but it conveys the main idea.
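To make this slightly more concrete, here's a minimal toy sketch of both definitions, assuming variables are represented as 1-D time series. The specific scoring formulas, the 0.5 correlation threshold, and all names are illustrative placeholders, not the actual formalization:

```python
import numpy as np

def comprehensibility_score(trajectory, dt=1.0):
    """Toy P-score for a 1-D trajectory: penalize jumps (discontinuity),
    abrupt turns (non-smoothness) and high average speed.
    Higher score = easier to comprehend."""
    x = np.asarray(trajectory, dtype=float)
    velocity = np.diff(x) / dt
    acceleration = np.diff(velocity) / dt
    jumpiness = np.abs(velocity).max() if len(velocity) else 0.0       # teleport-like jumps
    roughness = np.abs(acceleration).mean() if len(acceleration) else 0.0
    speed = np.abs(velocity).mean() if len(velocity) else 0.0
    return 1.0 / (1.0 + jumpiness + roughness + speed)

def optimization_ease(target, action_variables):
    """Toy optimization ease: the index of the simplest action variable whose
    changes correlate with changes in the target (lower index = easier to optimize).
    Returns None if no listed action variable is correlated with the target."""
    dy = np.diff(np.asarray(target, dtype=float))
    for index, action in enumerate(action_variables):  # ordered simplest -> hardest
        da = np.diff(np.asarray(action, dtype=float))
        if len(da) == len(dy) and abs(np.corrcoef(da, dy)[0, 1]) > 0.5:
            return index
    return None

# A smoothly moving "ball" scores higher than a pseudorandom value.
smooth_ball = np.linspace(0.0, 10.0, 100)
prng_value = np.random.default_rng(0).uniform(size=100)
assert comprehensibility_score(smooth_ball) > comprehensibility_score(prng_value)
```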

Formalizing all of this precisely won't be trivial. But I'll give intuition pumps for why it's a very general idea which doesn't require getting anything exactly right on the first try.

Example: our universe

Consider these things:

  1. Quantum entanglements between atoms of diamonds and atoms of other objects.
  2. Individual atoms inside of diamonds.
  3. Clouds of atoms which we call "diamonds".
  4. Individual photons bouncing off of diamonds.
  5. Images of diamonds.

According to the principle:

  • People are most likely to care about 5 or 3. Because those things are the easiest to optimize/comprehend.
  • People are less likely to care about 2. Individual atoms are very small, so they're harder to interact with.
  • People are even less likely to care about 4. Those things are very fast (relative to the speed at which humans do things), so they're harder to interact with and comprehend. However, ignoring speed, photons are pretty comprehensible (unless we're venturing into quantum mechanics).
  • People are the least likely to care about 1. Quantum entanglements are the hardest to optimize/comprehend. The logic of the quantum world is too different from the logic of the macro world.

This makes sense. Why would we rebind our utility to something which we couldn't meaningfully interact with, perceive or understand previously?

Example: Super Mario Bros.

Let's see how the principle applies to a universe very different from our own. A universe called Super Mario Bros.

When playing the game, it's natural to ask: what variables (observable or latent) change in ways which are the easiest to comprehend? Which of those changes are correlated with simple actions of the playable character or with simple inputs?

Let's compare a couple of things from the game:

  1. The variable corresponding to Mario's position (physical or visual).
  2. The variable which connects a pixel-precise event in one place to the subpixel position of an object in a completely different place. See this video explanation, from 12:05.
  3. The current value of the game's pseudorandom number generator.

According to the principle:

  • We're most likely to care about 1. It changes continuously, smoothly and slowly, which makes it very easy to comprehend. It's also correlated with simple inputs in a simple way, which makes it easy to optimize.
  • We're less likely to care about 2. It involves "spooky action at a distance" and deals with unusually precise measurements, so it's harder to comprehend. It's also harder to optimize, since it requires a pixel-precise action.
  • We're the least likely to care about 3. It changes discontinuously and very fast, so it's hard to comprehend. Also, it's not correlated with most of the player's input.[4]

This makes sense. If you care about playing the game, it's hard to care about things which are tangential or detrimental to the main gameplay.

Example: reverse-engineering programs with a memory scanner

There's a fun and simple way to hack computer programs, based on searching and filtering variables stored in a program's memory.

For example, do you want to get infinite lives in a video game? Then do this:

  • Take all variables which the game stored in the memory.
  • Lose a life. Filter out all variables which haven't decreased.
  • Lose another life. Filter out all variables which haven't decreased.
  • Gain a life. Filter out all variables which haven't increased.
  • Don't lose or gain a life. Filter out all variables which changed.
  • And so on, until you're left with a small number of variables.

Oftentimes you'll end up with at least two variables: one controlling the actual number of lives and the other controlling the number of lives displayed on the screen. Here's a couple of tutorial videos about this type of hacking: Cheat Engine for Idiots, Unlocking the Secrets of my Favorite Childhood Game.
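For concreteness, here is a minimal sketch of the filtering procedure, assuming we can dump the program's memory into address-to-value dictionaries (the dumping mechanism itself is out of scope, and all names are illustrative):

```python
def filter_candidates(candidates, old_snapshot, new_snapshot, predicate):
    """Keep only the addresses whose old -> new change satisfies the predicate.
    Assumes every snapshot covers the same set of addresses."""
    return {addr for addr in candidates
            if predicate(old_snapshot[addr], new_snapshot[addr])}

def find_lives_variable(snapshots):
    """Hypothetical usage: snapshots[i] is a dict {address: value} dumped after
    the i-th event (lose a life, lose a life, gain a life, do nothing)."""
    candidates = set(snapshots[0])
    candidates = filter_candidates(candidates, snapshots[0], snapshots[1],
                                   lambda old, new: new < old)    # lost a life
    candidates = filter_candidates(candidates, snapshots[1], snapshots[2],
                                   lambda old, new: new < old)    # lost another life
    candidates = filter_candidates(candidates, snapshots[2], snapshots[3],
                                   lambda old, new: new > old)    # gained a life
    candidates = filter_candidates(candidates, snapshots[3], snapshots[4],
                                   lambda old, new: new == old)   # nothing happened
    return candidates  # often two addresses: the real lives counter and the displayed one
```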

It's a very general approach to reverse engineering a program. And the idea behind my principle is that the variables humans care about can be discovered in a similar way, by filtering out all variables which don't change according to certain simple rules.

If you still struggle to understand what "easiness of optimization/comprehension" means, check out the additional examples in the appendix.

Philosophy: Anti-Copernican revolutions

(This is a vague philosophical point intended to explain what kind of "move" I'm trying to make by introducing my principle.)

There are Copernican revolutions and Anti-Copernican revolutions.

Copernican revolutions say "external things matter more than our perspective". The actual Copernican revolution is an example.

Anti-Copernican revolutions say "our perspective matters more than external things". The anthropic principle is an example: instead of asking "why are we lucky to have this universe?" we ask "why is this universe lucky to have us?". What Immanuel Kant called his "Copernican revolution" is another example: instead of saying "mental representations should conform to external objects" he said "external objects should conform to mental representations".[5] Arguably, Policy Alignment is also an example ("human beliefs, even if flawed, are more important than AI's galaxy-brained beliefs").

With my principle, I'm trying to make an Anti-Copernican revolution too. My observation is the following: for our abstractions to be grounded in anything at all, reality has to have certain properties — therefore, we can deduce properties of reality from introspective information about our abstractions.

Visual illustration

Picture a green bubble containing all aspects of reality humans can optimize or comprehend. It's a cradle of simplicity in a potentially infinite sea of complexity. The core of the bubble is $V_1$, the outer layer is $V_n$.

The outer layer contains, among other things, the last theory of physics which has some intuitive sense. The rest of the universe, not captured by the theory, is basically just "noise".

We care about the internal structure of the bubble (its internals are humanly comprehensible concepts). We don't care about the internal structure of the "noise". Though we do care about predicting the noise, since the noise might accidentally accumulate into a catastrophic event.

The bubble has a couple of nice properties. It's humanly comprehensible and it has a gradual progression from easier concepts to harder concepts (just like in school). We know that the bubble exists, no matter how wrong our beliefs are. Because if it doesn't exist, then all our values are incoherent and the world is incomprehensible or uncontrollable.

Note that the bubble model applies to 5 somewhat independent things: the laws of physics, ethics, cognition & natural language, conscious experience, and mathematics.

Philosophy: a new simplicity measure

Idea 1. Is "objects that are easy to manipulate with the hand" a natural abstraction? I don't know. But imagine I build an AI with a mechanical hand. Now it should be a natural abstraction for the AI, because "manipulating objects with the hand" is one of the simplest actions the AI can perform. This suggests that it would be nice to have an AI which interprets reality in terms of the simplest actions it can take. Because it would allow us to build a common ontology between humans and the AI.

Idea 2. The simplest explanation of the reward is often unsafe because it's "too smart". If you teach a dumber AI to recognize dogs, it might learn the shape and texture of a dog; meanwhile a superintelligent AI will learn a detailed model of the training process and Goodhart it. This suggests that it would be nice to have an AI which doesn't just search for the simplest explanation with all of its intelligence, but looks for the simplest explanations at different levels of intelligence — and is biased towards "simpler and dumber" explanations.

The principle combines both of those ideas and gives them additional justification. It's a new measure of simplicity.


Human Abstractions

Here I explain how the principle relates to the following problems: the pointers problem; the diamond maximizer problem; environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; eliciting latent knowledge.

According to the principle, we can order all variables by how easy they are to optimize/comprehend ($V_1, V_2, \dots, V_n$). We can do this without abrupt jumps in complexity or empty classes. $V_2$ can have greater predictive power than $V_1$, because it has fewer constraints.

That implies the following:

  1. We can search for a world-model consisting of $V_1$ variables. Then search for a world-model consisting of $V_2$ variables and having greater predictive power. Then search for a world-model consisting of $V_3$ variables and having even greater predictive power. Etc. (A toy sketch of this search loop follows the list.)
  2. As a result, we get a sequence of easily interpretable models which model the world on multiple levels. We can use it to make AI care about specific physical objects humans care about (e.g. diamonds). We can even automate this process, to an extent.
  3. There might be aspects of our universe not described by any $V_i$. Those aspects aren't really relevant to defining what we care about. They're basically just pseudorandom "noise" which feeds into the $V_i$ variables. We need a good model of that noise, but we don't care about its internals anymore.[6]
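Here is a rough sketch of that search loop (step 1); `search_models` and `predictive_power` are placeholder functions standing in for whatever model-search and scoring machinery would actually be used:

```python
def build_model_sequence(variable_classes, observations, search_models, predictive_power):
    """Sketch of the multi-level search: one world-model per class V1, V2, ..., Vn,
    where each model is built from that class's variables and must beat the
    previous (simpler) model's predictive power."""
    models = []
    best_power = float("-inf")
    for v_class in variable_classes:           # V1, V2, ..., ordered by complexity
        candidates = search_models(v_class, observations)       # placeholder search
        candidates = [m for m in candidates
                      if predictive_power(m, observations) > best_power]
        if not candidates:
            break                              # the richer class adds nothing; stop here
        best = max(candidates, key=lambda m: predictive_power(m, observations))
        best_power = predictive_power(best, observations)
        models.append(best)
    return models  # an interpretable "ladder" of models, from coarse to fine
```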

The natural abstraction hypothesis says that (...) a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world. (from Testing The Natural Abstraction Hypothesis: Project Intro)

This claim says that this ability to learn natural abstractions applies more broadly: general-purpose cognitive systems (like humans or AGI) can in principle learn all natural abstractions. (...) This claim says that humans and ML models are part of the large class of cognitive systems that learn to use natural abstractions. Note that there is no claim to the converse: not all natural abstractions are used by humans. But given claim 1c, once we do encounter the thing described by some natural abstraction we currently don't use, we will pick up that natural abstraction too, unless it is too complex for our brain. (from Natural Abstractions: Key claims, Theorems, and Critiques)

If NAH is true, referents of human concepts have relatively simple definitions.

However, my principle implies that referents of human concepts have a relatively simple definition even if human concepts are not universal (i.e. it's not true that "a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world").

One article by Eliezer Yudkowsky kinda implies that there could be a language for describing any possible universe on multiple levels, a language in which defining basic human goals would be pretty easy (no matter what kind of universe humans live in):

Given some transparent prior, there would exist a further problem of how to actually bind a preference framework to that prior. One possible contributing method for pinpointing an environmental property could be if we understand the prior well enough to understand what the described object ought to look like — the equivalent of being able to search for ‘things W made of six smaller things X near six smaller things Y and six smaller things Z, that are bound by shared Xs to four similar things W in a tetrahedral structure’ in order to identify carbon atoms and diamond. (from Ontology identification problem: Matching environmental categories to descriptive constraints)

But why would such a language be feasible to figure out? It seems like creating it could require considering countless possible universes.

My principle explains "why" and proposes a relatively feasible method of creating it.

The predictor might internally represent the world in such a way that the underlying state of the world is not a continuous function of its activations. For example, the predictor might describe the world by a set of sentences, for which syntactically small changes (like inserting the word “not”) could correspond to big changes in the underlying state of the world. When the predictor has this structure, the direct translator is highly discontinuous and it is easy for human simulators to be closer to continuous.

We might try to fix this by asking the predictor to learn a “more continuous” representation, e.g. a representation such that observations are a continuous function or such that time evolution is continuous. One problem is that it’s unclear whether such a continuous parametrization even exists in general. But a more straightforward problem is that when evaluated quantitatively these approaches don’t seem to address the problem, because the properties we might try to use to enforce continuity can themselves be discontinuous functions of the underlying latent state. (from ELK prize results, Counterexample: the predictor’s latent space may not be continuous)

The principle could be used to prove that the properties for enforcing continuity can't themselves be discontinuous functions of the underlying latent state (unless something really weird is going on, in which case humans should be alerted), provided we use the $P$ properties to define "continuity".

A proposal by Derek Shiller, Beth Barnes, Nate Thomas, and Oam Patel:

Rather than trying to learn a reporter for a complex and alien predictor, we could learn a sequence of gradually more complex predictors $M_1, M_2, \dots, M_n$ with corresponding reporters $R_1, R_2, \dots, R_n$. Then instead of encouraging $R_{i+1}$ to be simple, we can encourage the difference between $R_i$ and $R_{i+1}$ to be simple.

(...) Intuitively, the main problem with this proposal is that there might be multiple fundamentally different ways to predict the world, and that we can’t force the reporter to change continuously across those boundaries. (from ELK prize results, Strategy: train a sequence of reporters for successively more powerful predictors)

The principle could be used to prove that we can force predictors to not be "fundamentally different" from each other, so we can force the reporter to change continuously.


Low Impact

Here I explain how the principle relates to Impact Regularization.

Allowed consequences

When we say “paint all cars pink” or “cure cancer” there’s some implicit set of consequences that we think are allowable and should definitely not be prevented, such as people noticing that their cars are pink, or planetary death rates dropping. We don’t want the AI trying to obscure people’s vision so they can’t notice the car is pink, and we don’t want the AI killing a corresponding number of people to level the planetary death rate. We don’t want these bad offsetting actions which would avert the consequences that were the point of the plan in the first place. (from Low impact: Allowed consequences vs. offset actions)

We can order all variables by how easy they are to optimize/comprehend ($V_1, V_2, \dots, V_n$). We could use this ordering to differentiate between "impacts explainable by coarser-grained variables ($V_1$)" and "impacts explainable only by finer-grained variables ($V_2$)". According to the principle, the latter impacts are undesirable by default. For example (a toy sketch follows the list):

  • It's OK to make cars pink by using paint ("spots of paint" is an easier to optimize/comprehend variable). It's not OK to make cars pink by manipulating individual water droplets in the air to create an elaborate rainbow-like illusion ("individual water droplets" is a harder to optimize/comprehend variable).
  • It's OK if a person's mental state changes because they notice a pink car ("human object recognition" is an easier to optimize/comprehend process). It's not OK if a person's mental state changes because the pink car has weird subliminal effects on the human psyche ("weird subliminal effects on the human psyche" is a harder to optimize/comprehend process).
  • It's OK to minimize the amount of paint drops while executing the task ("paint drops" is an easier to optimize/comprehend variable). It's not OK to minimize how the execution of the task affects individual atoms ("individual atoms" is a harder to optimize/comprehend variable).
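As a toy sketch of this default rule: a change counts as allowed if the coarse-grained model (built from $V_1$-style variables) already predicts it, and is flagged otherwise. The `coarse_model.predict` interface is a placeholder; in practice, deciding what counts as "explainable by $V_1$" is the hard part.

```python
def classify_impacts(observed_changes, coarse_model, tolerance=1e-3):
    """Split observed changes into allowed (predicted by the coarse-grained model)
    and suspicious (only explainable by finer-grained variables)."""
    allowed, suspicious = [], []
    for variable, actual_change in observed_changes.items():
        predicted_change = coarse_model.predict(variable)   # placeholder interface
        if abs(actual_change - predicted_change) <= tolerance:
            allowed.append(variable)      # e.g. the car's color changed via paint
        else:
            suspicious.append(variable)   # e.g. an effect routed through V2-level tricks
    return allowed, suspicious
```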

Sources of impact

Some hard to optimize/comprehend variables ($V_2$) are "contained" within easy to optimize/comprehend variables ($V_1$). For example:

  • It's hard to comprehend the decision process of a human chess player. However, the way a human interacts with chess pieces is easy to comprehend.
  • The exact effects of the common flu on the human body are relatively hard to comprehend, because the human body is very complex. However, understanding how the virus spreads (an infected person breathing, talking, coughing, or sneezing near you) and diagnosing it is much easier.

We could use this fact to search for unusual sources of complicated impacts.


Goodhart's Curse

According to the principle, we can order all variables by how easy they are to optimize/comprehend ($V_1, V_2, \dots, V_n$). If we could map the variables inside an AI to this order, we could measure how much optimization the AI does, which is relevant to the Goodhart problem.

The intuition here is that to become substantially better at optimizing something, you need to perform progressively more complicated actions and take into account progressively more incomprehensible considerations; you can become better by just trying different combinations of the same kind of actions/considerations, but then you're fundamentally limited. However, there's also a stronger claim:

An agent exploiting a certain class of variables ($V_1$) can't game an outer objective unless something starts exploiting less optimizable/comprehensible variables ($V_2$), which the agent can notice or stop. This assumes the outer objective is chosen with minimal competence and the agent's environment is somewhat stable.

Example: collecting stones

Imagine a superintelligent robot tasked to place stones into a pile. A couple of cameras observe the robot. "Being seen collecting stones into a pile" is the outer objective. "Actually collecting stones into a pile" is the intent behind the outer objective.

The robot could get reward by going around and grabbing stones (the robot's and the stones' movement can be described by $V_1$ variables). The robot could also hack the cameras to produce fake video, but that would require manipulating variables which are substantially harder to optimize/comprehend ($V_2$). If the robot sticks to manipulating $V_1$, everything should be fine.

Another agent could hack the cameras. Less likely, the environment itself could transform into a state where the cameras are hacked. But any of that would imply that $V_2$ variables have changed in a way directly related to how $V_1$ variables can optimize the outer objective. (If the cameras are hacked, collecting real stones suddenly becomes completely useless for optimizing the outer objective.) The robot can report or stop that.

In some parts of the environment, getting stones could be as hard as hacking the cameras. For example, maybe we need to hack the enemy's cameras to steal their stones. In such a case we could whitelist exploiting $V_2$ variables there. The robot can ensure that $V_2$ optimization doesn't "spill" into the rest of the world.
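Here is a toy sketch of the detection logic in this example. All names are hypothetical, and in reality the $V_1$/$V_2$ split and the baselines would come from the formalized principle rather than being handed to us:

```python
def check_for_tampering(observed_reward, expected_reward_from_v1, v2_state, v2_baseline):
    """The robot itself only manipulates V1 variables (its own movement, the stones).
    If V2 variables (e.g. the cameras' internals) drift from their baseline AND
    collecting real stones stops producing the expected reward, something like
    camera hacking has probably happened; the robot should report it rather than
    adapt to the new reward channel."""
    v2_changed = v2_state != v2_baseline
    v1_strategy_broken = observed_reward < expected_reward_from_v1
    if v2_changed and v1_strategy_broken:
        return "report_anomaly"
    return "keep_collecting_stones"
```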

Example: Speedrunning

Imagine you measure "how good one can play a video game" (the intent) by "the speed of completing the game" (the outer objective).

This outer objective can be Goodharted with glitches (anomalously fast movement, teleportation, invincibility, getting score points out of nothing). However, at least some of the following will be true:

  • The glitches violate the normal properties of the game's variables (the $P$ properties), for example by giving the player abnormal speed.
  • Exploiting those glitches requires the player to perform actions which can only be explained by more complicated variables ($V_2$). Hitting a wall multiple times to trigger a glitch, for example (you'd predict this action to be a useless waste of time unless you model the game with $V_2$ variables).
  • If you vary how the player optimizes $V_1$ variables, the outcome changes in a way not explainable by $V_1$ variables, because there's a specific sequence of individually innocent actions which triggers a glitch accidentally.

If the player sticks to $V_1$, Goodharting the outer objective is impossible. But expert performance is still possible.

An agent which desperately and monomaniacally wants to optimize the mathematical (plan/state/trajectory) $\to$ (evaluation) "grader" function is not aligned to the goals we had in mind when specifying/training the grader (e.g. "make diamonds"), the agent is aligned to the evaluations of the grader (e.g. "a smart person's best guess as to how many diamonds a plan leads to").

I believe the point of "Don't align agents to evaluations of plans" can be reformulated as:

Make agents terminally value easy to optimize/comprehend variables ($V_1$), so they won't Goodhart by manipulating hard to optimize/comprehend variables ($V_2$).

My principle supports this point.

More broadly, a big aspect of Shard Theory can be reformulated as:

Early in training, Reinforcement Learning agents learn to terminally value easy to optimize/comprehend variables ("shards" are simple computations about simple variables)... that's why they're unlikely to Goodhart their own values by manipulating hard to optimize/comprehend variables.

If Shard Theory is true, the principle should give insight into how shards behave in all RL agents. Because the principle is true for all agents whose intelligence & values develop gradually and who don't completely abandon their past values.

See glider example, strawberry example, Boolean circuit example, diamond example.

I believe the idea of Mechanistic Anomaly Detection can be described like this:

Any model $M$ has "layers of structure" and therefore can be split into versions, ordered from versions with less structure to versions with more structure ($M_1, M_2, \dots, M_n$). When we find the version with the least structure which explains the most instances[7] of a phenomenon we care about, it defines the latent variables we care about.

This is very similar to the principle, but more ambitious (makes stronger claims about all possible models) and more abstract (doesn't leverage even the most basic properties of human values).


Interpretability

Claim A. Say you can comprehend $V_1$ variables, but not $V_2$ variables. You can still understand which $V_2$ variable is the most similar to a given $V_1$ variable (and whether the former causes the latter); whether a change of a $V_2$ variable harms or helps your values (and whether the change is necessary or unnecessary); and whether a $V_2$ variable is contained within a particular part of your world-model or not. According to the principle, this knowledge can be obtained automatically.

Claim B. Take a $V_2$ ontology (which describes real things) and a simpler $V_1$ ontology (which might describe nonexistent things). Whatever the $V_1$ ontology describes, we can automatically check whether there's anything in the $V_2$ ontology that corresponds to it OR whether searching for a correspondence is too costly.

This is relevant to interpretability.

Example: Health

Imagine you can't comprehend how the human body works. Consider these statements by your doctor:

  • "<Something incomprehensible> can kill you."
  • "Changing <something incomprehensible> is necessary to avoid dying."
  • "<Something incomprehensible> is where your body is."
  • "<Something incomprehensible> can destroy your <something incomprehensible>. And the latter is the most similar thing to 'you' and it causes 'you'."
  • "I want to change <something incomprehensible> in your body. I didn't consider if it's necessary."

Despite not understanding it all, you understand everything relevant to your values. For example, from the last statement you understand that the doctor doesn't respect your values.

Now, imagine you're the doctor. You have a very uneducated patient. The patient might say stuff like "inside my body <something> moves from one of my hands to another" or "inside my body <something> keeps expanding below my chest". Whatever they describe, you'll know whether you have a scientific explanation of it OR whether searching for an explanation is too costly.

The above is similar to Ramsification.

Hypothetical ELK proposal

Claims A and B suggest a hypothetical interpretability method. I'll describe it with a metaphor:

  1. Take Einstein. Make a clone of Einstein. Simplify the clone's brain a bit. As a result, you might get a crazy person (a person believing in nonexistent things). Alternatively, you might get a "bizarre" person (a person believing in things which are too costly for Einstein to verify). According to claim B, we can automatically check if any of that happened. If it did happen, we discard the clone and make another one. According to claim A, we can automatically check if Einstein is deceiving his dumber clone. Though claims A and B only apply if we can translate brains into models made of the $V$ variables.
  2. Repeat the previous step recursively, until you end up with a chain of clones from Einstein to a village idiot. Inside that chain, Einstein can't deceive the village idiot.

In this metaphor, Einstein = an incomprehensible AI. Village idiot = an easily interpretable AI. It's like the broken telephone game, except we're fixing broken links.

If some assumptions hold (it's cheap to translate brains into models made of the $V$ variables; describing Einstein's cognition doesn't require variables more complex than $V_n$; producing a non-insane, non-bizarre clone doesn't take forever), the proposal above gives a solution to ELK.
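Here is a toy sketch of the clone-chain construction under those assumptions. `simplify`, `is_insane`, `is_bizarre`, `is_deceived_by`, and the `complexity` attribute are placeholders for the automatic checks that claims A and B are supposed to provide:

```python
MAX_ATTEMPTS = 100  # hypothetical cap on how many bad clones we're willing to discard

def build_clone_chain(model, simplify, is_insane, is_bizarre, is_deceived_by, min_complexity):
    """Build a chain of gradually simpler models ("Einstein -> village idiot"),
    discarding clones that are insane, bizarre, or deceivable by their predecessor."""
    chain = [model]
    current = model
    while current.complexity > min_complexity:
        for _ in range(MAX_ATTEMPTS):
            clone = simplify(current)                 # a slightly dumber copy
            if is_insane(clone) or is_bizarre(clone):
                continue                              # believes nonexistent / too-costly-to-verify things
            if is_deceived_by(clone, current):
                continue                              # the smarter model could fool this clone
            break
        else:
            raise RuntimeError("couldn't produce a sane, non-deceived clone")
        chain.append(clone)
        current = clone
    return chain  # every model in the chain is overseen by the simpler model after it
```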


Building AGI

Consider this:

  • According to the principle, we can order all variables by how easy they are to optimize/comprehend ($V_1, V_2, \dots, V_n$). Those variables can be turned into AI models ($A_1, A_2, \dots, A_n$). $A_1$ is made out of $V_1$ variables. $A_2$ is made out of $V_2$ variables. And so on. Each model can be bounded. The progression of models is "continuous": $A_k$ is similar to $A_{k+1}$.
  • $A_n$ has at least human-level intelligence. It can reason about any concept humans find minimally intuitive.
  • For models with low indices, we can verify their inner and outer alignment.
  • The condition of a model's outer alignment is simple: roughly speaking, the model shouldn't want to maximize reward by optimizing $V_2$ or more complicated variables. As explained here.
  • We can ensure, in an automated way, that $A_{k+1}$ can't deceive $A_k$. As explained here.

If my principle is formalized, we might obtain a bounded solution to outer and inner alignment. (I mean Task-directed AGI level of outer alignment.) Not saying the procedure is gonna be practical.


Appendix

Comparing variables

Here are some additional examples of comparing variables based on their $P$ properties.

Consider what's easier to optimize/comprehend:

  1. a chair or a fly?
  2. a chair or the Empire State Building?
  3. a train or a dog?
  4. anger or hunger?
  5. anger or love?

Here are the answers:

  1. The chair. It's slower and bigger than the fly. Size and speed make the chair easier to track or interact with.
  2. The chair. The Empire State Building is too big and heavy, so it's harder to interact with or observe in detail.
  3. Some aspects of the train are harder to optimize/comprehend (it's very fast and too big). Other aspects of the train are easier to optimize/comprehend (the train doesn't have a brain and runs on a track, so its movement is fundamentally more smooth and predictable; the train's body is fundamentally simpler than the dog's body).
  4. Hunger. It's easier to switch hunger "on" and "off" with simple actions. Anger also leads to more complicated, more varied actions (because it's a more personal, more specific feeling). However, some aspects of anger and hunger can be equally hard to optimize/comprehend.
  5. Anger. It's easier to trigger and increase anger with simple actions. However, some aspects of anger and love can be equally hard to optimize/comprehend.

The analysis above is made from the human perspective and considers "normal" situations (e.g. the situations from the training data, like in MAD).

Inscrutable brain machinery

Some value judgements (e.g. "this movie is good", "this song is beautiful", or even "this is a conscious being") depend on inscrutable brain machinery, the machinery which creates experience. This contradicts the idea that "easiness of optimization/comprehension can be captured by a few relatively simple mathematical properties". But I think this contradiction is not fatal, for the following reason:

We aren't particularly good at remembering exact experiences, we like very different experiences, we can't access each other's experiences, and we have very limited ways of controlling experiences. So, there should be pretty strict limitations on how much understanding of the inscrutable brain machinery is required for respecting the current human values. Therefore, defining corrigible behavior ("don't kill everyone", "don't seek power", "don't mess with human brains") shouldn't require answering many specific, complicated machinery-dependent questions ("what separates good and bad movies?", "what separates good and bad life?", "what separates conscious and unconscious beings?").

A more detailed explanation of the claims behind the principle

Imagine you have a set of some properties ($P$). Let's say it's just one property, "speed". You can assign a speed to any "simple" variable (observable or latent). However, you can combine multiple simple variables, with drastically different speeds, into a single "complex" variable. As a result, you get a potentially infinite-dimensional space $H$ (a complex variable made out of $N$ simple variables is an $N$-dimensional object). You reduce the high-dimensional space $H$ into a low-dimensional space $L$. For simplicity, let's say it's a one-dimensional space. Inevitably, such a reduction requires making arbitrary choices and losing information.

Here are the most important consequences of the principle, along with some additional justifications:

  1. Easiness of optimization/comprehension can be captured by a few relatively simple mathematical properties (let's redefine $P$ as the set of those properties). Those properties can describe explicit and implicit predictions about the world. Justification: there are pretty simple factors which make something harder to optimize or comprehend for humans. Therefore, there should be a relatively simple metric.
  2. The high-dimensional space $H$ can be reduced to the low-dimensional space $L$ without losing too much information. Justification: introspectively, human ability to comprehend and the human action space don't appear very high-dimensional, in the relevant sense. Therefore, human values can only be based on variables for which (2) is true.
  3. $L$ has no large gaps: any variable $x_i$ ($i$ is its coordinate in $L$) has another variable close enough to it. Justification: same as the justification of the principle.
  4. $L$ is "small", in some sense. Variables have a lot of shared structure and/or not all of them are equally important to human values. Justification: if this weren't true, it would be impossible to reflect on your values or try to optimize them holistically. Your values would feel like an endless list of obscure rules you can never remember and follow. By the way, this is another justification for (2).
  5. You can isolate, in some sense, the optimization of more complex variables from the optimization of simpler variables. Justification: if this weren't true, human technology would be incompatible with preserving any pre-technological values.

There are value systems for which claims 1-5 aren't true. In that sense, they're empirical. However, I argue that the claims are a priori true for humans, no matter how wrong our beliefs about the world are.

Some June edits, before 11/06/25: added a little bit of content and made a couple of small edits.
 

  1. ^

    "Could" is important here. You can optimize/comprehend a thing (in principle) even if you aren't aware of its existence. For example: cave people could easily optimize "the amount of stone knives made of quantum waves" without knowing what quantum waves are; you could in principle easily comprehend typical behavior of red-lipped batfishes even if you never decide to actually do it.

  2. ^

    An important objection to this claim involves inscrutable brain machinery.

  3. ^

    For example, I care about subtle forms of pleasure because they're similar to simpler forms of pleasure. I care about more complex notions of "fairness" and "freedom" because they're similar to simpler notions of "fairness" and "freedom". I care about the concept of "real strawberries" because it's similar to the concept of "sensory information about strawberries". Etc.

    Or consider prehistoric people. Even by today's standards, they had a lot of non-trivial positive values ("friendship", "love", "adventure", etc.) and could've easily lived very moral lives, if they avoided violence. Giant advances in knowledge and technology didn't change human values that much. Humans want to have relatively simple lives. Optimizing overly complex variables would make life too chaotic, uncontrollable, and unpleasant.

  4. ^

    Note that it would be pretty natural to care about the existence of the pseudorandom number generator, but "the existence of the PRNG" is a much more comprehensible variable than "the current value of the PRNG".

    Also, as far as I'm aware, Super Mario Bros. doesn't actually have a pseudorandom number generator. But just imagine that it does.

  5. ^

    "Up to now it has been assumed that all our cognition must conform to the objects; but all attempts to find out something about them a priori through concepts that would extend our cognition have, on this presupposition, come to nothing. Hence let us once try whether we do not get farther with the problems of metaphysics by assuming that the objects must conform to our cognition, which would agree better with the requested possibility of an a priori cognition of them, which is to establish something about objects before they are given to us. This would be just like the first thoughts of Copernicus, who, when he did not make good progress in the explanation of the celestial motions if he assumed that the entire celestial host revolves around the observer, tried to see if he might not have greater success if he made the observer revolve and left the stars at rest." (c.) The Critique of Pure Reason, by Immanuel Kant, Bxvi–xviii

  6. ^

    I mean, we do care about that model being inner-aligned. But this is a separate problem. 

  7. ^

    the most instances in training

Comments
TsviBT:

By default, humans only care about variables they could (in principle) easily optimize or comprehend.

I think this is incorrect. I think humans have values which are essentially provisional. In other words, they're based on pointers which are supposed to be impossible to fully dereference. Examples:

  1. Friendship--pointing at another mind, who you never fully comprehend, who can always surprise you--which is part of the point
  2. Boredom / fun--pointing at surprise, novelty, diagonalizing against what you already understand

See my response to David about a very similar topic. Lmk if it's useful.

Basically, I don't think your observation invalidates any ideas from the post.

The main point of the post is that human ability to comprehend should limit what humans can care about. This can't be false. Like, logically. You can't form preferences about things you can't consider. When it looks like humans form preferences about incomprehensible things, they really form preferences only about comprehensible properties of those incomprehensible things. In the post I make an analogy with a pseudorandom number generator: it's one thing to optimize a specific state of the PRNG or want the PRNG to work in a specific way, and another thing to want to preserve the PRNG's current algorithm (whatever it is). The first two goals might be incomprehensible, but the last goal is comprehensible. Caring about friends works in a similar way to caring about a PRNG. (You might dislike this framing for philosophical or moral reasons, that's valid, but it won't make object-level ideas from the post incorrect.)

When it looks like humans form preferences about incomprehensible things, they really form preferences only about comprehensible properties of those incomprehensible things

Then you're not talking about human values, you're talking about [short timescale implementations of values] or something.

I probably disagree. I get the feeling you have an overly demanding definition of "value" which is not necessary for solving corrigibility and a bunch of other problems. Seems like you want to define "value" closer to something like CEV or "caring about the ever-changing semantic essence of human ethical concepts". But even if we talk about those stronger concepts (CEV-like values, essences), I'd argue the dynamic I'm talking about ("human ability to comprehend limits what humans can care about") still applies to them to an important extent.

The issue is that the following is likely true according to me, though controversial:

The type of mind that might kill all humans has to do a bunch of truly novel thinking.

To have our values interface appropriately with these novel thinking patterns in the AI, including through corrigibility, I think we have to work with "values" that are the sort of thing that can refer / be preserved / be transferred across "ontological" changes.

Quoting from https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html:

Rasha: "This will discover variables that you know how to evaluate, like where the cheese is in the maze--you have access to the ground truth against which you can compare a reporter-system's attempt to read off the position of the cheese from the AI's internals. But this won't extend to variables that you don't know how to evaluate. So this approach to honesty won't solve the part of alignment where, at some point, some mind has to interface with ideas that are novel and alien to humanity and direct the power of those ideas toward ends that humans like."

Thanks for elaborating! This might lead to a crux. Let me summarize the proposals from the post (those summaries can't replace reading the post though).

Outer alignment:

  1. We define something like a set of primitives. Those primitives are independent from any specific ontology.
  2. We prove[1] that as long as AI acts and interprets tasks using those primitives, it can prevent humans from being killed or brainwashed or disempowered. Even if the primitives are not enough to give a very nuanced definition of a "human" or "brainwashing". That's where the "we can express care about incomprehensible things as care about comprehensible properties of incomprehensible things" argument comes into play.

Inner alignment:

  1. We prove that a more complicated model (made of the primitives) can't deceive a simpler model (made of the primitives). The inner/outer alignment of simple enough models can be verified manually.
  2. We prove that the most complicated model (expressible with the primitives) has at least human-level intelligence.

Bonus: we prove that any model (made of the primitives) is interpretable/learnable by humans and prove that you don't need more complicated models for defining corrigibility/honesty. Disclaimer: the proposals above are not supposed to be practical, merely bounded and conceptually simple.

Why the heck would we be able to define primitives with such wildly nice properties? Because of the argument that human ability to comprehend and act in the world limits what humans might currently care about, and the current human values are enough to express corrigibility. If you struggle to accept this argument, maybe try assuming it's true and see if you can follow the rest of the logic? Or try to find a flaw in the logic instead of disagreeing with the definitions. Or bring up a specific failure mode.

To have our values interface appropriately with these novel thinking patterns in the AI, including through corrigibility, I think we have to work with "values" that are the sort of thing that can refer / be preserved / be transferred across "ontological" changes.

If you talk about ontological crisis or inner alignment, I tried to address those in the post. By the way, I read most of your blog post and skimmed the rest.

  1. ^

    To actually prove it we need to fully formalize the idea, of course. But I think my idea is more specific than many other alignment ideas (e.g. corrigibility, Mechanistic Anomaly Detection, Shard Theory). 

[call turns out to be maybe logistically inconvenient]

It's OK if a person's mental state changes because they notice a pink car ("human object recognition" is an easier to optimize/comprehend process). It's not OK if a person's mental state changes because the pink car has weird subliminal effects on the human psyche ("weird subliminal effects on the human psyche" is a harder to optimize/comprehend process).

So, somehow you're able to know when an AI is exerting optimization power in "a way that flows through" some specific concepts? I think this is pretty difficult; see the fraughtness of inexplicitness or more narrowly the conceptual Doppelgänger problem.

It's extra difficult if you're not able to use the concepts you're trying to disallow, in order to disallow them--and it sounds like that's what you're trying to do (you're trying to "automatically" disallow them, presumably without the use of an AI that does understand them).

You say this:

But I don't get if, or why, you think that adds up to anything like the above.

Anyway, is the following basically what you're proposing?

Humans can check goodness of $A_1$ because $A_1$ is only able to think using stuff that humans are quite familiar with. Then $A_1$ is able to oversee $A_2$ because... (I don't get why; something about mapping primitives, and deception not being possible for some reason?) Then $A_n$ is really smart and understands stuff that humans don't understand, but is overseen by a chain that ends in a good AI, $A_1$.

So, somehow you're able to know when an AI is exerting optimization power in "a way that flows through" some specific concepts?

Yes, we're able to tell if AI optimizes through a specific class of concepts. In most/all sections of the post I'm assuming the AI generates concepts in a special language (i.e. it's not just a trained neural network), a language which allows to measure the complexity of concepts. The claim is that if you're optimizing through concepts of certain complexity, then you can't fulfill a task in a "weird" way. If the claim is true and AI doesn't think in arbitrary languages, then it's supposed to be impossible to create a harmful Doppelganger.

But I don't get if, or why, you think that adds up to anything like the above.

Clarification: only the interpretability section deals with inner alignment. The claims of the previous sections are not supposed to follow from the interpretability section.

Anyway, is the following basically what you're proposing?

Yes. The special language is supposed to have the property that $A_k$ can automatically learn if $A_{k+1}$ plans good, bad, or unnecessary actions. $A_n$ can't be arbitrarily smarter than humans, but it's a general intelligence which doesn't imitate humans and can know stuff humans don't know.

Yes. The special language is supposed to have the property that $A_k$ can automatically learn if $A_{k+1}$ plans good, bad, or unnecessary actions. $A_n$ can't be arbitrarily smarter than humans, but it's a general intelligence which doesn't imitate humans and can know stuff humans don't know.

So to my mind, this scheme is at significant risk of playing a shell game with "how the AIs collectively use novel structures but in a way that is answerable to us / our values". You're saying that the simple AI can tell if the more complex AI's plans are good, bad, or unnecessary--but also the latter "can know stuff humans don't know". How?

In other words, I'm saying that making it so that

the AI generates concepts in a special language

but also the AI is actually useful at all, is almost just a restatement of the whole alignment problem.

First things first, defining a special language which creates a safe but useful AGI absolutely is just a restatement of the problem, more or less. But the post doesn't just restate the problem, it describes the core principle of the language (the comprehension/optimization metric) and makes arguments for why the language should be provably sufficient for solving a big part of alignment.

You're saying that the simple AI can tell if the more complex AI's plans are good, bad, or unnecessary--but also the latter "can know stuff humans don't know". How?

This section deduces the above from claims A and B. What part of the deduction do you disagree with or find confusing? Here's how the deduction would apply to the task "protect a diamond from destruction":

  1. $A_1$ cares about an ontologically fundamental diamond. $A_2$ models the world as clouds of atoms.
  2. According to the principle, we can automatically find what object in $A_2$ corresponds to the "ontologically fundamental diamond".
  3. Therefore, we can know what $A_2$ plans would preserve the diamond. We also can know if applying any weird optimization to the diamond is necessary for preserving it. Checking for necessity is probably hard, might require another novel insight. But "necessity" is a simple object-level property.

The automatic finding of the correspondence (step 2) between an important comprehensible concept and an important incomprehensible concept resolves the apparent contradiction.[1]

  1. ^

    Now, without context, step 2 is just a restatement of the ontology identification problem. The first two sections of the post (mostly the first one) explain why the comprehension/optimization metric should solve it. I believe my solution is along the lines of the research avenues Eliezer outlined.

    If my principle is hard to agree with, please try to assume that it's true and see if you can follow how it solves some alignment problems.

the current human values are enough to express corrigibility

Huh? Not sure I understand this. How is this the case?

(I may have to tap out, because busy. At some point we could have a call to chat--might be much easier to communicate in that context. I think we have several background disagreements, so that I don't find it easy to interpret your statements.)

plex:

This is actually pretty cool! Feels like it's doing the type of reasoning that might result in critical insight, and maybe even is one itself. It's towards the upper tail of the distribution of research I've read by people I'm not already familiar with.

I think there are big challenges to this solving AGI alignment, including that this restriction probably bounds the AI's power a lot. But it still feels like a neat idea, and I hope you continue to explore the space of possible solutions.

If something is too hard to optimize/comprehend, people couldn't possibly optimize/comprehend it in the past, so it couldn't be a part of human values.

I don't understand why this claim would be true.

Take the human desire for delicious food; humans certainly didn't understand the chemistry of food and the human brain well enough to comprehend it or directly optimize it, but for millennia we picked foods that we liked more, explored options, and over time cultural and culinary processes improved on this poorly understood goal.

Q Home:

Yes, some value judgements (e.g. "this movie is good", "this song is beautiful", or even "this is a conscious being") depend on inscrutable brain machinery, the machinery which creates experience. The complexity of our feelings can be orders of magnitude greater than the complexity of our explicit reasoning. Does it kill the proposal in the post? I think not, for the following reason:

We aren't particularly good at remembering exact experiences, we like very different experiences, we can't access each other's experiences, and we have very limited ways of controlling experiences. So, there should be pretty strict limitations on how much understanding of the inscrutable machinery is required for respecting the current human values. Defining corrigible behavior ("don't kill everyone", "don't seek power", "don't mess with human brains") shouldn't require answering many specific, complicated machinery-dependent questions ("what separates good and bad movies?", "what separates good and bad life?", "what separates conscious and unconscious beings?").

Also, some thoughts about your specific counterexample (I generalized it to being about experiences in general):

  • "How stimulating or addicting or novel is this experience?" <- I think those parameters were always comprehensible and optimizable, even in the Stone Age. (In a limited way, but still.) For example, it's easy to get different gradations of "less addicting experiences" by getting injuries, starving or not sleeping.
  • "How 'good' is this experience in a more nebulous or normative way?" <- I think this is a more complicated value (aesthetic taste), based on simpler values.
  • Note that I'm using "easy to comprehend" in the sense of "the thing behaves in a simple way most of the time", not in the sense of "it's easy to comprehend why the thing exists" or "it's easy to understand the whole causal chain related to the thing". I think the latter senses are not useful for a simplicity metric, because they would mark everything as equally incomprehensible.
  • Note that "I care about taste experiences" (A), "I care about particular chemicals giving particular taste experiences" (B), and "I care about preserving the status quo connection between chemicals and taste experiences" (C) are all different things. B can be much more complicated than C, B might require the knowledge of chemistry while C doesn't. 

Does any of the above help to find the crux of the disagreement or understand the intuitions behind my claim?

I think the crux might be that I think the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution. 

So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and we can have aesthetic taste (though I don't think this is stationary, so I'm not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea about what the dimensions are), we can even extrapolate, in either naive or complex ways.

But unless values are far simpler than I think they are, I will claim that the naive extrapolation from the sampled points fails more and more as we extrapolate farther from where we are, which is a (or the?) central problem with AI alignment.

Q Home:

Are you talking about value learning? My proposal doesn't tackle advanced value learning. Basically, my argument is "if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values — so we can define safe impact measures and corrigibility". My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is "if A and B hold, then we can draw a box around human values and tell the AI to not mess up the contents of the box — without making the AI useless; yet the AI might not know what exact contents of the box count as 'human values'".[1]

The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than human general ability to comprehend things. I interpreted you as making this counterargument in the top level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I'm not talking about extrapolating something out of distribution. Unless I'm missing your point.

  1. ^

    Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might've failed.

No, the argument above is claiming that A is false.

Q Home:

But

  • To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. At least before humans start unlimited self-modification. I think this logically can't be false.
  • Eliezer Yudkowsky is a core proponent of complexity of value, but in Thou Art Godshatter and Protein Reinforcement and DNA Consequentialism he basically makes a point that human values arose from complexity limitations, including complexity limitations imposed by brainpower limitations. Some famous alignment ideas (e.g. NAH, Shard Theory) kinda imply that human values are limited by human ability to comprehend and it doesn't seem controversial. (The ideas themselves are controversial, but for other reasons.)
  • If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn't it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?

Based on your comments, I can guess that something below is the crux:

  1. You define "values" as ~"the decisions humans would converge to after becoming arbitrarily more knowledgeable". But that's a somewhat controversial definition (some knowledge can lead to changes in values) and even given that definition it can be true that "past human ability to comprehend limits human values" — since human values were formed before humans explored unlimited knowledge. Some values formed when humans were barely generally intelligent. Some values formed when humans were animals.
  2. You say that values depend on inscrutable brain machinery. But can't we treat the machinery as a part of "human ability to comprehend"?
  3. You talk about ontology. Humans can care about real diamonds without knowing what physical things the diamonds are made from. My reply: I define "ability to comprehend" based on ability to comprehend functional behavior of a thing under normal circumstances. Based on this definition, a caveman counts as being able to comprehend the cloud of atoms his spear is made of (because the caveman can comprehend the behavior of the spear under normal circumstances), even though the caveman can't comprehend atomic theory.

Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?

To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. 


You're conflating can and should! I agree that it would be ideal if this were the case, but am skeptical it is. That's what I meant when I said I think A is false.

  • If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn't it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?

That's a very big "if"! And simplicity priors are made questionable, if not refuted, by the fact that we haven't gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.

You define "values" as ~"the decisions humans would converge to after becoming arbitrarily more knowledgeable".

No, I think it's what humans actually pursue today when given the options. I'm not convinced that these values are static, or coherent, much less that we would in fact converge.

You say that values depend on inscrutable brain machinery. But can't we treat the machinery as a part of "human ability to comprehend"?

No, because we don't comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that. (Then we apply pretty-sounding but ultimately post-hoc reasoning to explain it - as I tweeted partly thinking about this conversation.)

Thanks for clarifying! Even if I still don't fully understand your position, I now see where you're coming from.

No, I think it's what humans actually pursue today when given the options. I'm not convinced that these values are static, or coherent, much less that we would in fact converge.

Then those values/motivations should be limited by the complexity of human cognition, since they're produced by it. Isn't that trivially true? I agree that values can be incoherent, fluid, and not converging to anything. But building Task AGI doesn't require building an AGI which learns coherent human values. It "merely" requires an AGI which doesn't affect human values in large and unintended ways.

No, because we don't comprehend them, we just evaluate what we want locally using the machinery directly, and make choices based on that.

This feels like arguing over definitions. If you have an oracle for solving certain problems, this oracle can be defined as a part of your problem-solving ability. Even if it's not transparent compared to your other problem-solving abilities. Similarly, the machinery which calculates a complicated function from sensory inputs to judgements (e.g. from Mona Lisa to "this is beautiful") can be defined as a part of our comprehension ability. Yes, humans don't know (1) the internals of the machinery or (2) some properties of the function it calculates — but I think you haven't given an example of how human values depend on knowledge of 1 or 2. You gave an example of how human values depend on the maxima of the function (e.g. the desire to find the most delicious food), but that function having maxima is not an unknown property, it's a trivial property (some foods are worse than others, therefore some foods have the best taste).

That's a very big "if"! And simplicity priors are made questionable, if not refuted, by the fact that we haven't gotten any convergence about human values despite millennia of philosophy trying to build such an explanation.

I agree that ambitious value learning is a big "if". But Task AGI doesn't require it.

Thanks, this was pretty interesting.

A big problem is the free choice of "conceptual language" (universal Turing machine) when defining simplicity/comprehensibility. You at various points rely on an assumption that there is one unique scale of complexity (one ladder of ), and it'll be shared between the humans and the AI. That's not necessarily true, which creates a lot of leaks where an AI might do something that's simple in the AI's internal representation but complicated in the human's.

It's OK to make cars pink by using paint ("spots of paint" is an easier to optimize/comprehend variable). It's not OK to make cars pink by manipulating individual water droplets in the air to create an elaborate rainbow-like illusion ("individual water droplets" is a harder to optimize/comprehend variable).

This raises a second problem, which is the "easy to optimize" criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain. But if we let environmental availability weigh on "easy to optimize," then the agent will be happy to switch from real paint to a hologram or a human-hack once the technology for those becomes developed and commodified.

When the metric is a bit fuzzy and informal, it's easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.

You at various points rely on an assumption that there is one unique scale of complexity (one ladder of ), and it'll be shared between the humans and the AI. That's not necessarily true, which creates a lot of leaks where an AI might do something that's simple in the AI's internal representation but complicated in the human's.

I think there are many somewhat different scales of complexity, but they're all shared between the humans and the AI, so we can choose any of them. We start with properties () which are definitely easy to understand for humans. Then we gradually relax those properties. According to the principle, the relaxed properties will capture all key variables relevant to human values long before what those properties describe stops being comprehensible even to top human mathematicians and physicists. (Because most of the time, living a value-filled life doesn't require using the best mathematical and physical knowledge of the day.) My model: "the entirety of human ontology >>> the part of human ontology a corrigible AI needs to share".
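
As a toy illustration of "start with strict properties, then gradually relax them" (the specific checks and thresholds below are invented purely for illustration, not the real properties):

```python
# Toy sketch of a "ladder" of comprehensibility classes: a variable lands in the
# first (strictest) class whose relaxed property checks it satisfies.
from typing import Callable, List, Sequence

Trajectory = Sequence[float]                       # how a variable behaves over time
PropertyCheck = Callable[[Trajectory, float], bool]

def small_steps(x: Trajectory, tol: float) -> bool:
    """The variable never jumps by more than `tol` per time step."""
    return all(abs(b - a) <= tol for a, b in zip(x, x[1:]))

def small_acceleration(x: Trajectory, tol: float) -> bool:
    """Consecutive changes differ by at most `tol` (a crude smoothness check)."""
    diffs = [b - a for a, b in zip(x, x[1:])]
    return all(abs(d2 - d1) <= tol for d1, d2 in zip(diffs, diffs[1:]))

CHECKS: List[PropertyCheck] = [small_steps, small_acceleration]

def comprehensibility_class(x: Trajectory,
                            tolerances: Sequence[float] = (0.1, 0.5, 2.0, 10.0)) -> int:
    """Index of the first tolerance level at which x passes every check;
    len(tolerances) if it passes none (the least comprehensible class)."""
    for level, tol in enumerate(tolerances):
        if all(check(x, tol) for check in CHECKS):
            return level
    return len(tolerances)
```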

This raises a second problem, which is the "easy to optimize" criterion, and how it might depend on the environment and on what tech tree unlocks (both physical and conceptual) the agent already has. Pink paint is pretty sophisticated, even though our current society has commodified it so we can take getting some for granted. Starting from no tech tree unlocks at all, you can probably get to hacking humans before you can recreate the Sherwin Williams supply chain.

There are three important possibilities relevant to your hypothetical:

  • If technology T and human hacking are equally hard to comprehend, then (a) we don't want the AI to build technology T or (b) the AI should be able to screen off technology T from humans more or less perfectly. For example, maybe producing paint requires complex manipulations with matter, but those manipulations should be screened off from humans. The last paragraph in this section mentions a similar situation.
  • Technology T is easier to comprehend than human hacking, but it's more expensive (requires more resources). Then we should be able to allow the AI to use those resources, if we want to. We should be controlling how many resources the AI uses anyway, so I'm not introducing any unnatural epicycles here.[1]
  • If humans themselves built technology T which affects them in a complicated way (e.g. drugs), it doesn't mean the AI should build similar types of technology on its own.

My point here is that I don't think technology undermines the usefulness of my metric. And I don't think that's a coincidence. According to the principle, one or both of the below should be true:

  1. Up to this point in time, technology never affected what's easy to optimize/comprehend on a deep enough level.
  2. Up to this point in time, humans never used technology to optimize/comprehend (on a deep enough level) most of their fundamental values.

If neither were true, we would believe that technology radically changed fundamental human values at some point in the past. We would see life without technology as devoid of most non-trivial human values.

When the metric is a bit fuzzy and informal, it's easy to reach convenient/hopeful conclusions about how the human-intended behavior is easy to optimize, but it should be hard to trust those conclusions.

The selling point of my idea is that it comes with a story for why it's logically impossible for it to fail or why all of its flaws should be easy to predict and fix. Is it easy to come up with such story for other ideas? I agree that it's too early to buy that story. But I think it's original and probable enough to deserve attention.

  1. ^

    Remember that I'm talking about a Task-directed AGI, not a Sovereign AGI.

So I didn't read your whole post, but I basically agree that many alignment agendas come down to very similar things [e.g. ARC wordcelery]

The idea you are looking for is the Kolmogorov structure function.  

Could you ELI15 the difference between Kolmogorov complexity (KC) and Kolmogorov structure function (KSF)?

Here are some of the things needed to formalize the proposal in the post:

  1. A complexity metric defined for different model classes.
  2. A natural way to "connect" models. So we can identify the same object (e.g. "diamond") in two different models. Related: multi-level maps.

I feel something like KSF could tackle 1, but what about 2?

Circling back to this. I'm interested in your thoughts. 

I think the Algorithmic Statistics framework [including the K-structure function] is a good fit for what you want here in 2. 

to recall, the central idea is that any object is ultimately just a binary string x that we encode through a two-part code: a code for a finite set of strings S such that x ∈ S, plus a pointer to x within S.

For example, x could encode a dataset while S would encode the typical data strings for a given model probability distribution (from some set of hypotheses), for some small ε. This is a way to talk completely deterministically about a probabilistic model, e.g. an LLM trained in a transformer architecture.

This framework is flexible enough to describe two codes S_1 and S_2 encoding the same x, as required. One can e.g. easily find simple examples of this using mixtures of Gaussians.
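
A toy numerical illustration of the two-part-code bookkeeping (using compressed length as a crude stand-in for Kolmogorov complexity, which is of course uncomputable; everything here is purely illustrative):

```python
# Toy two-part code: describe a finite set S containing the object x, then point
# to x inside S. Compressed length is a crude, purely illustrative proxy for
# Kolmogorov complexity (the real thing is uncomputable).
import math
import zlib
from typing import List

def model_cost_bits(S: List[bytes]) -> float:
    """Approximate cost of describing the set S (the 'model' part of the code)."""
    return 8 * len(zlib.compress(b"\x00".join(sorted(S))))

def two_part_code_bits(x: bytes, S: List[bytes]) -> float:
    """Total length of the two-part code for x given a model S with x in S."""
    assert x in S, "the model must contain the object it describes"
    pointer_bits = math.log2(len(S))  # index of x within S
    return model_cost_bits(S) + pointer_bits

# A singleton model describes x exactly (all cost sits in the model part); a big
# generic set needs a longer pointer, and the crude zlib proxy also overcharges
# its model part, since the true complexity of "all 16-bit strings" is tiny.
x = b"0101010101010101"
print(two_part_code_bits(x, [x]))
print(two_part_code_bits(x, [bytes(f"{i:016b}", "ascii") for i in range(2 ** 16)]))
```

Different choices of S for the same x correspond to different models of the same object, which is the flexibility mentioned above.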

I'd be curious what you think!

Got around to interrogating Gemini for a bit.

Seems like KSF talks about programs generating sets. It doesn't say anything about the internal structure of the programs (but that's where the objects such as "real diamonds" live). So let's say x is a very long video about dogs doing various things. If I apply KSF, I get programs (aka "codes") generating sets of videos. But it doesn't help me identify "the most dog-like thing" inside each program. For example, one of the programs might be an atomic model of physics, where "the most dog-like things" are stable clouds of atoms. But KSF doesn't help me find those clouds. A similarity metric between videos doesn't help either.

My conceptual solution to the above problem, proposed in the post: if you have a simple program with special internal structure describing simple statistical properties of "dog-shaped pixels" (such program is guaranteed to exist), there also exists a program with very similar internal structure describing "valuable physical objects causing dog-shaped pixels" (if such program doesn't exist, then "valuable physical objects causing dog-shaped pixels" don't exist either).[1] Finding "the most dog-like things" in such program is trivial. Therefore, we should be able to solve ontology identification by heavily restricting the internal structure of programs (to structures which look similar to simple statistical patterns in sensory data).
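
A very rough sketch of what "identifying the same object in two models via similar internal structure" could look like (the representation and the matching rule here are toy stand-ins I made up; a real version would need an actual structural-similarity criterion):

```python
# Rough sketch: two models represented as dicts of named components, each with a
# structural "signature" (coarse behavioral properties at the sensory level).
# Objects are identified across models by matching signatures. Everything here
# is a toy stand-in for a real structural-similarity criterion.
from typing import Dict, Tuple

Signature = Tuple[str, ...]

pixel_model: Dict[str, Signature] = {
    "dog_shaped_blob": ("compact", "moves_continuously", "four_limbed_outline"),
    "background":      ("static", "large_area"),
}

physical_model: Dict[str, Signature] = {
    "dog_object":   ("compact", "moves_continuously", "four_limbed_outline"),
    "air":          ("static", "large_area"),
    "atom_cluster": ("numerous", "fast", "tiny"),
}

def connect(model_a: Dict[str, Signature],
            model_b: Dict[str, Signature]) -> Dict[str, str]:
    """Pair up components of two models that share the same signature."""
    by_sig = {sig: name for name, sig in model_b.items()}
    return {name: by_sig[sig] for name, sig in model_a.items() if sig in by_sig}

print(connect(pixel_model, physical_model))
# {'dog_shaped_blob': 'dog_object', 'background': 'air'}
```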

So, to formalize my "conceptual solution" we need models which are visually/structurally/spatially/dynamically similar to the sensory data they model. I asked Gemini about it, multiple times, with Deep Research. The only interesting reference Gemini found is Agent-based models (AFAIU, "agents" just means "any objects governed by rules").

  1. ^

    This is not obvious; it requires analyzing basic properties of human values.

Here - plugging into Gemini should work. 

see also Ilya's 30 paper list to John Carmack. 


EDIT: ah I see you're asking for a more defined question. 
I don't have time to answer in detail atm but it's a very good question. 

One thing you might want to take a look at is: https://homepages.cwi.nl/~paulv/papers/similarity.pdf

https://homepages.cwi.nl/~paulv/papers/itw05.pdf

I apologize, I didn't read in full, but I'm curious if you considered the case of, for example, the Mandelbrot set? A very simple equation specifies an infinitely precise, complicated set. If human values have this property then it would be correct to say the Kolmogorov complexity of human values is very low, but there are still very exacting constraints on the universe for it to satisfy human values.

Don't worry about not reading it all. But could you be a bit more specific about the argument you want to make or the ambiguity you want to clarify? I have a couple of interpretations of your question.

Interpretation A:

  1. The post defines a scale-dependent metric which is supposed to tell how likely humans are to care about something.
  2. There are objects which are identical/similar on every scale. Do they break the metric? (Similar questions can be asked about things other than "scale".) For example, what if our universe contains an identical, but much smaller universe, with countless people in it? Men In Black style. Would the metric say we're unlikely to care about the pocket universe just because of its size?

Interpretation B:

  1. The principle says humans don't care about constraining things in overly specific ways.
  2. Some concepts with low Kolmogorov Complexity constrain things in infinitely specific ways.

My response to B is that my metric of simplicity is different from Kolmogorov Complexity.

Thanks for responding : )

A is amusing, definitely not what I was thinking. B seems like it is probably what I was thinking, but I'm not sure, and don't really understand how having a different metric of simplicity changes things.

While the true laws of physics can be arbitrarily complicated, the behavior of variables humans care about can't be arbitrarily complicated.

I think this is the part that prompted my question. I may be pretty far off from understanding what you are trying to say, but my thinking is basically this: I am not content with the capabilities of my current mind, so I would like to improve it. In doing so, I would become capable of having more articulate preferences, and my current preferences would define a function from the set of possible preferences to an approval rating, such that I would try to improve my mind in a way that makes my new, more articulate preferences the ones I most approve of or find sufficiently acceptable.

If this process is iterated, it defines some path or cone from my current preferences through the space of possible preferences, moving from less to more articulate. It might be that other people would not seek such a thing, though I suspect many would, just with less conscientiousness about what they are doing. It is also possible there are convergent states where my preferences and capabilities would determine a desire to remain as I am. (I am mildly hopeful that that is the case.)

It is my understanding that the Mandelbrot set is not smooth at any scale (not sure if anyone has proven this), but that is the feature I was trying to point out. If people iteratively modified themselves, would their preferences become ever more exacting? If so, then it is true that the "variables humans care about can't be arbitrarily complicated", but the variables humans care about could define a desire to become a system capable of caring about arbitrarily complicated variables.

I think I understand you now. Your question seems much simpler than I expected. You're basically just asking "but what if we'll want infinitely complicated / detailed values in the future?"

If people iteratively modified themselves, would their preferences become ever more exacting? If so, then it is true that the "variables humans care about can't be arbitrarily complicated", but the variables humans care about could define a desire to become a system capable of caring about arbitrarily complicated variables.

It's OK if the principle won't be true for humans in the future, it only needs to be true for the current values. Aligning AI to some of the current human concepts should be enough to define corrigibility and low impact, or to avoid goodharting, i.e. to create a safe Task AGI. I'm not trying to dictate to anyone what they should care about.

Hmm... I appreciate the response. It makes me more curious to understand what you're talking about.

At this point I think it would be quite reasonable if you suggest that I actually read your article instead of speculating about what it says, lol, but if you want to say anything about my following points of confusion I wouldn't say no : )

For context, my current view is that value alignment is the only safe way to build ASI. I'm less skeptical about corrigible task ASI than about prosaic scaling with RLHF, but I'm currently still quite skeptical in absolute terms. Roughly speaking: prosaic kills us; task genie maybe kills us, maybe allows us to make stupid wishes which harm us. I'm kinda not sure if you are focusing on stuff that takes us from prosaic to task genie, or on stuff that helps with the task genie not killing us. I suspect you are not focused on the task genie allowing us to make stupid wishes, but I'd be open to hearing I'm wrong.

I also have an intuition that having preferences about future preferences is synonymous with having those preferences, but I suppose there are also ways in which they are obviously different, i.e. their uncompressed specification size. Are you suggesting that limiting the complexity of the preferences the AI is working off of to something like the complexity of current encodings of human preferences (i.e. human brains) ensures those preferences aren't among the set of preferences that are misaligned because they are too complicated (even though the human preferences are synonymous with more complicated preferences)? I think I'm surely misunderstanding, maybe the way you are applying the natural abstraction hypothesis, or possibly a bunch of things.

Could you reformulate the last paragraph as "I'm confused how your idea helps with alignment subproblem X", "I think your idea might be inconsistent or have a failure mode because of Y", or "I'm not sure how your idea could be used to define Z"?

Wrt the third paragraph. The post is about corrigible task ASI which could be instructed to protect humans from being killed/brainwashed/disempowered (and which won't kill/brainwash/disempower people before it's instructed not to). The post is not about value learning in the sense of "the AI learns plus-minus the entirety of human ethics and can build a utopia on its own". I think developing my idea could help with such value learning, but I'm not sure I can easily back up this claim. Also, I don't know how to apply my idea directly to neural networks.

Could you reformulate the last paragraph

I'll try. I'm not sure how your idea could be used to define human values. I think your idea might have a failure mode around places where people are dissatisfied with their current understanding, i.e. situations where a human wants a more articulate model of the world than they have.

The post is about corrigible task ASI

Right. That makes sense. Sorry for asking a bunch of off topic questions then. I worry that task ASI could be dangerous even if it is corrigible, but ASI is obviously more dangerous when it isn't corrigible, so I should probably develop my thinking about corrigibility.
