I think I have an idea for how we could solve AI Alignment and create an AGI with safe, interpretable thinking. I mean a "fundamentally" safe AGI, not a wildcard that requires extremely specific training not to kill you. Disclaimer: I'm not taking AI safety lightly in any case.

Sorry for the grandiose claim. I'm going to state my idea right away. Then I'll explain its context, give general examples, and discuss the implications of it being true. Then I'll write about AI Alignment directly, suggest a specific thing we can do, and explain why I believe my idea is true. Finally, I'll say what I intend to do if I don't get a discussion, or if the discussion leads to nothing.

My idea will sound too vague and unclear at first. But I think the context will make it clear what I mean. (Clear in the way the mathematical concept of a graph is clear, for example: a graph is a very abstract idea, but it makes sense and is easy to use.)

I'm aware that my post can be seen as hand-waving and bullshit. But I'm writing this to save people I care about.

Please evaluate my post at least as science fiction, and then ask: maybe it's not fiction but just reality? Or: "OK, the examples are weird and silly, but what if everything really works like that?"

Key points of this post:

  • You can "solve" human concepts (including values) by solving semantics. By semantics I mean "meaning construction", something more abstract than language.
  • Semantics is easier to solve than you think. And we're closer to solving it than you think.
  • Semantics is easier to model than you think. You don't even need an AI to start doing it. Just a special type of statistics. You don't even have to start with analyzing language.
  • I believe ideas from this post can be applied outside of AI field. For example, the ideas about hypotheses have consequences for Rationality.

Why do I believe this? Because of this idea:

  1. Every concept (or even random mishmash of ideas) has multiple versions. Those versions have internal relationships, positions in some space relative to each other. You can understand a concept by understanding those internal relationships.
  2. One problem, though: those relationships are "infinitely complex". However, there's a special way to make drastic simplifications. We can study the real relationships through those special simplifications.
  3. What do those "special simplifications" do? They order versions of a concept (e.g. "version 1, version 2, version 3"). They can do this in extremely arbitrary ways. The important thing is that you can merge arbitrary orders into less arbitrary structures. There's some rule for it, akin to the Bayes Rule or Occam's razor. This is what cognition is, according to my theory.

If this is true, we need to find a domain where concepts and their simplifications are easy enough to formalize. Then we need to figure out a model and the rule for merging simplifications. I've got a specific suggestion, a couple of ideas, and many examples.

You may compare my theory to memetics, if memetics made much stronger and more universal claims, for example: "Every single thought is a "meme". Any thought propagates according to the same simple rules. If you understand the propagation of any one type of thought, you understand the propagation of all other types. You may then build AGI or become smarter yourself. Or apply those rules outside of the AI field. By the way, almost any type of thought is open to direct introspection." Even if a theory is vague, such statements have extremely specific implications. Such a theory maximally constrains what needs to be discovered.

Context

This section of the post got so long because I kept getting ideas while writing it. And I need to cover some inferential distance. I want to explain what my idea means for qualia & perception, language, hypothesis generation, and argumentation & decision making.

You're probably used to "bottom-up" theories. They start with specific statements and move to something more general. They explicitly constrain you by saying what can and what can't happen in the world. You can picture the entire theory in your head. Shard theory (a research program) is an example.

My theory is "top-down". It starts with something very general, but gets more specific with each step. It implicitly constrains you by telling you what you can do (and in what way) to explore the world, and which actions will bring you success more easily. You may not immediately notice how specific the theory has become and in what ways it has limited your options. A "top-down" theory is like a gradient. You may not be able to picture the entire theory in your head.

0. Motivation

With my idea I want to:

  • Explain what "meaning" is. Most concepts (in language or in the mind) can't be defined and depend heavily or entirely on context. How do they have any meaning?
  • Explain why "meaning" is important. What are the unexpected implications of the existence of "meaning"?
  • Differentiate human thinking from the many mindless AIs that just memorize words in various contexts. Explain how to build models more similar to humans.
  • Explain what the key rule/mechanic/process of creating meaning is supposed to be.
  • Give ideas that are interesting outside of AI field. Some ideas that can be applied to statistics and Rationality.

The default explanation of meaning ("meaning is a place in a semantic space", or "meaning is a cluster of things") accomplishes only the first of these goals. It doesn't try to differentiate human thinking from simple AIs, explain the process of creating meaning, or give ideas useful outside of the AI field. Nor does it show why "meaning" is an important thing.

Sorry for another bold claim: I think there's simply no alternative to my idea.

The examples below can make you think that this idea "explains everything". But the point of my idea is to give you simple, novel tools that promise an easy way to model/analyze a lot of things. (Just want to make sure there's no misunderstanding.)


1.1 Properties of Qualia

There's the hard problem of consciousness: how is subjective experience created from physical stuff? (Or where does it come from?)

But I'm interested in a more specific question:

  • Do qualia have properties? What are they?

For example, "How do qualia change? How many different qualia can be created?" or "Do qualia form something akin to a mathematical space, e.g. a vector space? What is this space exactly?"

Is there any knowledge contained in the experience itself, not merely associated with it?1 For example, "cold weather can cause a cold (the disease)" is a fact associated with an experience, but isn't very fundamental to the experience itself. And this "fact" is even false; it's a misconception/coincidence.

When you get to know the personality of your friend, do you learn anything "fundamental" or really interesting by itself? Is "loving someone" a fundamentally different experience compared to "eating pizza" or "watching a complicated movie"?

Those questions feel pretty damn important to me! They're about limitations of your meaningful experience and meaningful knowledge. They're about personalities of people you know or could know. How many personalities can you differentiate? How "important/fundamental" are those differences? And finally... those questions are about your values.

Those questions are important for Fun Theory. But they're way more important/fundamental than Fun Theory.

1 Philosophical context for this question: look up Immanuel Kant's idea of "synthetic a priori" propositions.

1.2 Qualia and morality

And those questions are important for AI Alignment. If an AI can "feel" that loving a sentient being and making a useless paperclip are two fundamentally different things, then it might be way easier to explain our values to that AI. The AI would already be biased towards our values. By the way, I'm not implying that an AI has to have qualia; I'm saying that our qualia can point us towards the right model.

I think this observation gets a little bit glossed over: if you have a human-like brain and only care about paperclips... it's (kind of) still objectively true for you that caring about other people would feel way different, way "bigger", etc. You can pretend to escape morality, but you can't escape your brain.

It sounds extremely banal out of context, but the landscape of our experiences and concepts may shape the landscape of our values. Modeling our values as arbitrary utility functions (or artifacts of evolution) misses that completely.

I think the most fundamental prior reason to become moral is "that's just how experience/world works". (Not a sufficient reason, but a very strong bias.)

Conclusion: if the internal structure of concepts exists, it is REALLY good news. Is it likely to exist? I think "yes".

1.3 Meta Qualia

"Meta-qualia" is the knowledge about qualia itself. Or qualia about qualia itself (like "meta-thoughts" are thoughts about thoughts themselves). Or "unreal" qualia in-between normal qualia, qualia that can be accessed only via analysis of other qualia (the same way some meta-thoughts can be accessed only by a long analysis of other thoughts). Or latent mechanisms that correspond to some inner "structures" in qualia.

If knowledge about qualia itself exists, then it's the most general and deep type of knowledge. Because such knowledge may be true for any conscious being with any qualia. If some conscious beings live without our physics and our math, "meta qualia" facts may still hold true for them. It's like "Cogito, ergo sum", but potentially deeper and less vacuous.


2.1 Mystery Boxes

Box A

There's a mystery Box A. Each day you find a random object inside of it. For example: a ball, a flower, a coin, a wheel, a stick, a tissue...

Box B

There's also another box, the mystery Box B. One day you find a flower there. Another day you find a knife. The next day you find a toy. Next - a gun. Next - a hat. Next - shark's jaws...

...

How to understand the boxes? If you could obtain all items from both boxes, you would find... that those items are exactly the same. They just appear in a different order, that's all.

I think the simplest way to understand Box B is this: you need to approach it with a bias, with a "goal". For example "things may be dangerous, things may cause negative emotions". In its most general form, this idea is unfalsifiable and may work as a self-fulfilling prophecy. But this general idea may lead to specific hypotheses, to estimating specific probabilities. This idea may just save your life if someone is coming after you and you need to defend yourself.

The content of both boxes changes in arbitrary ways. But the content changes of the second box come with an emotional cost.

There are many, many other boxes; understanding them requires more nuanced biases and goals.

I think those boxes symbolize concepts (e.g. words) and the way humans understand them. I think a human understands a concept by assigning "costs" to its changes of meaning. "Costs" come from various emotions and goals.

"Costs" are convenient: if any change of meaning has a cost, then you don't need to restrict the meaning of a concept at all. If a change has a cost, then it's meaningful regardless of its predictability.

One day you find a lot of money in Box B. When you open the box the next day... a nasty beetle bites you in the leg. You weren't expecting that; it was "out of the box". But the gradient of the "cost" was leading you in the right direction anyway. A completely unexpected, but extremely meaningful and insightful event.
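The "cost" idea above can be sketched as toy code. This is purely illustrative: the item list and the per-item "danger" scores are numbers I invented, not part of any real model.

```python
# Toy sketch of the mystery boxes: the same items, two different orders.
# The per-item "danger" scores below are invented for illustration.
danger = {"ball": 0, "flower": 0, "coin": 0, "wheel": 0, "stick": 1,
          "tissue": 0, "knife": 3, "toy": 0, "gun": 4, "hat": 0, "jaws": 5}

box_a = ["ball", "flower", "coin", "wheel", "stick", "tissue"]
box_b = ["flower", "knife", "toy", "gun", "hat", "jaws"]

def emotional_cost(sequence, bias):
    """Total 'cost' of the changes of content, under a given bias."""
    return sum(abs(bias[b] - bias[a]) for a, b in zip(sequence, sequence[1:]))

print(emotional_cost(box_a, danger))  # 2: content drifts with little cost
print(emotional_cost(box_b, danger))  # 19: each change carries an emotional cost
```

The same pool of items produces very different total costs depending on the order in which they appear, which is the sense in which the order itself carries meaning.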

2.2 More Boxes

More examples of mystery boxes:

  1. The first box may alternate positive and negative items.
  2. The second box may alternate positive, directly negative and indirectly negative items. For example, it may show you a knife (directly negative) and then a feather (indirectly negative: a bird can be killed by a knife). Or some worms and then an apple.
  3. The third box may alternate positive, negative and "subverted" items. For example, it may show you a seashell (positive), and then show you a shark's jaws (negative). But both sharks and seashells have a common theme, so "seashell (positive)" got subverted.
  4. The fourth box may alternate negative items and items that "neutralize" negative things. For example, it may show you a sword, but then show you a shield.
  5. The fifth box may show you that every negative thing has many related positive things.

You can imagine a "meta box", for example a box that alternates between being the 1st box and the 2nd box. Meta boxes can "change their mood".

I think, in a weird way, all those boxes are very similar to human concepts and words.

The more emotions, goals and biases you learn, the easier it gets for you to understand new boxes. But those "emotions, goals, biases" are themselves like boxes.

2.3 Words

This is a silly, wacky subjective example. I just want to explain the concept.

Here are some meanings of the word "beast":

  • (archaic/humorous) any animal.
  • an inhumanly cruel, violent, or depraved person.
  • a very skilled human. example: "Magnus Carlsen (the chess player) is a beast"
  • something very different and/or hard. example: "Reading modern English is one thing, but understanding Shakespeare is an entirely different beast."
  • a person's brutish or untamed characteristics. example: "The beast in you is rearing its ugly head"

What are the internal relationships between these meanings? If these meanings create a space, where is each of the meanings? I think the full answer is practically unknowable. But we can "probe" the full meaning, we can explore a tiny part of it:

Let's pick a goal (bias), for example: "describing deep qualities of something/someone". If you have this goal, the negative meaning ("cruel person") of the word is the main one for you. Because it can focus on the person's deep qualities the most: it may imply that the person is rotten to the core. The positive meaning focuses a lot on skills, and the archaic meaning is just a joke. The 4th meaning doesn't focus on specific internal qualities. The 5th meaning may separate the person from their qualities.

When we added a goal, each meaning started to have a "cost". This cost illuminates some part of the relationships between the meanings. If we could evaluate an "infinity" of goals, we would know those relationships perfectly. But I believe you can get quite a lot of information by evaluating just a single goal. Because a "goal" is a concept too, you're bootstrapping your learning. And I think this matches closely with the example about the mystery boxes.

...

By combining a couple of goals we can make an order of the meanings, for example: beast 1 (rotten to the core), beast 2 (skilled and talented person), beast 3 (bad character traits), beast 4 (complicated thing), beast 5 (any animal). This order is based on "specificity" (mostly) and "depth" of a quality: how specific/deep is the characterization?

Another order: beast 1 (not a human), beast 2 (worse than most humans), beast 3 (best among professionals), beast 4 (not some other things), beast 5 (worse than yourself). This order is based on the "scope" and "contrast": how many things contrast with the object? Notice how each order simplifies and redefines the meanings. If you make enough such orders, you may start getting synesthesia-like experiences about meanings, feeling their "depth" and "scope" and other properties.
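The "goal induces an order" idea can be sketched in a few lines. The meaning labels and their feature scores below are subjective placeholders I made up for illustration; a real model would have to learn them.

```python
# Toy sketch: ordering the meanings of "beast" under a goal (bias).
# Feature scores are invented, subjective placeholders.
meanings = {
    "any animal":     {"specificity": 1, "depth": 1},
    "cruel person":   {"specificity": 5, "depth": 5},
    "skilled human":  {"specificity": 4, "depth": 3},
    "hard thing":     {"specificity": 2, "depth": 2},
    "untamed traits": {"specificity": 3, "depth": 4},
}

def order_by(goal_weights):
    """A 'goal' is just a weighting of features; it induces an order."""
    score = lambda m: sum(goal_weights[k] * v for k, v in meanings[m].items())
    return sorted(meanings, key=score, reverse=True)

# A goal like "describe deep qualities" weights depth heavily,
# so the "cruel person" meaning comes out on top:
print(order_by({"specificity": 1, "depth": 2}))
```

Changing the weights (the "goal") reshuffles the order, which is the sense in which each goal "probes" a different part of the meaning.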

But I want to illustrate the process of combining goals/biases with a real order:

2.4 Grammar Rules

You may treat this part of the post as complete fiction. But it illustrates how biases can be combined. And this is the most important thing about biases.

Grammar rules are concepts too. Sometimes people use quite complicated rules without even realizing it, for example:

See "Adjective order", or the video "Adjectives: order" by Tom Scott.

There's a popular order: opinion, size, physical quality or shape, age, colour, origin, material, purpose. What created this order? I don't know, but I know that certain biases could make it easier to understand.

Take a look at this part of the order: opinion, age, origin, purpose. You could say all of those are not "real" properties. They seem to progress from less related/specific to the object to more related/specific. If you operate under this bias (relatedness/specificity), swapping the adjectives may lead to funny changes of meaning. For example: "bad old wolf" (objective opinion), "old bad wolf" (intrinsic property or a cheesy, overblown opinion), "old French bad wolf" (a subspecies of the "French wolf"). Recall how the mystery boxes created meaning using the order of items.

Another part of the order: size, physical quality or shape, color, material. You can say all those are "real" physical properties. "Size" could be possessed by a box around the object. "Physical quality" and "shape" could be possessed by something wrapped around the object. "Color" could be possessed by the surface of the object. "Material" can be possessed only by the object itself. So physical qualities progress like layers of an onion.

You can combine those two biases ("relatedness/specificity" + "onion layers") using a third bias and some minor rules. The third bias may be "attachment". Some of the rules: (1) an adjective is attached either to some box around the object or to some layer of the object (2) you shouldn't postulate boxes that are too big. It doesn't make sense for an opinion to be attached to the object stronger than its size box. It doesn't make sense for age to be attached to the object stronger than its color (does time pass under the surface layer of an object?). Origin needs to be attached to some layer of the object (otherwise we would need to postulate a giant box that contains both the object and its place of origin). I guess it can't be attached stronger than "material" because material may expand the information about origin. And purpose is the "soul" of the object. "Attachment" is a reformulation of "relatedness/specificity", so we only used 2.5 biases to order 8 things. Unnecessary biases just delete themselves.

Of course, this is all still based on complicated human intuitions and high level reasoning. But, I believe, at the heart of it lies a rule as simple as the Bayes Rule or Occam's razor. A rule about merging arbitrary connections into something less arbitrary.

...

I think stuff like sentence structure/word order (or even morphology) is made of amalgamations of biases too. Similarly, differences between words (e.g. "what vs. which") can be described as biases ("which" is a bias towards something more specific).

Sadly, it's quite useless to think about this. We don't have enough orders like this one. And we can't create such orders ourselves (as a game), i.e. we can't model this; it's too subjective or too complicated. We have nothing to play with here. But what if we could do all of this for some other topic?


3.1 Argumentation

I believe my idea has some general and specific connections to hypothesis generation and argumentation. The most trivial connection is that hypotheses and arguments use concepts and are themselves concepts.

You don't need a precisely defined hypothesis if any specification of your hypothesis has a "cost". You don't need to prove and disprove specific ideas; you may do something similar to "gradient descent". You have a single landscape with all your ideas blended together, and you just slide over this landscape. The same goes for arguments: I think it is often sub-optimal to try to come up with a precise argument. Or to waste time atomizing your concepts in order to fix every inconsequential "inconsistency".

A more controversial idea would be that (1) in some cases you can apply wishful thinking, since "wishful thinking" is able to assign emotional "costs" to theories (2) in some cases motivated reasoning is even necessary for thinking. My theory already proposes that meaning/cognition doesn't exist without motivated reasoning.

3.2 Working with hypotheses

My idea suggests a new specific way to work with hypotheses.

A quote from Harry Potter and the Methods of Rationality, Chapter 22 ("The Scientific Method"):

Observation:

Wizardry isn't as powerful now as it was when Hogwarts was founded.

Hypotheses:

  1. Magic itself is fading.
  2. Wizards are interbreeding with Muggles and Squibs.
  3. Knowledge to cast powerful spells is being lost.
  4. Wizards are eating the wrong foods as children, or something else besides blood is making them grow up weaker.
  5. Muggle technology is interfering with magic. (Since 800 years ago?)
  6. Stronger wizards are having fewer children.

...

I think each idea has an "infinity" of nuances and an "infinity" of versions. It's very hard to understand where to start in order to check all those ideas.

But we can reformulate the hypotheses in terms of each other (simplifying them along the way), for example:

  • (1) Magic is fading away by itself. (2) Magic mixes with non-magic. (3) Pieces of magic are lost. (4) Something affects the magic. (5) The same as 2 or 4. (6) Magic creates less magic.
  • (1) Pieces of magic disappear by themselves. (2) ??? (3) Pieces of magic containing spells disappear. (4) Wizards don't consume/produce enough pieces of magic. (5) Technology destroys pieces of magic. (6) Stronger wizards produce fewer pieces of magic.

Why do this, again? I think it makes the hypotheses less arbitrary and highlights what we really know. And it raises questions that are important across many theories: Can magic be split into discrete pieces? Can magic "mix" with non-magic? Can magic be stronger or weaker? Can magic create itself? By the way, those questions would save us from trying to explain a nonexistent phenomenon: maybe magic isn't even fading in the first place. Do we really know that?

3.3 New Occam's Razor, new probability

And this way, hypotheses are easier to order according to our a priori biases. We can order hypotheses exactly the same way we ordered meanings, if we reformulate them to sound equivalent to each other. Here's an example of how we can re-order some of the hypotheses:

  1. Pieces of magic disappear by themselves.
  2. Pieces of magic containing spells disappear.
  3. Wizards don't consume/produce enough pieces of magic.
  4. Stronger wizards produce fewer pieces of magic.
  5. Technology destroys pieces of magic.

The hypotheses above are sorted by 3 biases: "Does it describe HOW magic disappears? / Does magic disappear by itself?" (stronger positive weight), "How general is the reason for the disappearance of magic?" (weaker positive weight), and "novelty compared to other hypotheses" (strong positive weight). "Pieces of magic containing spells disappear" is, in a way, the most specific hypothesis here, but it definitely describes HOW magic disappears (and gives a lot of new information about it), so it's higher on the list. "Technology destroys pieces of magic" doesn't give any new information about anything whatsoever, only a specific random possible reason, so it's the most irrelevant hypothesis here. By the way, those 3 different biases are just different sides of the same coin: "magic described in terms of magic/something else", "specificity" and "novelty" are all types of "specificity". Or of novelty. Biases are concepts too; you can reformulate any of them in terms of the others.
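The ranking can be sketched as a weighted sum of biases. Every score and weight below is a placeholder I invented; with these particular numbers, the sketch happens to reproduce the order in the list above.

```python
# Toy sketch: rank the reformulated hypotheses by a weighted sum
# of biases. All scores and weights are invented placeholders.
hypotheses = {
    "Pieces of magic disappear by themselves":     {"how": 5, "generality": 5, "novelty": 5},
    "Pieces of magic containing spells disappear": {"how": 5, "generality": 2, "novelty": 4},
    "Wizards don't consume/produce enough pieces": {"how": 2, "generality": 3, "novelty": 3},
    "Stronger wizards produce fewer pieces":       {"how": 2, "generality": 2, "novelty": 2},
    "Technology destroys pieces of magic":         {"how": 1, "generality": 1, "novelty": 1},
}
# "how" and "novelty" get stronger positive weights, as in the text.
weights = {"how": 3, "generality": 1, "novelty": 3}

def rank(hyps, w):
    score = lambda h: sum(w[k] * v for k, v in hyps[h].items())
    return sorted(hyps, key=score, reverse=True)

for h in rank(hypotheses, weights):
    print(h)
```

The point is the shape of the computation: hypotheses only become comparable once they're reformulated into shared terms, and a "bias" is then just a weight over those terms.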

When you deal with hypotheses that aren't "atomized" and specific enough, Occam's Razor may be impossible to apply, because the complexity of a hypothesis is subjective in such cases. What I described above solves that: complexity is combined with other metrics and evaluated only "locally". By the way, you can update the concept of probability in a similar fashion. You can split "probability" into multiple connected metrics and use an amalgamation of those metrics in cases where you have absolutely no idea how to calculate the ratio of outcomes.

3.4 "Matrices" of motivation

You can analyze arguments and reasons for actions using the same framework. Imagine this situation:

You are a lonely person on an empty planet. You're doing physics/math. One day you encounter another person, even though she looks a little bit like a robot. You become friends. One day your friend gets lost in a dangerous forest. Do you risk your life to save her? You come up with some reasons to try to save her:

  • I care about my friend very much. (A)
  • If my friend survives, it's the best outcome for me. (B)
  • My friend is a real person. (C)

You can explore and evaluate those reasons by formulating them in terms of each other or in other equivalent terms.

  • "I'm 100% sure I care. (A) Her survival is 90% the best outcome for me in the long run. (B) She's probably real. (C)" This evaluates the reasons by "power" (basically, probability).
  • "My feelings are real. (A) The goodness/possibility of the best outcome is real. (B) My friend is probably real. (C)" This evaluates the reasons by "realness".
  • "I care 100%. (A) Her survival is 100% the best outcome for me. (B) She's 100% real. (C)." This evaluates the reasons by "power" strengthened by emotions: what if the power of emotions affected everything else just a tiny bit? By a very small factor.
  • "Survival of my friend is the best outcome for me. (B) The fact that I ended up caring about my friend is the best thing that happened to me. Physics and math aren't more interesting than other sentient beings. (A) My friend being real is the best outcome for me. But it isn't even necessary, she's already "real" in most of the senses. (C)" This evaluates the reasons by the quality of "being the best outcome" (for you and for her).

Some evaluations may affect others, merge together. I believe the evaluations written above only look like precise considerations, but actually they're more like meanings of words, impossible to pin down. I gave this example because it's similar to some of my emotions.

I think such thinking is more natural than applying a pre-existing utility function that doesn't require any cognition. Utility of what exactly should you calculate? Of your friend's life? Of your own life? Of your life with your friend? Of your life factored by your friend's desire "be safe, don't risk your life for me"? Should you take into account change of your personality over time? I believe you can't learn the difference without working with "meaning".
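A minimal sketch of such a "matrix": the same three reasons scored under different lenses, then merged. All the numbers are subjective placeholders loosely based on the evaluations above, and averaging stands in for whatever the real merge rule would be.

```python
# Toy sketch: reasons A, B, C evaluated under different "lenses",
# then merged. All scores are invented placeholders.
evaluations = {
    "power":                 {"A": 1.0, "B": 0.9, "C": 0.7},
    "realness":              {"A": 1.0, "B": 1.0, "C": 0.7},
    "emotion-boosted power": {"A": 1.0, "B": 1.0, "C": 1.0},
}

def merged(evals):
    """Merge lenses by averaging each reason's score across all of them."""
    reasons = {}
    for lens in evals.values():
        for r, v in lens.items():
            reasons.setdefault(r, []).append(v)
    return {r: sum(vs) / len(vs) for r, vs in reasons.items()}

print(merged(evaluations))
```

Averaging is of course far too crude; the point is only the shape of the data: one row per lens, one column per reason, with the lenses affecting each other through the merge.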


3.5 Gradient for theories

You can model opinions and theories as simple equations, for example "evolution = values". Then you can explore those equations by ordering them according to different biases.

For example, take a look at those 2 ideas:

  • Our values are a product of our experience.
  • Our values are a product of evolution.

We can imagine different combinations of those 2 ideas and order them according to some biases:

  1. Values are a product of experience.
  2. Values are a product of evolution. But "evolution" (in some way) is mostly analogous to experience.
  3. Values were a product of evolution. But now they're independent of evolution. And created independently.
  4. Values are a product of evolution. But evolution is a little bit analogous to experience.
  5. Values are a product of evolution. Nothing else (almost).

Those opinions are ordered according to "idealism" (the main factor: how idealistic is the theory?) and "the amount of new information" (how much new information does the theory give?). If you combine both biases: how much new information about something idealistic does the theory give? This order reflects how I rank those theories. I believe my bias towards "idealism" is justified by the type of evidence I have. And by my goal: I want a general theory. And by what's possible for me: I believe I can't reach a specific theory for humans before reaching the general one. I mean, at the end of the day, my goal is to learn something about values, not about evolution. But this is all just an example.

You can view the order above as a decision tree or something similar. It shows you a "good road" (idealism), a "bad road" (evolution), and ways to return from the bad road back to the good one. Sorry for the pop-sci analogy, but maybe it's more like the way a photon travels through all possible paths: you don't believe in the single theory ranked highest; you believe in all the theories that can return to the "good road". (That is, return some useful information to the "good road".)


4.1 Synesthesia

I think applying my idea to perception leads to synesthesia-like experiences.

Imagine a face. When you don't simplify it, you just see a face and emotions expressed by it. When you simplify it too much, you just see meaningless visual information (geometric shapes and color spots).

But I believe there's something very interesting in-between. When information is complex enough to start making sense, but isn't complex enough to fully represent a face. You may see unreal shapes (mixes of "face shapes" and "geometric shapes"... or simplifications of specific face shapes) and unreal emotions (simplifications of specific emotions) and unreal face textures (simplifications of specific face textures).

How do we know that those simplifications exist? My idea just postulates that they do. A "face" is a concept, so you can compare faces by creating "simplifications" of them. And the same is true for any other experience/mixture of experiences. I experienced that myself.

...

I want to repeat: I think the most fundamental prior reason to become moral is "that's just how experience/world works".

All qualia are created by comparing different versions of experience. But different "versions" of your experience are other people. Experience makes you deeply entangled with other people. The "gradient" of experience leads towards becoming more entangled with other people.

4.2 Unsupervised learning

I'm not sure, but maybe my idea leads to unsupervised learning.

If you experience the world the way I described, then it's very easy for you to find a goal: just analyzing different things and experiences, for example.


5.1 Complexity of values

You may skip sections 5.1 and 5.2.

We know that the first idea is true and the second one is false:

  • Human values are complex. We have many different values.
  • Human values are simple. We have a single value.

But I think there's no contradiction between the two ideas. It's not so easy to judge what's simple and what's complex, or if there's a single value or many values. I think both ideas are true, but the second is more true in my framework.

My idea suggests that biases tend to add up, merge together. I think human values do add up to something simple, at least on some reasonable level of simplification. Some moral "macro bias".

5.2 Macro biases

This is more of a speculation. But my idea says that cognition works by merging biases. And groups of individuals (communicating with each other) should work the same way too. So the appearance of some "macro biases" is more likely than not.

For example, I guess there should be a "macro bias" that characterizes the English language and differentiates it from other languages. At least on some level of simplification.

And I guess there should be "macro biases" that differentiate thinking of different people. I think Rationality would be more effective if it could address those biases. This is a controversial opinion, but I think no one, including every rationalist, operates on logic and data alone.

If we perceive the world by merging biases, some "macro biases" in our minds are bound to occur.


6.1 "Anti-orders" and "hyper orders"

Imagine you're ordering different beings according to the bias/gradient of "strength":

a small fish < a snake < a medieval knight < a dragon

And then you encounter a really BUFF, platinum chihuahua. With a bazooka (for good measure). The bazooka shoots small black holes. (Other weird warriors you encounter: a small bird moving faster than a bullet, an antimatter lion, a sentient ocean.)

Is the chihuahua weaker than a snake or stronger than a dragon? You can try to fit it into the power order, but it may distort the gradient.

So, you may place the chihuahua in an "anti-order" or "hyper order". "Anti-order" is like a shadow of the normal order, an order for objects with ambiguous gradient. An order and anti-order are somewhat similar to positive and negative numbers: when you take an object and "anti-object", you may compare them directly or you may try to take their "absolute values".

"Hyper orders" (and "hyper anti-orders") are for things that would have too big or too small gradients in the normal order/anti-order. Continuing the useless number analogy, "hyper orders" are somewhat like transfinite numbers.

I think this topic is interesting because it gives insight about orders. But I also think that "anti-orders" and "hyper orders" are related to strong and unusual human emotions. I mean, this follows from my idea: "emotions" are concepts too, you can simplify and order them.

6.2 Unusual and strong experiences

You can skip to "7. Acasual trade".

These are not very serious or useful "models"; I just want to paint a picture. But orders are never 100% objective anyway; they're simplifications of concepts:

I think "humorous" things are anti-objects or hyper objects compared to normal things. For example, the BUFF chihuahua.

Uncanny valley things are anti-objects or hyper objects compared to normal ones.

Painful experiences form an anti-order compared to pleasurable experiences. Extremely painful experiences form a hyper order.

Your experiences of eating food form a normal order. Your experiences of deeply loving someone form a hyper order.

Imagine some normal places. It's a normal order. Now imagine The Backrooms/The Poolrooms/Weirdcore/Dreamcore/some places from your dreams (other things I mentioned here are internet folklore and aesthetics). It's an anti-order or hyper order.

Ego death/ego-loss and meditation and the flow state are "anti-states" and "hyper states" of your ego.

Remember, my framework is supposed to model qualia and "meta qualia".

6.3 Unusual emotions

Imagine that you order happy and sad songs (by sound, not by lyrics), more or less arbitrarily. Of course, there will be a lot of ambiguous mixes of happy and sad emotions.

But the idea of "anti-gradients" and "hyper gradients" suggests that you will eventually stumble on something too ambiguous: some ambiguity that can't be resolved, some perfect balance between contradictory emotions. I've experienced it very rarely; I've come across only about 15 examples (that work for me) in my entire life. About 6 of them stimulated my fantasy the most and had a very unusual effect on me.

If you analyze music as combinations of emotions (or other "gradients"), you'll eventually stumble on such songs.

This is all subjective, but here are some examples that work for me:

  • Pantera - Floods. A very famous outro. One way to simplify my feelings: it sounds like a combination of erratic panic and calm hope. Contradicting emotions and contradicting speeds.
  • The Avalanches - Electricity (Dr. Rockit's Dirty Kiss). My simplification of the intro: it sounds like a combination of a pompous, confident march and melancholy, some abandoned base in a dense forest in a post-apocalyptic world without humans. Has the strongest effect on me.
  • Younger Brother - Your Friends Are Scary. Simplification: it sounds like a combination of a very steady, constant rhythm and a constantly speeding-up rhythm; a combination of being scared and feeling confident.
  • Younger Brother - Psychic Gibbon. Simplification: it sounds like happy confidence combined with a cry of despair and utter disgust.

More:

When you simplify the sound texture of a song, you're making a "mask" of the song. You can try to remember the sound of the song by thinking about the mask. You can even overlay the mask of one song over another song in your mind: fill the mask of one song with the sounds of another song.

6.4 Meaning

Have you ever heard a song and thought "this is a really interesting story, even though I can't imagine any specific detail of it"?

Have you ever watched a movie and thought "ah, something epic is going to happen, even though I have no idea what exactly!"?

Have you ever struggled with a complex idea and thought "I almost got it, but I don't get it yet"?

Have you ever loved someone and thought "I can't express my love, but I want to say/do something"?

I think there's a special itch associated with some "pure" meaning itself. Or meaning and anticipation. I think it's described by gradient too. I haven't thought about it much yet, but I believe it's important.


7. Acausal trade

Acausal Trade

There's a Blue machine and a Red machine. Blue wants to eat blue candies; Red wants to eat red ones. Each machine is gonna be isolated in either a Blue Candy world or a Red Candy world. Both can get unlucky.

They win the largest amount of candy if Blue commits to build Red and vice-versa. (I could get the example wrong, but I think it doesn't matter here.)
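A minimal expected-value sketch of this trade, under my own assumed numbers (the candy count and 50/50 world placement are assumptions, not from the original example):

```python
# Toy model of the Blue/Red trade above. Each machine lands in a
# Blue-candy or Red-candy world with probability 1/2. A machine only
# values candy of its own color, wherever a machine of its type eats it.

CANDY = 10  # candies available in each world (assumed number)

def expected_candy(commit):
    """Expected blue candy eaten by Blue-type machines across placements."""
    total = 0.0
    for world in ("blue", "red"):  # which world Blue is placed in
        if world == "blue":
            total += 0.5 * CANDY   # Blue eats the blue candy itself
        elif commit:
            total += 0.5 * CANDY   # Blue builds Red here; by symmetry,
                                   # Red builds Blue in the blue world,
                                   # so blue candy still gets eaten
    return total

print(expected_candy(commit=False))  # 5.0  -- an unlucky placement wastes candy
print(expected_candy(commit=True))   # 10.0 -- the trade rescues the unlucky case
```

Committing doubles the expected candy for both machine types, which is the whole point of the trade.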

Normal orders, anti-orders and hyper orders are like different worlds. However, the gradient in one world depends on the gradients in the other worlds. You can think of concepts (or biases) as doing acausal trade across those worlds. Each bias wants to be more important; each concept wants to be higher in the order.

The final gradient is the result of acausal trade between the concepts. The same trade happens within a single order too.

Maybe the idea of the "gradient of concepts" immediately leads to the idea of acasual trade. When you think in terms of the gradient, you don't really know your values (or what they'll be in the future), you know only the direction of the evolution of your values. You always walk forward blind. So, you need to make some trades at least with yourself. Remember 3.4 "Matrices" of motivation. I can't draw a clear connection, but I'm sure it means something.

Maybe the quantum sci-pop analogy is relevant again: you don't decide to care about (X) because of a single path of thinking (your path). When you decide to care about (X), you care about everyone who could care about (X) in whatever unlikely circumstances. (Another thought:) When you kick someone, you kick yourself, so you need to make really sure you would agree to be kicked.

Good news: the simplest model of acausal trade may be applicable to the core of all cognition. (Check out the link: "an example of acausal trade with simple resource requirements".)


Alignment

What about AI Alignment itself? How do we make sure that "Gradient AGI" doesn't kill us? How can "Gradient AGI" save us from potentially more practical "Paperclip AGI" developed by some greedy people? We have at least two problems.

I see those possibilities:

  • I think there's a chance that my idea will give us "absolute" understanding of the way human values work. We can apply this understanding to (testing) any AI design.
  • I believe my idea is (currently) the best way to understand human values. It's our best chance anyway.
  • We may align people and make ourselves smarter. By understanding our cognition better.

I believe it's possible/necessary. It's my deepest goal. I'll tell you why later.

  • We may ask Gradient AGI to align people, if Gradient AGI is easier to build than Paperclip AGI.
  • We may slap Gradient AGI on top of Paperclip AGI as its "head". If Gradient AGI is easy enough to build.
  • Maybe "Gradient AI" (not AGI) can constrain Paperclip AGI in some way.
  • If we deal only with Gradient AGI, we probably have all possible options for alignment. Any moral idea or warning you can explain to a human we can explain to such AGI (and we will have more and better ideas). It's 100% safe in the sense that it's not an evil genie. It understands meaning and the possibility of being mistaken (like we do).

I think "Gradient AGI" would be superintelligent, at least compared to us today.

I also think that Paperclip AGI (or any other utility maximizer) isn't really general in the philosophical sense. Maybe it's a useless fact, maybe it's not.


Action

If my idea is true, what can we do?

  1. We need to figure out the way to combine biases.
  2. We need to find some objects that are easy to model.
  3. We need to find "simplifications" and "biases" for those objects that are easy to model.

We may start with some absolutely useless objects.

What can we do? (in general)

However, even from made-up examples (not connected to a model) we can already extract some general ideas:

  • Different versions of a concept always get described in equivalent terms and simplified. (When a "bias" is applied to the concept.)
  • Orders lead to "anti-orders" and "hyper orders". They can give insight how biases work.
  • Multiple biases may turn the concept into something like a matrix?
  • Combined biases are similar to a decision tree.
  • Acausal trade (in its simplest models) may partly describe how biases combine.

It's not fictional evidence because at this point we're not seeking evidence, we're seeking a way to combine biases.

What specific thing can we do?

I have a topic in mind: (because of my synesthesia-like experiences)

You can analyze shapes of "places" and videogame levels (3D or even 2D shapes) by making orders of their simplifications. You can simplify a place by splitting it into cubes/squares, creating a simplified texture of a place. "Bias" is a specific method of splitting a place into cubes/squares. You can also have a bias for or against creating certain amounts of cubes/squares.

  1. 3D and 2D shapes are easy to model.
  2. Splitting 3D/2D shapes into cubes or squares is easy to model.
  3. Measuring the amount of squares/cubes in an area of a place is easy to model.
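The three steps above can be sketched in code. This is a minimal sketch under my own assumptions: a "place" is a 2D occupancy grid, a "bias" is the cell size used to split it into squares, and the "simplified texture" is which squares are filled.

```python
# A hedged sketch of the place-simplification idea (my own assumptions):
# step 2 splits a 2D grid into squares, step 3 counts the filled ones.

def split_into_squares(place, cell):
    """Split a 2D grid (list of rows of 0/1) into cell x cell squares;
    a square counts as 'filled' if any of its points are filled."""
    rows, cols = len(place), len(place[0])
    squares = []
    for r in range(0, rows, cell):
        row = []
        for c in range(0, cols, cell):
            block = [place[i][j]
                     for i in range(r, min(r + cell, rows))
                     for j in range(c, min(c + cell, cols))]
            row.append(1 if any(block) else 0)
        squares.append(row)
    return squares

def filled_count(squares):
    """Step 3: measure the amount of filled squares in an area."""
    return sum(sum(row) for row in squares)

# A toy 4x4 "place" with an L-shaped filled region.
place = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [1, 0, 0, 0],
         [1, 1, 1, 1]]

coarse = split_into_squares(place, 2)  # bias: split into 2x2 squares
print(coarse)                # [[1, 0], [1, 1]]
print(filled_count(coarse))  # 3
```

Different cell sizes (or different rules for what counts as "filled") would be different biases over the same place, and their outputs could then be ordered and compared.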

Here's my post about it: "Colors" of places. The post gets specific about the way(s) of evaluating places. Everything I wrote in this post applies there. I believe it's specific enough so that we could come up with models. I think this is a real chance.

I probably explained everything badly in that post, but I could explain it better with feedback.

Maybe we could analyze people's faces the same way, I don't know if faces are easy enough to model. Maybe "faces" have too complicated shapes.

My evidence

I've always had an obsession with other people.

I compared any person I knew to all other people I knew. I tried to remember faces, voices, ways to speak, emotions, situations, media associated with them (books, movies, anime, songs, games).

If I learned something from someone (be it a song or something else), I associated this information with them and remembered the association "forever". To the point where any experience was associated with someone. Those associations weren't something static, they were like liquid or gas, tried to occupy all available space.

At some point I knew that they weren't just "associations" anymore. They turned into synesthesia-like experiences. Like a blind person in a boat, one day I realized that I'm not in a river anymore, I'm in the ocean.

What happened? I think completely arbitrary associations with people were putting emotional "costs" on my experiences. Each arbitrary association was touching on something less arbitrary. When it happened enough times, I believe the associations stopped being arbitrary.

"Other people" is the ultimate reason why I think that my idea is true. Often I doubt myself: maybe my memories don't mean anything? Other times I feel like I didn't believe in it enough.

...

When a person dies, it's already maximally sad. You can't make it more or less sad.

But all this makes it so, so much worse. Imagine if, after the death of an author, all their characters died too (in their fictional worlds), and memories of the author and their characters died too. The ripples of death just never end and multiply. As if the same stupid thing repeats for the infinity-th time.

My options

For me, my theory is as complex as the statement "water is wet". But explaining it is hard.

I can't share my experience of people with you. But I could explain how you can explore your own experience. How you can turn your experience of people (or anything else) into simplified "gradients" and then learn to combine those gradients. I could explain how you can see "gradients" of theories and arguments. I believe it would make both of us smarter.

Please, help me with my topic. If nobody helps, I guess my options (right now) are those:

  • Try to explain "making/merging gradients" to someone on whatever examples. Videogame levels, chess positions, character analysis, arguments.
  • I can even try writing to people who experience synesthesia.
  • Create smaller posts here about subtopics of my idea. E.g. about hypotheses.

I'm afraid it's going to take a lot of time.

You may ask: "I thought you believe one can get smarter using this idea, so why don't you try it yourself?" Two reasons: (1) I don't know if I can figure out on my own how exactly gradients merge; (2) I don't know how much time it would take.

The most tiring thing is not to have unknown experiences, but to experience unknown value of unknown experiences. I feel like this world is a grey prison. In my dreams I see Freedom. But I don't know how to explain it to other prisoners. Even if I can show you a "green field" and an "open blue sky", why are those weird objects important? Why is Freedom supposed to be contained in such extremely specific things? Many very different groups of prisoners have long ago decided that the best thing we can do is to optimize our life in prison.

The idea of "gradients" gives me hope that I finally will be able to describe the meaning of Freedom directly. The thing I felt all my life. I always thought that the true Freedom lies in the ability to explore qualia with other people. But I didn't know how to communicate that. Is it too late now?


A very enigmatic essay. The ideas I extracted so far: the mind has a fundamental capacity for ranking possibilities according to some order; meaning, value, and rationality, despite their apparent differences, are all examples of this faculty of ranking in action; cognition consists of merging different rankings ("amalgamation of biases"); and discovering the rules of this amalgamation is the key to AI alignment.

I agree with this summary. The idea is that any human-like thinking (or experience) looks similar on all levels and in all domains, and that it's simple enough (open to introspection). If that's true, then there's some easy way to understand human values.

If nothing happens in the discussion of this post, my next post may be about a way to analyze human values (using the idea of "merging biases"). It will be way shorter. It will be inspired by Bodily autonomy, but my example won't be political.

My suspicion is that you have not figured out how to phrase this in terms of math, and that the mathematical language used in this post feels right to you but isn't doing anything useful. You can prove me wrong by writing a very short post that explains the math, as math that I would find useful, and sets up an example.

I don't have a model. The point of my idea is to narrow down what model is needed (and where/how we can easily find it). The point of the math language ("acausal trade" and "decision trees") is the same.

Everything mentioned in the post is like a container. A container may not model what's inside it at all, but it limits the number of places we need to check (in order to find what we want). If we don't easily find what we wanted by looking into the container (and a little bit around it), then my idea is useless.

Can anything besides useful math change your opinion in any way? I saw your post (Models Modeling Models, 1. Meanings of words):

When I say "I like dancing," this is a different use of the word 'like,' backed by a different model of myself, than when I say "I like tasting sugar." The model that comes to mind for dancing treats it as one of the chunks of my day, like "playing computer games" or "taking the bus." I can know what state I'm in (the inference function of the model) based on seeing and hearing short scenes. Meanwhile, my model that has the taste of sugar in it has states like "feeling sandpaper" or "stretching my back." States are more like short-term sensations, and the described world is tightly focused on my body and the things touching it.

I think my theory talks about the same things, but more and deeper. I want to try to prove that you can't rationally prefer your theory to mine.

This is the trippiest thing I've read here in a while: congratulations!

If you'd like to get some more concrete feedback from the community here, I'd recommend phrasing your ideas more precisely by using some common mathematical terminology, e.g. talking about sets, sequences, etc. Working out a small example with numbers (rather than just words) will make things easier to understand for other people as well.

I'm bad at math. But I know a topic where you could formulate my ideas using math. I could try to formulate them mathematically with someone's help.

I can give a very abstract example. It's probably oversimplified (in a wrong way) and bad, but here it is:

You've got three sets: A {9, 1}, B {5, -3}, and C {4, 4}. You want to learn something about the sets, or maybe explain why they're ordered A > C > B in your data. You make orders of those sets using some (arbitrary) rules. For example:

  1. A {9} > B {5} > C {4}. This order is based on choosing the largest element.
  2. A {10} > C {8} > B {2}. This order is based on adding elements.
  3. A {10} > C {8} > B {5}. This order is based on the rule: add the elements if the sum is bigger than the largest element; choose the largest element otherwise. It's a merge of the previous two orders.

If you want to predict A > C > B, you also may order the orders above:

  • (2) > (3) > (1). This order is based on predictive power (mostly) and complexity.
  • (2) > (1) > (3). This order is based on predictive power and complexity (complexity gives a bigger penalty).
  • (3) > (2) > (1). This order is based on how large the numbers in the orders are.

This example is likely useless out of context. But you've read the post: so, if there's something you haven't understood just because it was confusing without numbers, this example should clarify something for you. For example, it may clarify what my post is missing to be understandable/open to specific feedback.
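The abstract example above can be written out directly. A minimal sketch, where the function names ("largest", "total", "merged") are my own labels for the three ordering rules:

```python
# The three "biases" (ordering rules) from the toy example above,
# applied to the sets A {9, 1}, B {5, -3}, C {4, 4}.

sets = {"A": [9, 1], "B": [5, -3], "C": [4, 4]}

def largest(xs):   # rule 1: score a set by its largest element
    return max(xs)

def total(xs):     # rule 2: score a set by the sum of its elements
    return sum(xs)

def merged(xs):    # rule 3: take the sum if it beats the largest element
    return max(max(xs), sum(xs))

def rank(score):
    """Order set names from highest to lowest score under a given bias."""
    return sorted(sets, key=lambda name: score(sets[name]), reverse=True)

print(rank(largest))  # ['A', 'B', 'C']  (scores 9 > 5 > 4)
print(rank(total))    # ['A', 'C', 'B']  (scores 10 > 8 > 2)
print(rank(merged))   # ['A', 'C', 'B']  (scores 10 > 8 > 5)
```

Rules 2 and 3 both predict the target order A > C > B; ranking the rules themselves (by predictive power, complexity, etc.) would be the second-level ordering the example describes.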

If you'd like to get some more concrete feedback from the community here, I'd recommend phrasing your ideas more precisely by using some common mathematical terminology, e.g. talking about sets, sequences, etc.

"No math, no feedback" if this is an irrational requirement it's gonna put people at risk. Do you think there isn't any other way to share/evaluate ideas? For example, here're some notions:

  • On some level our thoughts do consist of biases. See "synaptic weight". My idea says that "biases" exist on (almost) all levels of thinking and those biases are simple enough/interpretable enough. Also it says that some "high-level thinking" or "high-level knowledge" can be modeled by simple enough biases.
  • You could compare my theory to other theories. To Shard Theory, for example. I mean, just to make a "map" of all theories: where each theory lies relative to the others. Shard Theory says that value formation happens through complex enough negotiation games between complex enough objects (shards). My theory says that all cognition happens because of a simpler process between simpler objects.

I think it would be simply irrational to abstain from having any opinions about those notions. Do you believe there's something simpler (and more powerful) than Shard Theory? Do you believe that human thinking and concepts are intrinsically complex and (usually) impossible to simplify? Etc.

A rational thing would be to say your opinions about this and say what could affect those opinions. You already said about math, but there should be some other things too. Simply hearing some possibilities you haven't considered (even without math) should have at least a small effect on your estimates.

Just to engage a bit this idea:

There's a popular order: opinion, size, physical quality or shape, age, colour, origin, material, purpose. What created this order? I don't know, but I know that certain biases could make it easier to understand.

How would you model the fact that non-English languages use different orders?

This example isn't supposed to be a falsifiable model. The example is supposed to make you think: maybe things that look like complicated formal rules are actually made up of "biases" that are simple enough and don't directly encode any formal rules. This may be a possibility that you never considered (or maybe you did).

So I would try to model another order the same way, but using different biases. However, I've read that the order isn't unique to English.

Love this! 


We're working on something related / similar:
https://forum.effectivealtruism.org/posts/FnviTNXcjG2zaYXQY/how-to-store-human-values-on-a-computer 

This was cross-posted in less wrong, as well, where it's received a lot of criticism:
https://www.lesswrong.com/posts/rt2Avf63ADbzq9SuC/how-to-store-human-values-on-a-computer 
 

If this idea isn't ruled out, then it may be the simplest possible explanation (of the way humans create/think about concepts). I think there isn't a lot of such ideas.
