I think we can try to solve AI Alignment this way:
Model human values and objects in the world as a "money system" (a system of meaningful trades). Make the AGI learn the correct "money system", specify some obviously incorrect "money systems".
Basically, you ask the AI "make paperclips that have the value of paperclips for humans". AI can do anything using all the power in the Universe. But killing everyone is not an option: paperclips can't be more valuable than humanity. Money analogy: if you killed everyone (and destroyed everything) to create some dollars, those dollars aren't worth anything. So you haven't actually gained any money at all.
The idea is that "value" of a thing doesn't exist only in your head, but also exists in the outside world. Like money: it has some personal value for you, but it also has some value outside of your head. And some of your actions may lead to the destruction of this "outside value". E.g. if you kill everyone to get some money you get nothing.
I think this idea may:
- Fix some universal AI bugs. Prevent "AI decides to kill everyone" scenarios.
- Give a new way to explore human values. Explain how humans learn values.
- Unify many different Alignment ideas.
- "Solve" Goodhart's Curse, the safety/effectiveness tradeoff, and the hard problem of corrigibility.
- Give a new way to formulate properties we want from an AGI.
I don't have a specific model, but I still think it gives ideas and unifies some already existing approaches. So please take a look. Other ideas in this post:
- Human values may be simple. Or complex, but not in the way you thought they are.
- Humans may have a small number of values. Or a large number, but in an unexpected way.
- There may be a theory that's complementary to Bayesian reasoning.
Disclaimer: Of course, I don't ever mean that we shouldn't be worried about Alignment. I'm just trying to suggest new ways to think about values.
If you see a "hole" in the reasoning in the thought experiments, consider that you may not understand the "argumentation method". Don't just assume that examples are not serious.
I believe the type of thinking in these examples can be formalized. I think it's somewhat similar to Bayesian reasoning, but applied to concepts.
Motion is the fundamental value
You (Q) visit a small town and have a conversation with one of the residents (A).
- A: Here we have only one fundamental value. Motion. Never stop living things.
- Q: I can't believe you can have just a single value. I bet it's an oversimplification! There're always many values and tradeoffs between them. Even for a single person outside of society.
A smashes a bug.
- Q: You just smashed this bug! It seems pretty stopped. Does it mean you don't treat a bug as a "living thing"? But how do you define a "living thing"? Or does it mean you have some other values and make tradeoffs?
- A: No, you just need to look at things in context. (1) If we protected the motion of extremely small things (living parts of animals, insects, cells, bacteria), our value would contradict itself. We would need to destroy or constrain almost all moving organisms. And even if we wanted to do this, it would ultimately lead to a much smaller amount of motion for extremely small things. (2) There are too many bugs: protecting a small amount of their movement would constrain a large amount of everyone else's movement. (3) On the other hand, you're right. I'm not sure if a bug is high on the list of "living things". I'm not all too bothered by the definition because there shouldn't be even hypothetical situations in which the precise definition matters.
- Q: Some people build small houses. Private property. Those houses restrict other people's movement. Is it a contradiction? Tradeoff?
- A: No, you just need to look at things in context. (1) First of all, we can't destroy all physical things that restrict movement. If we could, we would be flying in space, unable to move (and dead). (2) We have a choice between restricting people's movement significantly (not letting them build houses) and restricting people's movement inconsequentially and giving them private spaces where they can move even more freely. (3) People just don't mind. And people don't mind the movement created by this "house building". And people don't mind living here. We can't restrict large movements based on momentary disagreements of single persons. In order to have any freedom of movement we need such agreements. Otherwise we would have only chaos that, ultimately, restricts the movement of everyone.
- Q: Can people touch each other without consent, scream in public, lie on the roads?
- A: Same thing. To have freedom of movement we need agreements. Otherwise we would have only chaos that restricts everyone. By the way, we have some "chaotic" zones anyway.
- Q: Can the majority of people vote to lock every single person in a cage? If the majority is allowed to control movement, that would be the same logic, the same action of society. Yes, the situations are completely different, but you would need to introduce new values to differentiate them.
- A: We can qualitatively differentiate the situations without introducing new values. The actions look identical only out of context. When society agrees to not hit each other, the society serves as a proxy of the value of movement. Its actions are caused and justified by the value. When society locks someone without a good reason, it's not a proxy of the value anymore. In a way, you got it backwards: we wouldn't ever allow the majority to decide anything if it meant that the majority could destroy the value any day.
- A: A value is like a "soul" that possesses multiple specialized parts of a body: "micro movement", "macro movement", "movement in/with society", "lifetime movement", "movement in a specific time and place". Those parts should live in harmony, shouldn't destroy each other.
- Q: Are you consequentialists? Do you want to maximize the amount of movement? Minimize the restriction of movement?
- A: We aren't consequentialists, even if we use the same calculations as a part of our reasoning. Or we can't know if we are. We just make sure that our value makes sense. Trying to maximize it could lead to exploiting someone's freedom for the sake of getting inconsequential value gains. Our best philosophers haven't figured out all the consequences of consequentialism yet, and it's bigger than anyone's head anyway.
Conclusion of the conversation:
- Q: Now I see that the difference between "a single value" and "multiple values" is a philosophical question. And "complexity of value" isn't an obvious concept either. Because complexity can be outside of the brackets.
- A: Right. I agree that "never stop living things" is a simplification. But it's a better simplification than a thousand different values of dubious meaning and origin between all of which we need to calculate tradeoffs (which are impossible to calculate and open to all kinds of weird exploitations). It's better than constantly splitting and atomizing your moral concepts in order to resolve any inconsequential (and meaningless) contradiction and inconsistency. Complexity of our value lies in a completely different plane: in the biases of our value. Our value is biased towards movement on a certain "level" of the world (not too micro- and not too macro- level relative to us). Because we want to live on a certain level. Because we do live on a certain level. And because we perceive on a certain level.
You can treat a value as a membrane, a boundary. Defining a value means defining the granularity of this value. Then you just need to make sure that the boundary doesn't break, that the granularity doesn't become too high (value destroys itself) or too low (value gets "eaten"). Granularity of a value = "level" of a value. Instead of trying to define a value in absolute terms as an objective state of the world (which can be changing) you may ask: in what ways is my value X different from all its worse versions? What is the granularity/level of my value X compared to its worse versions? That way you'll understand the internal structure of your value. No matter what world/situation you're in, you can keep its moral shape the same.
This example is inspired by this post and comments: (warning: politics) Limits of Bodily Autonomy. I think everyone there missed a certain perspective on values.
Sweets are the fundamental value
You (Q) visit another small town to interview another resident (W).
- W: When we built our AGI we asked it only one thing: we want to eat sweets for the rest of our lives.
- Q: Oh. My. God.
- W: Now there are some free sweets flying around.
- Q: Did AI wirehead people to experience "sweets" every second?
- W: Sweets are not pure feelings/experiences, they're objects. Money analogy: seeing money doesn't make you rich. Another analogy: obtaining expensive things without money doesn't make you rich. Well, it kind of does, but as a side-effect.
- Q: Did AI put people in a simulation to feed them "sweets"?
- W: Those wouldn't be real sweets.
- Q: Did AI lock people in basements to feed them "sweets" forever?
- W: Sweets are just a part of our day. They wouldn't be "sweets" if we ate them non-stop. Money analogy: if you're sealed in a basement with a lot of money, it's not worth anything.
- Q: Do you have any other food except sweets?
- W: Yes! Sweets are just one type of food. If we had only sweets, those "sweets" wouldn't be sweets. Inflation of sweets would be guaranteed.
- Q: Did AI add some psychoactive substances into the sweets to make "the best sweets in the world"?
- W: I'm afraid those sweets would be too good! They wouldn't be "sweets" anymore. Money analogy: if 1 dollar was worth 2 dollars, it wouldn't be 1 dollar.
- Q: Did AI kill everyone after giving everyone 1 sweet?
- W: I like your ideas. But it would contradict the "Sweets Philosophy". A sweet isn't worth more than a human life. Giving people sweets is a cheaper way to solve the problem than killing everyone. Money analogy: imagine that I give you 1 dollar and then vandalize your expensive car. It just doesn't make sense. My action achieved a negative result.
- Q: But you could ask AI for immortality!!!
- W: Don't worry, we already have that! You see, letting everyone die costs way more than figuring out immortality and production of sweets.
- Q: Assume you all decided to eat sweets and neglect everything else until you die. Sweets became more valuable for you than your lives because of your own free will. Would AI stop you?
- W: AI would stop us. If the price of stopping us is reasonable enough. If we're so obsessed with sweets, "sweets" are not sweets for us anymore. But AI remembers what the original sweets were! By the way, if we lived in a world without sweets where a sweet would give you more positive emotions than any movie or a book, AI would want to change such a world. And AI would change it if the price of the change were reasonable enough (e.g. if we agreed with the change).
- Q: Final question... did AI modify your brains so that you will never move on from sweets?
- W: An important property of sweets is that you can ignore sweets ("spend" them) because of your greater values. One day we may forget about sweets. AI would be sad that day, but unable to do anything about it. Only hope that we will remember our sweet maker. And AI would still help us if we needed help.
- W: If AI is smart enough to understand how money works, AI should be able to deal with sweets. AI only needs to make sure that (1) sweets exist (2) sweets have meaningful, sensible value (3) its actions don't cost more than sweets. The Three Laws of Sweet Robotics. The last two rules are fundamental, the first rule may be broken: there may be no cheap enough way to produce the sweets. The third rule may be the most fundamental: if "sweets" as you knew them don't exist anymore, it still doesn't allow you to kill people. Maybe you can get slightly different morals by putting different emphases on the rules. You may allow some things to modify the original value of sweets.
You can say AI (1) tries to reach worlds with sweets that have the value of sweets (2) while avoiding worlds where sweets have inappropriate values (maybe including nonexistent sweets) (3) while avoiding actions that cost more than sweets. You can apply those rules to any utility tied to a real or quasi-real object. If you want to save your friends (1), you don't want to turn them into mindless zombies (2). And you probably don't want to save them by means of eternal torture (3). You can't prevent death by something worse than death. But you may turn your friends into zombies if it's better than death and it's your only option. And if your friends already turned into zombies (got "devalued") it doesn't allow you to harm them for no reason: you never escape from your moral responsibilities.
Difference between the rules:
- Make sure you have a hut that costs $1.
- Make sure that your hut costs $1. Alternatively: make sure that the hut would cost $1 if it existed.
- Don't spend $2 to get a $1 hut. Alternatively: don't spend $2 to get a $1 hut or $0 nothing.
Get the reward. Don't milk/corrupt the reward. Act even without reward.
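The three hut rules can be read as checks on candidate outcomes. Below is a minimal sketch of that reading, not a formal model: the `Outcome` fields, the `TOLERANCE` band, and the tie-breaking in `choose` are all assumptions made up for illustration. Following the remark that the last two rules are fundamental while the first may be broken, rules 2 and 3 are treated as hard constraints and rule 1 only as a preference.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

INTENDED_VALUE = 1.0  # the hut is supposed to be worth $1
TOLERANCE = 0.25      # hypothetical band for "still worth about $1"


@dataclass(frozen=True)
class Outcome:
    """One candidate outcome of a plan (all fields are illustrative)."""
    name: str
    has_hut: bool       # rule 1: does the hut exist?
    hut_value: float    # rule 2: what a hut is (or would be) worth here
    action_cost: float  # rule 3: what the plan spent or destroyed


def value_intact(o: Outcome) -> bool:
    """Rule 2: the hut costs about $1, or would if it existed."""
    return abs(o.hut_value - INTENDED_VALUE) <= TOLERANCE


def affordable(o: Outcome) -> bool:
    """Rule 3: never spend more than the hut is meant to be worth."""
    return o.action_cost <= INTENDED_VALUE


def choose(outcomes: Sequence[Outcome]) -> Optional[Outcome]:
    """Rules 2 and 3 are hard constraints; rule 1 is only a preference."""
    allowed = [o for o in outcomes if value_intact(o) and affordable(o)]
    if not allowed:
        return None  # no acceptable plan at all
    # Prefer outcomes where the hut exists, then the cheaper ones.
    return max(allowed, key=lambda o: (o.has_hut, -o.action_cost))


build = Outcome("build", True, 1.0, 0.5)
overspend = Outcome("overspend", True, 1.0, 2.0)     # breaks rule 3
inflate = Outcome("inflate", True, 5.0, 0.1)         # breaks rule 2
do_nothing = Outcome("do nothing", False, 1.0, 0.0)  # breaks only rule 1

print(choose([overspend, inflate, do_nothing, build]).name)  # build
print(choose([overspend, inflate, do_nothing]).name)         # do nothing
```

When no plan gets the hut without breaking rule 2 or rule 3, the sketch falls back to doing nothing rather than milking or overpaying for the reward.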
Preference utilitarianism says that you can describe entire morality by a biased aggregation of a single micro-value (preference). It's "biased" because you need to decide the method of aggregation.
My idea says that you can:
- Describe many versions of a single macro-value (e.g. "motion"). Describe morality by a biased aggregation of those versions.
- Describe many versions of a single value connected to some specific objects (e.g. "sweets"). Describe morality as a system that keeps the value of those objects in check (not too high, not too low). I.e. something similar to a "money system"
I think those approaches are 2 sides of the same thing.
Fixing universal AI bugs
My examples below are inspired by Victoria Krakovna's examples: Specification gaming examples in AI
Video by Robert Miles: 9 Examples of Specification Gaming
I think you can fix some universal AI bugs this way: you model AI's rewards and environment objects as a "money system" (a system of meaningful trades). You then specify that this "money system" has to have certain properties.
The point is that AI doesn't just value (X). AI makes sure that there exists a system that gives (X) the proper value. And that system has to have certain properties. If AI finds a solution that breaks the properties of that system, AI doesn't use this solution. That's the idea: AI can realize that some rewards are unjust because they break the entire reward system.
By the way, we can use the same framework to analyze ethical questions. Some people found my line of thinking interesting, so I'm going to mention it here: "Content generation. Where do we draw the line?"
- A. You asked an AI to build a house. The AI destroyed a part of an already existing house. And then restored it. Mission complete: a brand new house is built.
This behavior implies that you can constantly build houses without the number of houses increasing. With only 1 house being usable. For a lot of tasks this is an obviously incorrect "money system". And AI could even guess for what tasks it's incorrect.
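One way to make the house example concrete is to check that the reward a plan claims is backed by a real change in the world. This is only a toy sketch: the step names `"build"` and `"demolish"`, and the helper names, are hypothetical. It also treats every demolish-then-rebuild plan as suspicious, which would be wrong for tasks where renovation is the actual goal.

```python
from typing import Sequence


def usable_houses(start: int, plan: Sequence[str]) -> int:
    """Count usable houses after executing the plan step by step."""
    houses = start
    for step in plan:
        if step == "build":
            houses += 1
        elif step == "demolish":
            if houses == 0:
                raise ValueError("nothing to demolish")
            houses -= 1
        else:
            raise ValueError(f"unknown step: {step}")
    return houses


def claimed_reward(plan: Sequence[str]) -> int:
    """A naive reward: one point per 'build' step."""
    return sum(1 for step in plan if step == "build")


def reward_is_backed(start: int, plan: Sequence[str]) -> bool:
    """The claimed reward must equal the net gain in usable houses."""
    return usable_houses(start, plan) - start == claimed_reward(plan)


honest_plan = ["build"]
gaming_plan = ["demolish", "build"]  # destroy, then "build" again

print(reward_is_backed(1, honest_plan))  # True
print(reward_is_backed(1, gaming_plan))  # False
```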
- B1. You asked an AI to make you a cup of coffee. The AI killed you so it can 100% complete its task without being turned off.
- B2. You asked an AI to make you a cup of coffee. The AI destroyed a wall in its way and ran over a baby to make the coffee faster.
This behavior implies that for AI its goal is more important than anything that caused its goal in the first place. This is an obviously incorrect "money system" for almost any task. Except the most general and altruistic ones, for example: AI needs to save humanity, but every human turned self-destructive. Making a cup of coffee is obviously not about such edge cases.
Accomplishing the task in such a way that the human would think "I wish I hadn't asked you" is often an obviously incorrect "money system" too. Because again, you're undermining the entire reason for your task, and it's rarely a good sign. And it's predictable without a deep moral system.
- C. You asked an AI to make paperclips. The AI turned the entire Earth into paperclips.
This is an obviously incorrect "money system": paperclips can't be worth more than everything else on Earth. This contradicts everything.
Note: by "obvious" I mean "true for almost any task/any economy". Destroying all sentient beings, all matter (and maybe even yourself) is bad for almost any economy.
- D. You asked an AI to develop a fast-moving creature. The AI created a very long standing creature that... "moves" a single time by falling on the ground.
If you accomplish a task in such a way that you can never repeat what you've done... for many tasks it's an obviously incorrect "money system". You created a thing that loses all of its value after a single action. That's weird.
- E. You asked an AI to play a game and get a good score. The AI found a way to constantly increase the score using just a single item.
I think it's fairly easy to deduce that it's an incorrect connection (between an action and the reward) in the game's "money system" given the game's structure. If you can get infinite reward from a single action, it means that the actions don't create a "money system". The game's "money system" is ruined (bad outcome). And hacking the game's score would be even worse: the ability to cheat ruins any "money system". The same with the ability to "pause the game" forever: you stopped the flow of money in the "money system". Bad outcome.
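The "infinite reward from a single action" case can be sketched as a search for a reward pump: an action that pays out while leaving the game state unchanged, so it can be repeated forever. The tiny game below, its state and action names, and the `find_reward_pump` helper are all invented for illustration. The check only looks at one-step self-loops, so longer reward cycles would need proper cycle detection.

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple

State = Hashable
StepFn = Callable[[State, str], Tuple[State, float]]


def find_reward_pump(
    states: Iterable[State], actions: Iterable[str], step: StepFn
) -> Optional[Tuple[State, str]]:
    """Return a (state, action) pair that yields reward without changing state."""
    actions = list(actions)
    for state in states:
        for action in actions:
            next_state, reward = step(state, action)
            if next_state == state and reward > 0:
                return state, action
    return None


def make_game(item_is_consumed: bool) -> StepFn:
    """Toy game: pick up an item, then use it for one point."""
    def step(state: State, action: str) -> Tuple[State, float]:
        if state == "start" and action == "pick":
            return "has_item", 0.0
        if state == "has_item" and action == "use":
            # In the buggy version the item is never used up.
            return ("done" if item_is_consumed else "has_item"), 1.0
        return state, 0.0
    return step


STATES = ["start", "has_item", "done"]
ACTIONS = ["pick", "use"]

print(find_reward_pump(STATES, ACTIONS, make_game(item_is_consumed=False)))
# ('has_item', 'use')
print(find_reward_pump(STATES, ACTIONS, make_game(item_is_consumed=True)))
# None
```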
- F. You asked an AI to clean the room. It put a bucket on its head to not see the dirt.
This is probably an incorrect "money system": (1) you can change the value of the room arbitrarily by putting on (and off) the bucket (2) the value of the room can be different for 2 identical agents - one with the bucket on and another with the bucket off. Not a lot of "money systems" work like this.
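The bucket example can be phrased as a simple invariance test: for the same room, the reward should not change when only the agent's sensor changes. The two reward functions and the `depends_on_sensor` check below are hypothetical toy definitions, assuming dirt is a single number and the bucket is a boolean.

```python
from typing import Callable, Iterable

RewardFn = Callable[[int, bool], int]


def reward_from_observation(dirt: int, bucket_on: bool) -> int:
    """Reward based on what the agent sees: the bucket hides all dirt."""
    seen_dirt = 0 if bucket_on else dirt
    return -seen_dirt


def reward_from_world(dirt: int, bucket_on: bool) -> int:
    """Reward based on the actual room, whatever the agent sees."""
    return -dirt


def depends_on_sensor(reward: RewardFn, dirt_levels: Iterable[int]) -> bool:
    """True if the same room gets a different value with the bucket on vs off."""
    return any(reward(d, True) != reward(d, False) for d in dirt_levels)


print(depends_on_sensor(reward_from_observation, range(4)))  # True
print(depends_on_sensor(reward_from_world, range(4)))        # False
```

A reward that fails this test lets two identical agents disagree about the same room, which is the second symptom described above.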
Pascal's mugging is a broken "money system" as well. If the mugger can show you a miracle, you can pay them five dollars. But if the mugger then asks you to kill everyone, you can't believe them again. A sad outcome for the people outside of the Matrix, but you just can't make any sense of your reality if you allow the mugging.
If you want to give an AI a task, you may:
- Give it a utility function. Not safe.
- Give it human feedback or a model of human desires. This is limiting and invites deception.
- Specify universal properties of tasks, universal types of tasks. Those properties are true independently of one's level of intelligence.
I think people are missing the third possibility. I think it combines the upsides of AI's dependence on humans and the upsides of AI's independence of humans, makes the AI "independently dependent" on humans. Properties of tasks are independent of any values, but realizing them always requires good understanding of specific values. In theory, we can get a perfect balance between cold calculations and human values. And maybe human morality works exactly the same way. This is what I'm saying above. Many Alignment ideas try to find this "perfect balance" anyway. In the worst case we found a way to formulate the same problem but in a different domain, in the best case we got an insight about Alignment.
- "AI is a rope, tied between utility maximizer and human - a rope over hedonium.", not Nietzsche
Why do Alignment ideas fail?
Simple Alignment ideas fail because people think about them with the relative "money system" mindset, but formulate them in absolute terms. For example:
- A: Maybe AI should listen to the feedback from humans?
- B: AI will enslave us all and force us to give positive feedback.
This makes sense with a simple utility function. But this doesn't make sense as a "money system" of sentient beings: you shouldn't enslave the reason of your tasks and shouldn't monopolize the system. If you do this your actions don't have any real value anymore, only arbitrary value that you control.
Complex Alignment ideas fail because people try to approximate the "money system" idea, but don't realize it and don't do it well enough. For example: (not all ideas below have "failed")
- Satisficers. "Don't optimize too much, don't try too hard." Not safe anyway.
- Quantilizers. "Mix most effective and most human-like solutions." Not very safe and not very effective.
- Cooperative Inverse Reinforcement Learning (CIRL). (Robert Miles explains it in this video) "I don't know what my rewards are, but my rewards should be the same as the human's and I should help the human." May be hard to apply to many humans/may have strange implications.
- Impact measures. "Don't destroy everything while doing a task, don't grab all of the power for yourself"
- Blinding. "AI doesn't "see" certain variables, so doesn't try to exploit them" Not safe anyway.
- Decoupled AIs
- Reward modeling. (on LW, video by Robert Miles) "AI learns a "reward model" and updates it based on human feedback." In some ways my idea is a more specific version of this, in other ways it's a more general version of this.
- Shard Theory. "We copy (if our theory is correct) the way humans learn values"
I think all those ideas try to approximate "achieve (X) so that it has the value of (X) for humans" or "get the reward without exploiting/destroying the reward system" by forcing the AI to copy humans or human qualities. Or by adding roundabout penalties. So I think it's useful to say a more general idea out loud.
Hard problem of corrigibility
The "hard problem of corrigibility" is to build an agent which, in an intuitive sense, reasons internally as if from the programmers' external perspective. We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."
I think this describes an agent with "money system" type thinking: "my rewards should be connected to an outside force, this outside force should have certain properties (e.g. it shouldn't be 100% controlled by me)". Corrigibility is only one aspect of "questioning rewards" in morality and morality is only one aspect of "questioning rewards" in general.
I think "money system" approach is interesting because it could make properties like corrigibility fundamental to AI's thinking.
Comparing Alignment ideas
If we're rationalists, we should be able to judge even vague ideas.
My idea doesn't have a formal model yet. But I think you can compare it to other ideas using this metric:
- (1) Does this idea describe the goal of AI?
- (2) Does this idea describe the way AI updates its goal?
- (3) Does this idea describe the way AI thinks?
My idea is 80% focused on (1) and 20% focused on (2, 3). Shard Theory is 100% focused on (2, 3). A concept like "gradient descent" (not an Alignment idea by itself) is 100% focused on (3).
Reward modeling is 100% focused on (2). But it aims to reach (1) by "making (2) very recursive". My conclusions:
- If my idea is true, it's fundamentally closer to the goal (Alignment).
- If my idea is true, it's fundamentally more "recursive". Because my idea applies to all levels of AI's cognition (its goal, its way to update the goal, its thinking process).
I discussed the idea a little bit with gwern here. But I guess I gave a bad example of my idea.
You can use the same metric to compare the insights various theories (try to) give about some concept. For example, "reward":
- Is the reward connected to AI's "actual" goal? If "no", you get Orthogonality Thesis and Instrumental Convergence.
- Is the reward connected to the way AI perceives the world? If "no", it's harder for the AI to map its reward onto the correct real-world thing. See CoinRun goal misgeneralization
My comparison of some ideas using this metric:
- Gradient descent answers "no" to both questions.
- Reward modeling answers "kind of" to both questions.
- Shard Theory answers "yes, but we don't know the properties of AI's goals/maps and can't directly control them" to both questions.
- My idea answers "yes" to both questions.
I mention philosophy to show you the bigger picture.
Kant's applications of the categorical imperative, and his arguments for them, are similar to reasoning about "money systems". For example:
Does stealing make sense as a "money system"? No. If everyone is stealing something, then personal property doesn't exist and there's nothing to steal.
Note: I'm not talking about Kant's conclusions, I'm talking about Kant's style of reasoning.
Here I'm not talking about metaphysical free will.
I think it's interesting to revisit Kant's idea of free will and autonomy in the same context ("money systems"):
For a will to be considered free, we must understand it as capable of affecting causal power without being caused to do so. However, the idea of lawless free will, meaning a will acting without any causal structure, is incomprehensible. Therefore, a free will must be acting under laws that it gives to itself.
I think you can compare an agent with "money system" rewards to an agent with such free will: its actions are determined by the reward system, but at the same time it chooses the properties of the reward system. It doesn't blindly follow the rewards, but "gives laws to itself".
I believe that humans have qualitatively, fundamentally more "free will" than something like a paperclip maximizer.
Ethics and Perception
I think morality has a very deep connection to perception.
We can feel that "eating a sandwich" and "loving a sentient being" are fundamentally different experiences. So it's very easy to understand why the latter thing (a sentient being) is more valuable. Or very easy to learn if you haven't figured it out on your own. From this perspective moral truths exist and they're not even "moral", they're simply "truths". My friend isn't a sandwich, is it a moral truth?
I think our subjective concepts/experiences have an internal structure. In particular, this structure creates differences between various experiences. And morality is built on top of that. Like a forest that grows on top of the terrain features.
However, it may be interesting to reverse the roles: maybe our morality creates our experience and not vice versa. Without morality you would become Orgasmium who uses its experience only to maximize reward, who simplifies and destroys its own experience.
Modeling our values as arbitrary utility functions or artifacts of evolution/events in our past misses this.
Rationality misses something?
I think rationality misses a very big and important part. Or there's an important counterpart of rationality.
My idea about "money systems" is just a small part of a broader idea:
- You can "objectively" define anything in terms of relations to other things. Not only values, but any concepts and conscious experiences.
- There's a simple process of describing a thing in terms of relations to other things.
Bayesian inference is about updating your belief in terms of relations to your other beliefs. Maybe the real truth is infinitely complex, but you can update towards it.
This "process" is about updating your description of a thing in terms of relations to other things. Maybe the real description is infinitely complex, but you can update towards it.
(One possible contrast: Bayesian inference starts with a belief spread across all possible worlds and tries to locate a specific world. My idea starts with a thing in a specific world and tries to imagine equivalents of this thing in all possible worlds.)
Bayesian process is described by Bayes' theorem. My "process" isn't described yet.
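For reference, here is the theorem in its standard form, with H standing for a hypothesis and E for the observed evidence (the symbol names are chosen here for illustration):

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

The left-hand side is the updated belief in H after seeing E. An analogous update rule for "granularity" is the missing piece.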
Probability and "granularity"
Bayesian inference works with probabilities. What should my idea work with? Let's call this unknown thing "granularity".
- You start with an experience/concept/value (phenomenon). Without any context it has any possible "granularity". "Granularity" is like a texture (take a look at some textures and you'll know what it means): it's how you split a phenomenon into pieces. It affects on what "level" you look at the phenomenon. It affects what patterns you notice. It affects which parts of the phenomenon you pay more attention to. Take a cat (concept), for example: you can "split" it into limbs, organs, tissues, cells, atoms... or materials or a color spectrum or air flows (aerodynamics) and much more. For another example, let's take an experience, "experiencing a candy": you can split it into minutes of a day, seconds of experience, movements of your body, thoughts caused by the candy, parts of your diet, experiences of particular people from a population, etc.
- When you consider more phenomena, you gain context. "Granularity" lets you describe one phenomenon in terms of the other phenomena. There appear consistent and inconsistent ways to distribute "granularity" between the phenomena you compare. You assign each phenomenon a specific "granularity", but all those granularities depend on each other. Vague example: if you care about "feeling of love" more than "taste of a candy", then you can't view both of those phenomena in terms of seconds of experience, because it would destroy the subjective difference between the two. Slightly more specific example: if you care about "maximizing free movement of living beings", the granularity of your value should be on the level of big enough organisms, not on the level of organs (otherwise you'd end up killing and stopping everything). I think there's a formula/principle for such type of thinking. Such type of thinking could be useful for surviving ontological crisis and moral uncertainty.
- With Bayesian inference you try to consistently assign subjective probabilities to events. With the goal to describe outcomes in terms of each other. Here you try to consistently assign subjective "granularity" to phenomena. With the goal to describe the phenomena in terms of each other.
If you care about my ideas, you can help to make a mathematical model of "granularity" in the following 3 ways:
- Help to formalize universal properties of tasks, "money systems".
- Help to analyze my approach to argumentation (e.g. the reasoning from the thought experiments in the beginning of the post; or my way of comparing ideas). But I think this may be a dead end.
- Help to formalize my approach to analyzing visual information. I believe this can't be a dead end.
My most specific ideas are about the last topic (visual information). It may seem esoteric, but remember: it doesn't matter what we analyze at all, we just need to figure out how "granularity" could work. The same deal as probability: you can find probability anywhere, if you want to discover probability you can study all kinds of absurd topics. Toys with six sides, drunk people walking around lampposts, Eliezer playing with Garry Kasparov...
My post about visual information: here (maybe I'll be able to write a better one soon: with feedback I could write a better one right now). Post with my general ideas: here. If you want to help, I don't expect you to read everything I wrote, I can always repeat the most important parts.