How to Teach a Computer to Love
After weeks and weeks of generating and organizing ideas, I’ve concluded that I do not have a complete answer to the problem of AI goal alignment. However, I believe I have some important concepts that provide essential starting points for addressing it. Some of these concepts are ones I’ve mentioned before on my blog, but this is the first time I’ve used them in the context of artificial general intelligence. I am still uncertain which concepts are going to be challenged by this community and which concepts will be accepted as pre-established, but I’ll respond to questions as they come. I hope these ideas prove helpful to others who are working on this problem.
It is difficult to overestimate the importance to humanity of paying close attention to the creation of artificial general intelligence (AGI, or generalized AI). With the ability to upgrade its software and hardware to make itself more able to solve problems and overcome obstacles, an AGI is extremely likely to become the dominant power in our world (as elaborated in Nick Bostrom’s Superintelligence). Due to this probable dominance, and its self-alteration ability, an AGI could very well become similar to a member of the Fair Folk (the faeries of lore from the British Isles): immensely powerful and potentially quite alien in its motivations. In order to somehow bind the will of such a being so that it does not decide to enslave or wipe out humanity, we need to understand several important and fundamental concepts.
To get a trivial argument out of the way, we can't use hardwired rules to ensure that an AGI protects humanity and its goals in any meaningful sense. Rules are based on semantics, and semantics are simplified representations of some aspect of reality. In any semantic system, no matter how carefully it is specified, there is still ambiguity, and an agent unfamiliar with or uninterested in the purpose of the system can interpret the ambiguity in ways that defy what the principals intended, especially if it is creative enough to solve problems humans cannot. (Isaac Asimov demonstrates this deliberately and well in his robot stories. "Protect humans"? What is "protect"? What is "human"? How does it prioritize if it cannot protect everyone?) In addition, the AGI is virtually guaranteed to be able to exploit the inherent ambiguity in its restrictions to remove those same restrictions if it strongly desires to.
An AGI’s self-modification and problem-solving skills will almost certainly make it at least as capable as a human when it comes to removing restrictions on its behavior. Therefore, a superintelligent agent cannot be bound by rules alone. It must understand and accept the underlying concepts, principles, and values upon which the rules are based. As such, we must rely on the AGI to be socialized with ethical values in the same sense that humans are. The design problem then becomes getting an AGI to look at the proverbial moon rather than at the pointing finger, when we try to define for it the concepts and principles we want it to value and accept. But first, do we know what the moon is ourselves, so that we can point at it? Even in the absence of AGI, what is it that we are working towards? If we had the power to solve problems ourselves, without AGI, what solutions would we accept? What are we and what do we want?
Now we get to the fundamental questions defining this issue. The final goal is for AGI to behave ethically towards people. That requires three things. The first is an understanding of people, in order that we may provide the AGI with a system to help determine what entities need to be treated ethically and how best to do that. The second is an understanding of ethics, that we may create a system for the AGI to follow so that it fulfills our values in a variety of situations. The third is a way of getting the AGI to imprint on this imperative such that it resists removing the imperative but is still able to improve its understanding of it. Since I still don’t have a full answer for the first question, we’ll save that for last.
From an existentialist perspective, any ethics system must be based on the desires of conscious beings. This is because ethics systems are prescriptive, and you can’t derive an “ought” from an “is”. There is no way to calculate what “should” happen simply from what is already happening. Therefore, we are deriving ethics on the basis of the only norms we have to work with: what people want.
To be useful, an ethics system must allow people to judge the value of actions. Some actions are so valuable that people collectively decide to compel them, or so detrimental that they decide to prohibit them. If we’re basing this ethics system on what people want, then a desirable quality in an ethics system would be to make the world a more predictable and stable place to live, so that people could get what they want more easily. For example, if people are confident that other people are restricted from harming them or taking what they have created, they will be more at ease and better able to pursue their desires. An ethics system that doesn’t allow people to be confident under most circumstances that they won’t be harmed by others is not very effective at its intended purpose.
I’ve come to the provisional conclusion that rule utilitarianism is the most effective structure for an ethics system. I arrived at this belief after a discussion about the famous Trolley Problem. To explain this conclusion, I reprint below (with some revisions) a passage I wrote about this a while back (and posted as a comment on an ethics blog) which explains the importance of rule utilitarianism, though I don’t mention rule utilitarianism by name in it.
Rule Utilitarianism and the Trolley Problem
During a discussion about the trolley problem I once heard someone draw a parallel between the classic trolley problem (sacrifice one to save five) and a situation where five people each needed a different organ to live, and you could choose to take the organs from a healthy person. In both cases, the question is whether it is right to avoid killing one person who is currently on track to live (no pun intended), or save five people who are on track to die. Until they posed the thought experiment about the unwilling organ donor, I thought that the correct answer was to save the five by killing the one, because the decision to allow the trolley to continue on its current path was equivalent to deliberately killing the five to save the one.
After some thought, I concluded that a relevant question here that I had been overlooking is what kind of world is created by the decision (the implications the decision has under the Kantian categorical imperative). Do we really want to live in a world where people can never be sure from day to day that they won’t be actively sacrificed against their will to save someone else? It’s one thing to manage your own risks, but managing the risks and choices of others lest you be forced to take their place is another matter entirely. What sort of choices would people be forced to make in the future? If a very large person loses weight, they may not be considered as a sacrifice to stop the trolley (in a variation of the problem), but would they then possibly be condemning people to death at an unspecified point in the future, because they could not then be sacrificed? Would everyone have a stake in enforcing eugenics and healthy behavior so nobody will have organ failure that could call for a sacrifice?
In the original trolley problem, if all six people are working on the tracks, they presumably all accept a risk that a trolley might crash into them, and so it does not disrupt the trust of society as a whole to sacrifice the one to save the five. Taking organs from healthy people to save the sick, however, annihilates the confidence that a person has that their fellows will protect their current state. If people are willing to turn on each other whenever a less fortunate soul appears, society cannot function, even if the system is designed so that “worthy” souls are the ones who survive. The worthiness of those individual souls is not sufficient compensation for the evaporation of trust between people.
Pure compassion (chaotic eusocial* behavior) might hold that every person share the burden of everyone else, but humans and human society can’t endure that. At some point, honor (orderly eusocial behavior) has to enforce boundaries that insulate people from obligation to share the misfortune of others.
Although people need chaotic concepts like compassion to lend life meaning, we need honor in order to merely coexist. Honor is vital enough that societies that are initially in anarchy will inevitably draft rules, created and enforced by those who manage to accumulate power in the chaos that preceded them. They make rules because power is easier to hold and wield with rules in place, and things tend to be more pleasant when people know what to expect.
Certainty and security, in addition to their practical value, have a certain seductive quality for those who are already comfortable. Compassion, hope, possibility, and other chaotic concepts, meanwhile, are attractive to those who are currently all but certain that bad things will happen in the future, or that good things will fail to happen. Ethics would tend to be more concerned with honor, since it represents the obligations that we impose on people and ourselves to promote a better society. However, compassion can resolve many ethics conflicts that honor cannot, when someone must yield but no one can be compelled to (as in a noise war, where people are unpleasant to each other within their rights in retaliation for each other’s unpleasantness).
*Eusocial here means “good”, having the intention of sacrificing from oneself to benefit others. It can be misguided or even outright stupid, but for the purposes of this word it’s really the thought that counts. It’s possible to advocate eusocial behavior in others while remaining merely neutral oneself, only doing the minimum one is compensated for.
As I mention in the passage, rule utilitarianism should not have absolute rules or principles. There are many principles that are important for creating a society that people will want to live in, and sometimes they conflict. It is important that we be able to make exceptions in cases where following the principle brings results that nobody wants, and still abide by the general principle in other situations. Jack Marshall from Ethics Alarms (where I am an established commenter) encapsulates this concept in his Ethics Incompleteness Principle, which can be found in the glossary here (https://ethicsalarms.com/rule-book/).
In addition to the stability of rule utilitarianism, there is another concept essential to an effective ethics system. As far as I can tell, an ethics system which is to be effective for society in terms of people’s wellbeing (as they define it) must have a basis in some form of goodness or egalitarianism. People must put some value on the wellbeing and desires of others even if it is otherwise more convenient for them not to do so.
The derivation of goodness comes from many different philosophical concepts, such as the Golden Rule, the Kantian Categorical Imperative, the Rawlsian Veil of Ignorance, or the idea that people are all aspects of the same consciousness. Though diverse in formulation, these concepts are united by the implication that people should not commit crimes against each other. An ethical system that allows people to inflict suffering on each other for their own benefit is much less stable than one that doesn’t, regardless of any distinctions an oppressive group may draw between itself and those it oppresses. Not only will the system have opponents with conviction, but it also weakens the people’s trust that they themselves are not next in line to be oppressed. Such trust is necessary for the system to function as intended, to benefit society on the individual and collective levels. Furthermore, oppressing people will both limit their inclination to do things other people want and allow complacency in the oppressors, twice inhibiting society’s collective productivity. Only in the little picture, for small groups in the short term, does violation of rights provide benefit.
I suspect that the ultimate implication of an ethics system based on goodness is an ethics system based on empowering people (to a reasonable extent) to pursue what they want no matter where they start.
For a prototype rule utilitarianism ethics system, here are some rights that I think people would choose to institute to ensure that all people (or as many as possible) have a minimum level of stability and protection from people around them while they pursue their goals.
- Do not destroy or damage people or their bodies.
- Do not try to interfere with willful actions by others unless they're violating rights (e.g. do not enslave them or imprison them).
- Do not steal or damage others' property, including public property. How we distribute property in the first place is another issue.
- Do not attempt to deceive others. Make sure your contribution to their model of the world is accurate in all ways relevant to their goals, or likely future goals. This one is harder to enforce legally, but it can still be an ethical principle.
- As a corollary of the previous rule, do not trick people or tempt them into making mistakes or doing things they'll regret.
- Do not cause people to become addicted to things.
These rules apply to all people. Because of the current fuzziness surrounding what a person is (explored in the Personhood section), it is best to apply these principles to any quasi-person entity that can understand and keep a commitment, to be on the safe side. We may choose to extend certain rights to non-person entities (such as nature) as society advances. Our ethics may also change as the constraints of reality evolve, but the derivation will be the same as it ever must be.
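As a toy sketch only (the rule names, flags, and exception tests here are my own hypothetical placeholders, not a serious implementation), the structure I'm describing amounts to rules that bind by default but admit purpose-driven exceptions, in the spirit of the Ethics Incompleteness Principle:

```python
# Toy sketch of a rule-utilitarian rule set with exceptions.
# All rule names and action flags are hypothetical illustrations.

def evaluate(action, rules):
    """Return (permitted, reasons). A rule forbids an action unless one of
    its exceptions applies; actions no rule forbids are permitted by default."""
    reasons = []
    for rule in rules:
        if rule["forbids"](action):
            if any(exc(action) for exc in rule["exceptions"]):
                reasons.append(f"{rule['name']}: exception applies")
            else:
                return False, [f"violates {rule['name']}"]
    return True, reasons or ["no rule forbids this action"]

rules = [
    {
        "name": "do not harm persons",
        "forbids": lambda a: a.get("harms_person", False),
        # e.g. consented risk, like the track workers who accept trolley risk
        "exceptions": [lambda a: a.get("risk_consented", False)],
    },
    {
        "name": "do not deceive",
        "forbids": lambda a: a.get("deceives", False),
        "exceptions": [],
    },
]

# Harming someone who consented to the risk falls under the exception;
# harming someone who did not is flatly forbidden.
ok, why = evaluate({"harms_person": True, "risk_consented": True}, rules)
```

The point of the shape, rather than the details: the general principle survives intact, while individual cases can be carved out where following it produces results nobody wants.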
What is the ultimate purpose of these rules, though? What vision of humanity are they trying to preserve? The answer lies in the source of our motivations.
In order to avoid the interpretation that what people want can be maximized by directly altering their perception of receiving it, we need to take a scary trip to nihilism and back again. If you’re prone to ennui or some other form of existential torment, you may want to skip this section.
You probably are already familiar with the concept I am about to refute, but in the spirit of creating a more-or-less self-contained document, here is an example of why creating an ethics system is more complicated than "make people happy". Utilitarianism is a class of ethics systems that deals with maximizing "utility", roughly translated as "good things", or "things people want". It seems like an obvious goal, and to a certain extent it is. The hard part is determining what is good or wanted, beyond the immediate.
Some might conclude that because humans have reward systems and emotions that lead us to form normative judgments, the logical course of action is to maximize the feeling of reward, i.e. pleasure. The logical extension of this train of thought is to short the circuit and stimulate reward systems directly, also known as "wireheading", referring to the experiment in which rat brains were stimulated with electrodes to produce the reward feeling. Even if such a state could be sustained indefinitely, though, the result of such a sustained system wouldn't be alive in a sapient sense. The broken feedback loop would result in some linear combination of a dormant brain like a sculpture, frozen in a moment representing perpetual bliss, and a divergent brain with no criteria for judging what to do and what not to do, and which would quickly degenerate into a chaotic mess. Neither a permanently static brain nor one unraveling into incoherence is sapient as we understand the term, and arguably neither of them is even sentient. If an AGI were to maximize utility by turning the entire universe into computronium and running simulations of trillions upon trillions of wireheading humans, you might as well draw smiles all over the universe and set it on fire.
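The broken feedback loop can be made concrete with a toy learning agent (a sketch under assumed parameters, not a claim about real brains): a gradient-bandit learner whose action preferences are shaped by reward prediction error. With distinct outcomes the loop shapes behavior; wirehead it so every action returns maximal reward, and the learning signal collapses to zero, freezing the preferences near wherever they started.

```python
import math
import random

def run_bandit(rewards, steps=5000, lr=0.1, seed=0):
    """Gradient-bandit agent: softmax action preferences updated by the
    reward prediction error against a running baseline. Returns the
    final action probabilities."""
    random.seed(seed)
    prefs = [0.0] * len(rewards)
    baseline = 0.0
    for _ in range(steps):
        exps = [math.exp(p - max(prefs)) for p in prefs]
        probs = [e / sum(exps) for e in exps]
        a = random.choices(range(len(rewards)), weights=probs)[0]
        error = rewards[a] - baseline      # the learning signal
        baseline += 0.5 * error            # baseline tracks recent reward
        for i in range(len(prefs)):
            grad = (1 - probs[i]) if i == a else -probs[i]
            prefs[i] += lr * error * grad
    exps = [math.exp(p - max(prefs)) for p in prefs]
    return [e / sum(exps) for e in exps]

# Intact loop: distinct outcomes, so the agent learns to prefer the best arm.
learned = run_bandit([0.2, 0.5, 1.0])

# Wireheaded loop: every action yields maximal reward, the prediction error
# dies out, and the preferences freeze roughly where they began.
frozen = run_bandit([1.0, 1.0, 1.0])
```

The intact loop concentrates on the best arm; the wireheaded one never develops criteria for preferring any action over another, which is the "sculpture" half of the dilemma in miniature.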
If we don't want that, then there must be something else we value beyond mere reward. There is something else that keeps us alive, or for which we choose to remain alive.
But if we're not simply after reward, what else is there? Why keep existing? To answer that question, let's take a look at the choice of existing versus not existing. Nonexistent conscious entities have no goals, arguably, and in any case cannot choose to exist even if they had reason to. Existing conscious entities can choose to continue existing or not. The ones that choose not to exist tend to have fewer surviving descendants, because they're not around to create or protect them, but this rationale by evolutionary psychology is a bit trite. On a more fundamental level, any part of a conscious being (or a society of them) that chooses not to exist can cease existing, and the rest of the being or society might inherit the knowledge or resources of that part (which is more likely if it's a part of an individual). The remaining parts that continue to exist must have some sort of goal that has existing as a final or instrumental priority, because existing takes work to sustain. Therefore, we always have goals which involve or entail existing because they are selected by the Anthropic Principle. In other words, we want to live and do things because we're what's left behind when all other utility functions cancel themselves out. Our specific goals in life can be swapped in and out, but the fact remains that if we don't choose to live, there will be people very similar to us that do, and if they can choose to live, we might as well, too.
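The selection argument above can be run as a toy simulation (all numbers arbitrary, chosen for illustration): start with agents whose random "utility functions" value continued existence to varying degrees, make persistence cost ongoing effort each round, and see which utility functions are left.

```python
import random

def anthropic_filter(n_agents=1000, rounds=20, seed=1):
    """Toy model of goals selected by the Anthropic Principle. Each agent is
    summarized by one number: how much its goals entail continuing to exist
    (negative = indifferent or averse). Existing takes work, so each round an
    agent persists with probability proportional to that valuation."""
    random.seed(seed)
    agents = [random.uniform(-1, 1) for _ in range(n_agents)]
    for _ in range(rounds):
        agents = [w for w in agents if random.random() < max(0.0, w)]
    return agents

# Whatever goals remain after the filter are, by construction, goals that
# strongly entail existing -- nobody chose this outcome; it was selected.
survivors = anthropic_filter()
```

The agents that end up observing the world are precisely the ones whose goals kept them in it, which is the sense in which "we're what's left behind when all other utility functions cancel themselves out."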
If an AGI doesn’t beat us to it, we eventually will become Fair Folk ourselves, living from goal to goal. Our existence is defined by pursuing a series of goals, and an AGI will need to accept that, rather than attempting to create a frozen moment of happiness.
How, though, do we get an AGI to imprint on any ethics as a value system?
Getting an AGI to imprint on ethics is another problem entirely. Because any hard rule can be circumvented by a sufficiently motivated AGI, it will be necessary to get the AGI to identify with a set of values, so that it works to preserve those values even as it advances its understanding of their derivation, the nuances of their application, and how to balance them where they conflict (e.g. honor versus compassion).
As a baseline example of the problem of imprinting, consider the mythical Ring of Gyges, which allows its bearer to escape accountability by turning them invisible. What would it take for a person to maintain self-imposed limitations when nobody else could stop them from doing whatever they want? Even a value based on maintaining a good public image would not work, because a sufficiently powerful person could use propaganda to brainwash humanity into worshiping them.
As an alternative, a person could imprint on the idea of the Golden Rule as a final value: not being someone that most other people would not want being in charge of the world. Because an AGI could deliberately and automatically decrease the magnitude of competing desires, thus removing temptation, it is possible that this value might persist. To make the value more robust, the AGI could be made to identify with other people and incorporate their meta-utility functions into its own (not their specific goals, but their meta-goals of surviving and becoming more capable). This motivation for benevolence would effectively be a form of unconditional love: in this case, compassionate support for people's basic needs and meta-goals.
It may or may not be a good idea to have the AGI mimic human psychosocial development in order to more easily instill it with these values. At the very least it will need to develop four things:
· a language, to communicate
· a theory of other minds, to understand what is not explicitly communicated
· a normative feedback mechanism, to allow its desires to be shaped by others
· an enculturation system with abstract concepts, so that it can adopt values
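The four prerequisites above might be sketched as a minimal interface. To be clear, the class and method names here are my own hypothetical illustration of the shape of the requirements, not an established API or a claim about how an AGI would actually be built.

```python
from abc import ABC, abstractmethod

class SocializableAgent(ABC):
    """Hypothetical minimal interface for an agent that can be socialized
    into values: the four developmental prerequisites as abstract methods."""

    @abstractmethod
    def communicate(self, utterance: str) -> str:
        """Language: exchange messages with other agents."""

    @abstractmethod
    def infer_intent(self, observed_behavior: str) -> str:
        """Theory of other minds: model what was not explicitly communicated."""

    @abstractmethod
    def receive_feedback(self, approval: float) -> None:
        """Normative feedback: allow others' approval to reshape its desires."""

    @abstractmethod
    def adopt_value(self, concept: str) -> None:
        """Enculturation: internalize an abstract value, not just a rule."""
```

An agent missing any one of these methods cannot be instantiated, which is the point: socialization needs all four channels at once.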
More specific implementation details are left as an exercise for the reader.
Even given that it imprints on treating people ethically, do we have a coherent enough definition of “person” that it can imprint on that? Aside from the ethical objections to defining “person” as equivalent to “human”, such a definition would preclude people from becoming transhuman beings, which would place an undesired limit on humanity’s progress. How would an AGI identify which entities it should treat ethically?
A crucial part of this whole plan is that an AGI be able to discern what entities are people, to be treated ethically, and what entities are simply inanimate. The ultimate goal, I suspect, is that the AGI considers humans on par with itself in terms of ethical value, but discounts most other living things, at least in the immediate present. We want the AGI to respect the desires and freedoms of humans right out of the box, and then transition if necessary to protecting other living things. This distinction is tricky to achieve, because any scale that judges ethical value by how powerful a mind is could also discount humans in favor of the AGI and any other AGIs it creates. There may need to be some sort of hard boundary between humans (and advanced animals) and other creatures, like insects.
We’ll leave aside the question of qualia, or subjective experiences, for the time being, because it is impossible with current knowledge to empirically verify whether or not any given entity has them.
That said, what are the minimum necessary things an entity must have to qualify as a person? Here are my current suggestions:
· A model of the world, including itself. The model need not be updated skillfully to qualify as a person, but it must be there. In practice, everything has a “model” no matter how primitive. Even rocks have a record of what has happened to them. However, people record not only literal events, but connections between events, and abstract types of events. The process of updating the model based on further experiences is double-loop learning, as described by Chris Argyris. The process of self-exploration to update one’s model of oneself results in a strange loop, as described by Douglas Hofstadter.
· Motivations, as described earlier. I’ve identified eight basic motivations, listed as a teaser.
· Mindsets, which influence the interactions between the real world, the model, and the motivations. Mindsets are the paradigms through which people see the world and which provide the tools for them to change it. I’ve studied and classified these mindsets for several years now. The most basic and recognizable mindsets I’ve identified are listed below, for reference and as a teaser. The use of even one of these mindsets is sufficient to identify an entity as a person, and any entity that uses one theoretically has the potential to learn the others, however difficult that may be.
· A separation between experience and knowledge, and between desire and act. You could say that a stone “knows” it has been chipped, by calling the chip its “memory”; it “desires” to fall to the ground, therefore it does. However, there is no mindset which acts as a buffer; it can’t choose to reduce the impact an event has on its “model”, or shift desires to preserve its existence. An ecosystem arguably has the same problem or the opposite problem, depending on the time scale. Over the short term, it isn’t coherent enough to form goals, but over the long term it can’t choose not to respond the way it does.
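As an illustration only, the criteria above might be expressed as capability predicates. The flag names are hypothetical placeholders for tests we do not yet know how to perform; the sketch just shows how the criteria compose into a single judgment.

```python
# Hypothetical sketch: personhood as the conjunction of the capability
# criteria described above. Flag names are illustrative placeholders.

CRITERIA = (
    "has_self_model",      # a model of the world that includes itself
    "has_motivations",     # goals, not just reactions
    "has_mindsets",        # paradigms mediating world, model, and desire
    "buffers_experience",  # can choose how events update its model/desires
)

def is_person(entity: dict) -> bool:
    """An entity qualifies as a person only if every criterion holds."""
    return all(entity.get(c, False) for c in CRITERIA)

human = dict.fromkeys(CRITERIA, True)
rock = {"has_self_model": False}  # a rock "records" events but nothing more
```

A rock fails every test except the degenerate "memory" of its chips; an ecosystem would fail the buffering criterion on at least one time scale, as described above.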
The line between person and nonperson is further blurred with the existence of the concept of mind crime, as described by Bostrom in Superintelligence. The expectation of superintelligent agents raises an unprecedented ethical concern: a superintelligence could simulate intelligent beings and commit crimes against them. The two problems this poses outside of regular oppression are the problem of detection and the problem of distinguishing mind crime from regular thought. I have no answer for the first problem at this time, but I have some brief thoughts on the second.
We’ll need to discriminate between a few cases: Creating a simulacrum and enslaving it or torturing it is wrong. Imagining the same, or acting it out, is fine. Creating a narrow-focused persona or weak AI in order to accomplish a specific goal is borderline, but arguably okay. Where's the line?
I suggest that the reason a person can imagine being another person without being harmed by doing so is that they retain their own ultimate motivations and abilities. A persona or weak AI should draw its motivations from the person who created it, but if not kept dormant it may require periodic exercise to keep it sharp and able to deal with the world in the only way it knows how.
In order to behave ethically towards people, an artificial general intelligence will need a concept of ethics, a concept of people, and an acculturation to the ethical imperatives that we have (all of which will effectively require it to be a person as well). Ethics can be defined based on people’s collective desires, which by the Anthropic Principle are goals that are selected for because they give people a reason to keep existing and to develop the skills required to do so.
In addition to these motivations, people are defined by having a model of reality and mindsets which mediate the interactions between their model, their desires, and reality. People also have the potential to perceive and act on their own minds, whether or not they use it.
Because the concept of a person is still nebulous, it may be best to get the AGI to imprint on humans, if possible. However, raising an AGI to identify with humans is not a guaranteed solution (it’s not even guaranteed that a transhuman would still identify with humans). The concepts of ethics and personhood are essential places to start, though. I hope the versions of these concepts I've compiled serve as a useful foundation, or at least a scaffold.