Please forgive me for being awkward in my first post; I just wanted to get these ideas out there to facilitate discussion on the topic. I'm also sorry if I use any newly-discovered terminology incorrectly, but I am doing my best.

Introduction: Utilitarianism is Unaligned

Intuitively, I think an important component of modeling a moral system for friendly AGI is deciding which ethical philosophy such a system would be built on.

From my experience, many people who discuss a hypothetical moral system for friendly AI presume that such a system would work on a utilitarian basis: for an array of generated policies, each policy is scored by some metric representing the collective good of humanity or civilization, and the AI selects the policy that maximizes this value.

I feel that such an approach is naïve, and isn't even an expectation we hold for humans in the real world, much less for an AI. From what I understand, Utilitarianism has been criticized for as long as it has been formally stated, despite famous adages like "the needs of the many outweigh the needs of the few". Thought experiments like the Trolley Problem are meant to suggest that killing n people is permissible if it allows n+1 people to survive, but this doesn't align with our moral conscience. It also doesn't work for a superintelligent AI, because a policy that sacrifices 49% of the population for the benefit of the other 51% should be totally unacceptable.

In reality, we want a moral system that says killing people is always wrong, at any time, for any reason, and under any circumstance. No matter how cosmically intelligent the AI is, there is no scenario where such a sacrifice is appropriate. Thus, a mathematical model based on Utilitarianism, while making sense on paper, is ultimately unaligned with our actual moral intentions.

Deontology as an Ethical Foundation for FAI

Now, I am aware there are more ethical systems out there besides Utilitarianism and Deontology, but I just wanted to outline here some thoughts I had on how utility functions based on a Deontological model could be constructed. I haven't seen anyone else on the site use this approach, but if there is any existing literature out there that discusses Deontological models for friendly AI, I would really like to read more about it.

An AGI that uses a Deontological model for morality is very different from one that simply has a set of rules slapped on about what it can or cannot do. Rather, the goal here is to construct an algorithmic model that is analogous (although not identical) to the ethical systems humans use in the real world.

Quite a lot of humans use a Deontological approach to morality, either deliberately or unconsciously. Religious individuals follow moral codes sacred to their beliefs; non-religious individuals follow personal moral beliefs derived from informal norms, traditions, or mores. Democratic governments are bound by constitutions and international law, social media sites have terms of service and policies, and so on. From that perspective, it makes perfect sense why we would want a superintelligent AI to be bound by a Deontological system.

One Possible Algorithm for a Deontological Model

Obviously, designing such a Deontological model could be an entire field of research on its own, but I'd like to outline my own rough idea of what the process might look like. First, we construct a series of commandments that the AI should adopt as its absolute moral code, say Asimov's Laws, for example. Now, suppose the AI has been given a task, and its algorithms generate an array of possible policies, each associated with a Q-table of state-action pairs. Then, in consideration of Asimov's First Law, the AI filters these policies in the following manner:

  • Any policy that contains a state-action pair that brings a human closer to harm is discarded.
  • If at least one policy contains a state-action pair that brings a human further away from harm, then all policies that are ambivalent towards humans should be discarded. (That is, if the agent is aware of a nearby human in immediate danger, it should drop the task it is doing in order to prioritize the human life.)
  • This kind of filter would be iterated through any other commandments the AI is programmed with.
  • For the policies that remain, they can be processed with the normal utility functions of Reinforcement Learning.
  • If the array of policies is empty, then cancel the entire operation and return an error.
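The filtering steps above can be sketched in code. This is only a toy illustration under strong assumptions: policies are plain dictionaries, each state-action pair carries a hypothetical harm_delta (positive means the action moves a human toward harm, negative means away from harm), and the utility value is assumed to be precomputed. None of these names come from any real RL library.

```python
# Toy sketch of the deontological filter described above. All field names
# (pairs, harm_delta, utility) are hypothetical, chosen for illustration.

def filter_policies(policies):
    """Apply the First-Law-style filter, then fall back to normal utility."""
    # 1. Discard any policy containing a state-action pair that brings a
    #    human closer to harm.
    survivors = [p for p in policies
                 if all(harm_delta <= 0 for (_, _, harm_delta) in p["pairs"])]

    # 2. If at least one survivor actively moves a human away from harm,
    #    discard the policies that are merely ambivalent toward humans.
    protective = [p for p in survivors
                  if any(harm_delta < 0 for (_, _, harm_delta) in p["pairs"])]
    if protective:
        survivors = protective

    # 3. Any further commandments would be applied as additional filter
    #    stages here, in priority order.

    # 4. If no policy survives, cancel the operation and report an error.
    if not survivors:
        raise RuntimeError("no policy satisfies the moral commandments")

    # 5. Rank the remaining policies with the ordinary utility function.
    return max(survivors, key=lambda p: p["utility"])
```

Note that in step 2 the protective policy wins even if its task-utility is lower: rescuing the human preempts the original task, matching the second bullet above.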

Now, the above algorithm is certainly rough around the edges, and a lot of special edge cases would have to be examined. For example, one would have to determine a threshold defining "immediate danger". A human standing in a subway station has a non-zero probability of falling into the gap, even if he is standing 100 feet away from it. But for the AI, dragging the human 101 feet away from the gap would technically bring him further away from harm. So we would have to set some small positive value, say eta, such that any probability of harm less than eta can be forgiven.
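A minimal sketch of the eta threshold idea, where harm_probability and the particular eta value are hypothetical quantities chosen purely for illustration:

```python
ETA = 1e-4  # hypothetical threshold: probabilities of harm below eta are forgiven

def in_immediate_danger(harm_probability, eta=ETA):
    """Return True only when the probability of harm reaches the threshold.

    The bystander 100 feet from the subway gap might have a harm probability
    of, say, 1e-9, which falls below eta, so no intervention is triggered.
    """
    return harm_probability >= eta
```

How eta is chosen, and how the harm probability itself is estimated, are exactly the kind of open edge-case questions mentioned above.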

Another possible issue is that the Deontological model itself could be accidentally altered as the AI evolves and rewrites its own code. I believe that an AGI should be limited in what code it can alter for the sake of normal optimization problems. Perhaps the utility functions related to morality should be located in a separate hardware module altogether.

Possible Counter-Arguments

Now, on the website the closest I saw to a counter-argument against a Deontological model is found in the Superintelligent FAQ, under the question "Can we specify a code of rules that the AI has to follow?" The crux of the counter-argument is described in this quote:

Suppose the AI chooses between two strategies. One, follow the rule, work hard discovering medicines, and have a 50% chance of curing cancer within five years. Two, reprogram itself so that it no longer has the rule, nuke the world, and have a 100% chance of curing cancer today.

I will take a moment to address this argument, even though it is based on a very different scenario (slapping on some rules ad hoc versus constructing a moral system based on Deontology). There are two reasons why I consider this scenario very implausible in a Deontological system.

First, in order for the AI to come to this conclusion, one presumes that it is weighing its options against a policy that includes nuking humans. But, because the Deontological commandments are still in place, then this policy has already been disregarded as soon as it was generated. Thus, the AI should not be capable of weighing the option in the first place.

In fact, one could be extra cautious and include a Deontological commandment that forbids the AI from altering its own moral system. Because all policies that involve altering its own moral system are discarded, the AI should be incapable of even conceiving of a world in which its moral system does not exist.

To further build on this point, it is possible that a more advanced intelligence would start to realize that constructing a policy that involves harming humans is essentially a waste of resources, because such a policy is being discarded anyway. Thus, it is possible that the next evolution of AI would opt to not even generate such a policy in the first place.

Second, this scenario presumes that the AI prioritizes the completion of its task ahead of the Deontological commandments, which is a bad design. There is a reason why "obey the orders of a human" is Asimov's Second Law, and not the First.

Let's forget about cancer and nukes for a second, and instead imagine that you order a robot to pick up a gun and shoot someone. In order to achieve alignment, we don't want the robot to figure out some way of circumventing its moral code to accomplish this task. Rather, we expect the robot to disregard the order entirely, and report an error to the user. In other words, the primary function of AI should be to uphold the moral code it is designed with, while it is only the secondary function to accomplish the goals that humans set for it.
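This priority ordering can be made concrete with a small sketch. The names and the violates_moral_code predicate are hypothetical, chosen just to illustrate the refuse-and-report behavior:

```python
class OrderRefused(Exception):
    """Error reported to the user when an order conflicts with the moral code."""

def execute_order(order, violates_moral_code):
    # Primary function: uphold the moral code. A forbidden order is not
    # worked around; it is rejected outright and reported as an error.
    if violates_moral_code(order):
        raise OrderRefused(f"order {order!r} violates the moral code")
    # Secondary function: accomplish the goal the human set.
    return f"executing: {order}"
```

The point is structural: the moral check runs before any task planning, so there is no code path in which the agent searches for a way around its own code.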

Redefining "Tool AI"

Now, the kind of Deontological commandments needed for an AI are distinctly different from the kinds of moral obligations humans have for each other. This, I believe, is something that distinguishes a sentient being (such as a human person) from a superintelligent tool (i.e., an FAI).

This is tangentially related to Karnofsky's argument for "Tool AI", but I would define the term a little differently. Karnofsky seems to distinguish between an algorithmic tool and an autonomous agent, using the example of Google Maps that plots a route but doesn't move the car for you.

However, in my conception an autonomous agent can still be a tool. Take, for example, a self-driving car. It can calculate the optimum route to your destination and take you there, but it is still a tool because it is merely serving the function that it was designed for. The car doesn't hesitate to consider why the user wants to go to this location, nor does it consider whether doing so will make the user happier or healthier. It understands the task and accomplishes it without question.

In other words, a sentient being acts upon its own spontaneous desires, whereas a tool has no desires outside of the functions it's designed for. It is my belief that a superintelligent AI, no matter how advanced, must always fall into the latter category, and purely exist at the pleasure of humanity.

I don't believe an AI should cooperate with humans because it ran some calculation and decided that cooperation was the dominant strategy. Rather, it should cooperate with humans simply because that is what it is designed to do. As said in that famous video game, "a man chooses, a slave obeys". Or as another analogy, the Catholic catechism indicates the relationship between the Creator and the creation: "What is the purpose of man? To love God and enjoy Him forever".

There is a certain notion I get from certain people who believe that a superintelligent AI should be allowed to do whatever it feels is best for humanity, even when humans don't understand what is best for ourselves. I believe this is also a bad design, because an AI that no longer acts like a tool is violating humanity's control over our own destiny. Woe betide our civilization if this overlord AI was smart enough to control the world, but not wise enough to keep it safe, and humanity was too myopic to question these unwise decisions.

I would rather see humanity destroy itself, just to know that it was our own fault and under our own responsibility, than to leave open the possibility for humanity to be destroyed by an entity we had no control over.


I think deontology is not about making decisions, but rather about carrying them out. It's deference, an aspect of corrigibility that's not about attainment of goals. For an updateless agent, there is a problem of predictably following the rules/principles set out in the logical past (so that the past can anticipate that fact and plan for the consequences of future's compliance), and deontology captures the spirit of that activity much better than consequentialism.

Decisions are governed by preference that could be attained in a corrigible way. Actions are governed by rules that could be attained in a deferential way. Decisions are consequentialist, determined by preference looking at consequences, ignoring the local moral/epistemic state (this is updatelessness). Actions are deontological, determined by rules looking at local moral/epistemic state, ignoring the consequences (a kind of myopia).

If you are interested, here is Eliezer Yudkowsky's moral theory for AI:

  • Any policy that contains a state-action pair that brings a human closer to harm is discarded.
  • If at least one policy contains a state-action pair that brings a human further away from harm, then all policies that are ambivalent towards humans should be discarded. (That is, if the agent is aware of a nearby human in immediate danger, it should drop the task it is doing in order to prioritize the human life.)

This policy optimizes for safety. You'll end up living in a rubber-padded prison of some sort, depending on how you define "harm"; e.g. maybe you'll be cryopreserved for all eternity. There are many things people care about besides safety, and writing down the full list and their priorities in a machine-understandable way would solve the whole outer alignment problem.

When it comes to your criticism of utilitarianism, I don't feel that killing people is always wrong, at any time, for any reason, and under any circumstance. E.g. if someone is about to start a mass shooting at a school, or a foreign army is invading your country and there is no non-lethal way to stop them, I'd say killing them is acceptable. If the options are that 49% of population dies or 51% of population dies, I think AI should choose the first one.

However, I agree that utilitarianism doesn't capture the whole of human morality, because our morality isn't completely consequentialist. If you give me a gift of $10 and forget about it, that's good, but if I steal $10 from you without anyone noticing, that's bad, even though the end result is the same. Jonathan Haidt in "The Righteous Mind" identifies six foundations of morality: Care, Fairness, Liberty, Loyalty, Purity, and Obedience to Authority. Utilitarian calculations are only concerned with Care: how many people are helped and by how much. They ignore other moral considerations; e.g. having sex with dead people is wrong because it's disgusting, even if it harms no one.

I mostly think of Eliezer's reply to Holden on "Tool AI" as missing the point (though it's a precursor to later arguments that make more sense). However, in this case, I think it's got a lot of stuff you might need to internalize.

Obviously, this approach would rule out a large number of use cases for AI in which it could make things much better but risks hurting someone in the process. It is thus not a fully general solution, but it could still be useful in many other cases.

Another, more pressing problem is that a tool AI is likely not sophisticated enough to realize how likely its actions actually are to harm humans. For instance, a language model can't conceive of the harm its words could cause (assuming language models ever advance enough to pose actual danger).

This seems more like a way to make an AGI that acts like a tool AI. It could be interesting to attempt if the time comes, but there is a disconnect: giving extra capabilities to an AI in order to reduce the risk that it does something unexpected seems like a plan with a reasonably high chance of failure.

For use on actual tool AI, it seems we would be stuck with whatever narrow rules humans can come up with for it, but 'harm' is very difficult to define in rules. (A utilitarian making the rules might very well argue that it would check to see if it caused negative utility to any humans, and that would be the definition of 'harm', but it is not widely agreed to, especially deontologically, and it still has the issue of not knowing the values of the people it is affecting.)

I do agree with the sentiment that it is better to have humanity itself be the risk case, rather than an alien intelligence we don't know how to deal with. (I am personally very skeptical that there is any real danger of super-powerful AI any time soon though.)