In what language should we define the utility function of a friendly AI?

by Val 3 min read5th Apr 201522 comments


I've been following the "safe AI" debates for quite some time, and I would like to share some of the views and ideas I don't remember seeing to be mentioned yet.

There is a lot of focus on what kind of utility function should an AI have, and how to keep it adhering to that utility function. Let's assume we have an optimizer, which doesn't develop any "deliberately malicious" intents, and cannot change its own utility function, and it can have some hard-coded constraints it can not overwrite. (Maybe we should come up with a term for such an AI, it might prove useful in the study of safe AI where we can concentrate only on the utility function, and can assume the above conditions are true - for now on, let's just use the term "optimizer" in this article. Hm, maybe "honest optimizer"?). Even an AI with the above constraints can be dangerous, an interesting example can be found in the Friendship is Optimal stories.

The question I would like to rise is not what kind of utility function we should come up with, but in what kind of language do we define it.

More specifically how high-level should the language be? As low as a mathematical function working with quantized qualities based on what values humans consider important? A programming language? Or a complex, syntactic grammar like human languages, capable of expressing abstract concepts? Something which is a step above this?

Just quantizing some human values we find important, and assigning weights to them, can have many problems:


1. Overfitting.

A simplified example: imagine the desired behavior of the AI as a function. You come up with a lot of points on this function, and what the AI will do is to fit a function onto those points, hopefully ending up with a function very similar to the one you conceived. However, an optimizer can very quickly come up with a function which goes through all of your defined points and the function will not look anything like the one you imagined. I think many of us encountered this problem when we wanted to do a curve-fitting with a polynomial of too high degree.

I guess many of the safe AI problems can be conceptualized as an overfitting problem: the optimizer will exactly fulfill the requirements we programmed into it, but will arbitrarily choose the requirements we didn't specify.


2. Changing of human values.

Imagine that someone created an honest optimizer, though of all the possible pitfalls, designed the utility function and all the constraints very carefully, and created a truly safe AI, which didn't became unfriendly. This AI quickly eliminated illness, poverty, and other major problems humans faced, and created a utopian world. To not let this utopia degenerate into a dystopia over time, it also cares for maintaining it and so it resists any possible change (as any change would detract from its utility function of creating that utopia). Seems nice, doesn't it? Now imagine that this AI was created by someone in the Victorian era, and the created world adhered to the cultural norms, lifestyle, values and morality of that era of British history. And these would never ever change. Would you, with your current ideologies, enjoy living in such a world? Would you think of it as the best of all conceivable worlds?

Now, what if this AI was created by you, in our current era? You sure would know much better than those pesky Victorians, right? We have much better values now, don't we? However, for people living in a couple generations, these current ideas and values might become so much strange to them as strange the Victorian values are to us. Without judging either the Victorian or current values, I think I can safely assume that if a time traveler from the Victorian era arrived to this world, and if a time traveler from today was stuck in the Victorian era, both would find it very uncomfortable.

Therefore I would argue that even a safe and friendly AI could have the consequences of forever locking mankind to the values the creator of the AI had (or the generation of the creator had, if the values are defined by a democratic process).



We should spend some thoughts on how do we formulate the goals of a safe AI, and what kind of language should we use. I would argue that a low-level language would be very unsafe. We should think of a language which could express abstract concepts but be strict enough be able to be defined accurately. Low-level languages have the advantages over high-level ones of being very accurate, but they have disadvantages when it comes to expressing abstract concepts.

We might even find it useful to take a look at real-life religions, as they tend to last for a very long time, and can carry a core message over many generations of changing cultural norms and values. My point now is not to argue about the virtues or vices of specific real-world religions, I only use them here as a convenient example, strictly from a historical point of view, with no offense intended.

The largest religion in our world has a very simple message as one if its most important core rules: "love other people as yourself". This is a sufficiently abstract concept so that both bronze-age shepherds and modern day computer scientists understand it, and the sentence is probably interpreted not much differently. Now compare it to the religion it originated from, which has orders of magnitudes fewer followers, and in its strictest form has very strongly defined rules and regulations many of which are hard to translate into the modern world. A lot of their experts spend a considerable time to try to translate them to the modern world, like "is just pressing a single button on a washing machine considered working?". What about hygiene practices which made sense for nomadic people in the desert, how can they be understood (and applied) by modern people? Concepts expressed in a high-level language can carry their meaning much better across times with changing cultural, social and technical characteristics.

However, a rule like "on a calendar day divisible by seven you are only allowed to walk x steps" is easy to code, even many of our current robots could easily be programmed to do it. On the other hand, expressing what love is will prove to be much harder, but it will preserve its meaning and intention for much longer.