# 7

I've been following the "safe AI" debates for quite some time, and I would like to share some views and ideas that I don't remember seeing mentioned yet.

There is a lot of focus on what kind of utility function an AI should have, and how to keep it adhering to that utility function. Let's assume we have an optimizer which doesn't develop any "deliberately malicious" intents, cannot change its own utility function, and can have some hard-coded constraints it cannot overwrite. (Maybe we should come up with a term for such an AI; it might prove useful in the study of safe AI, where we can concentrate only on the utility function and assume the above conditions are true. From now on, let's just use the term "optimizer" in this article. Hm, maybe "honest optimizer"?) Even an AI with the above constraints can be dangerous; an interesting example can be found in the Friendship is Optimal stories.

The question I would like to raise is not what kind of utility function we should come up with, but what kind of language we should define it in.

More specifically, how high-level should the language be? As low as a mathematical function working with quantized qualities based on what values humans consider important? A programming language? Or a complex, syntactic grammar like human languages, capable of expressing abstract concepts? Something which is a step above this?

Just quantizing some human values we find important, and assigning weights to them, can have many problems:

## 1. Overfitting.

A simplified example: imagine the desired behavior of the AI as a function. You come up with a lot of points on this function, and the AI fits a function to those points, hopefully ending up with one very similar to the one you conceived. However, an optimizer can very quickly come up with a function which goes through all of your defined points and yet looks nothing like the one you imagined. I think many of us have encountered this problem when fitting a curve with a polynomial of too high a degree.
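To make the curve-fitting picture concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the "intended" behavior is a smooth bump (the textbook Runge function), the "requirements" are 15 evenly spaced samples of it, and the "optimizer" is a Lagrange interpolating polynomial, which by construction passes through every sample.

```python
def target(x):
    """The behavior we had in mind (an illustrative stand-in)."""
    return 1.0 / (1.0 + 25.0 * x * x)


def lagrange_fit(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through all points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return poly


# The "requirements": 15 evenly spaced samples of the target on [-1, 1].
n_points = 15
xs = [-1.0 + 2.0 * k / (n_points - 1) for k in range(n_points)]
ys = [target(x) for x in xs]
fitted = lagrange_fit(xs, ys)

# Error exactly on the specified points versus error in between them.
error_on_samples = max(abs(fitted(x) - y) for x, y in zip(xs, ys))
grid = [-1.0 + 2.0 * k / 400 for k in range(401)]
error_between_samples = max(abs(fitted(x) - target(x)) for x in grid)

print("max error on the specified points:", error_on_samples)
print("max error between the points:", error_between_samples)
```

The fitted polynomial satisfies every stated requirement, yet near the ends of the interval it swings far away from the target, even though the target itself never leaves the range 0 to 1.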

I guess many of the safe AI problems can be conceptualized as an overfitting problem: the optimizer will exactly fulfill the requirements we programmed into it, but will arbitrarily choose the requirements we didn't specify.
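The same point can be sketched with a toy weighted utility function. The value names, weights, and candidate worlds below are all made up for illustration; the only thing the sketch is meant to show is that a dimension we forgot to specify (here `freedom`) has no influence on the choice at all.

```python
# Hypothetical weights: the designer only thought of two values.
WEIGHTS = {"health": 0.6, "wealth": 0.4}

# Hypothetical candidate outcomes, scored from 0.0 to 1.0 on each dimension.
candidate_worlds = [
    {"name": "balanced", "health": 0.8, "wealth": 0.7, "freedom": 0.9},
    {"name": "managed", "health": 0.9, "wealth": 0.8, "freedom": 0.0},
]


def utility(world):
    """Weighted sum over the specified values only."""
    return sum(weight * world[value] for value, weight in WEIGHTS.items())


# The optimizer picks whatever scores highest on the specified values.
chosen = max(candidate_worlds, key=utility)
print(chosen["name"], "freedom =", chosen["freedom"])
```

The optimizer does exactly what it was told: it prefers the "managed" world because it scores higher on the two specified values, and it is indifferent to what happens on the unspecified one.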

## 2. Changing of human values.

Imagine that someone created an honest optimizer, thought of all the possible pitfalls, designed the utility function and all the constraints very carefully, and created a truly safe AI, which didn't become unfriendly. This AI quickly eliminated illness, poverty, and other major problems humans faced, and created a utopian world. To not let this utopia degenerate into a dystopia over time, it also cares for maintaining it and so it resists any possible change (as any change would detract from its utility function of creating that utopia). Seems nice, doesn't it? Now imagine that this AI was created by someone in the Victorian era, and the created world adhered to the cultural norms, lifestyle, values and morality of that era of British history. And these would never ever change. Would you, with your current ideologies, enjoy living in such a world? Would you think of it as the best of all conceivable worlds?

Now, what if this AI was created by you, in our current era? You sure would know much better than those pesky Victorians, right? We have much better values now, don't we? However, for people living a couple of generations from now, our current ideas and values might become as strange as the Victorian values are to us. Without judging either the Victorian or current values, I think I can safely assume that if a time traveler from the Victorian era arrived in our world, and a time traveler from today got stuck in the Victorian era, both would find it very uncomfortable.

Therefore I would argue that even a safe and friendly AI could have the consequence of forever locking mankind into the values the creator of the AI had (or the generation of the creator had, if the values are defined by a democratic process).

## Summary

We should spend some thought on how we formulate the goals of a safe AI, and what kind of language we should use. I would argue that a low-level language would be very unsafe. We should think of a language which can express abstract concepts but is strict enough to be defined accurately. Low-level languages have the advantage over high-level ones of being very accurate, but they are at a disadvantage when it comes to expressing abstract concepts.

We might even find it useful to take a look at real-life religions, as they tend to last for a very long time, and can carry a core message over many generations of changing cultural norms and values. My point now is not to argue about the virtues or vices of specific real-world religions; I only use them here as a convenient example, strictly from a historical point of view, with no offense intended.

The largest religion in our world has a very simple message as one of its most important core rules: "love other people as yourself". This is a sufficiently abstract concept that both bronze-age shepherds and modern-day computer scientists understand it, and the sentence is probably interpreted not much differently. Now compare it to the religion it originated from, which has orders of magnitude fewer followers, and in its strictest form has very strongly defined rules and regulations, many of which are hard to translate into the modern world. A lot of its experts spend considerable time trying to translate them to the modern world, with questions like "is just pressing a single button on a washing machine considered working?". What about hygiene practices which made sense for nomadic people in the desert: how can they be understood (and applied) by modern people? Concepts expressed in a high-level language can carry their meaning much better across times with changing cultural, social and technical characteristics.

However, a rule like "on a calendar day divisible by seven you are only allowed to walk x steps" is easy to code; even many of our current robots could easily be programmed to follow it. On the other hand, expressing what love is will prove to be much harder, but it will preserve its meaning and intention for much longer.
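As a rough illustration of how little code the low-level rule needs, here is a sketch in Python. The function name and the parameters `day_number`, `steps_taken_today` and `max_steps` are hypothetical; in particular, `max_steps` stands in for the unspecified "x" in the rule.

```python
def may_take_step(day_number, steps_taken_today, max_steps):
    """Return True if one more step is allowed under the rule.

    day_number: running count of calendar days (hypothetical convention).
    max_steps: the step limit "x" on restricted days (value left open).
    """
    if day_number % 7 != 0:
        # Not a restricted day: the rule says nothing, so walking is allowed.
        return True
    return steps_taken_today < max_steps


print(may_take_step(14, 99, 100))
print(may_take_step(14, 100, 100))
print(may_take_step(15, 5000, 100))
```

The rule fits in a handful of lines, but nothing in it records why the rule exists, so it cannot adapt when circumstances change. No comparably short function for "love other people as yourself" suggests itself.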

Determining the language to use is a classic case of premature optimization. No matter what the case, it will have to be provably free of ambiguities, which leaves us programming languages. In addition, in terms of the math of FAI, we're still at the "is this Turing complete" sort of stage in development. So it doesn't really matter yet. I guess one consideration is that the algorithm design is going to take way more time and effort than the programming, and the program has essentially no room for bugs (Corrigibility is an effort to make it easier to test an AI without it resisting). So in that sense, it could be argued that the lower level the language, the better.

Directly programming human values into an AI has always been the worst option, partially for your reason. In addition, the religious concept you gave can be trivially broken by two different beings having different or conflicting utility functions, and so acting as if they were the same is a bad outcome. A better option is to construct a scheme so that the smarter the AI gets, the better it approximates human values, by using its own intelligence to determine them, as in coherent extrapolated volition.

Besides programming languages, we have Lojban. It is free of ambiguities at the grammar level, and there are people trying to create a script programming language in Lojban.

Regarding "changing of human values".

There are two basic theories of moral progress (although it is possible to form mixtures):

1. Morals change over time in some random or arbitrary fashion, without any objective "improvement". The only reason morals seem to improve over time is because along the past timeline, as time progresses it becomes closer to our time.

2. Morals improve over time in some objective sense. That is, the change in morals is the result of better epistemic knowledge and more introspection and deliberation.

If you believe in theory 1 then there is nothing wrong with locking the values. Yes, future generations would have different morals without the lock, so what? I care about my own values, not the values of future generations (that's why they're called my values).

If you believe in theory 2 then the AGI will perform the extrapolation itself (as in extrapolated volition). It will perform it much better than us since it will gather a lot more epistemic knowledge and will be able to understand our brains much better than we understand them.

If theory 2 were correct, the AI would quickly extrapolate to the best possible version, which would be so alien to us that most of us would find it hellish. If it changed us so that we would accept that world, then we would no longer be "us".

This reminds me of the novel Three Worlds Collide. (Wow, I read it quite some time ago and never realized until now that it was originally posted here on LessWrong!)

Humans make first contact with an alien species whose morals are based on eating most of their fully conscious children (it makes sense in context). Of course humans find it most ethical to either genocide them or forcefully change them so that the suffering of children can be stopped... but then we encounter another species which has eliminated all kinds of pain, both emotional and physical, and finds us just as abhorrent as we found the "baby-eaters". And they want to change us into blobs which don't feel any emotions and exist in a blissful state of constant orgy.

Presumably the extrapolated values will also include preference for slow, gradual transitions. Therefore the AI will make the transition gradual.

If you believe that morals improve over time in some objective sense, then you should examine how morals change by the passage of time, figure out which morals are going to win, and adopt those morals immediately.

Yes, the FAI problem is an overfitting problem; you have a vast space of parameters (all the algorithms, if you consider the general case).

No matter how well you specify your values, your description can be hacked if it is expressed in a language which has enough room for different interpretations of the same expression. Even programming languages probably aren't strict enough, since they sometimes have "implementation-specific" or "undefined behavior" expressions. So when you say "high-level", you should mean "high-level as Haskell", not "high-level as natural languages".

And then good luck defining your values, e.g. love, in a programming language.

Has there been any research into the development of languages specifically for this purpose? I would still consider the highest-level programming languages as not high level enough.

Maybe I am wrong, and it may turn out that either the highest-level programming languages are too high for the purpose, or that a language close to natural languages is impossible to create without super-intelligence; either of these could falsify my assumptions. However, is there significant research or at least debate regarding these issues?

EDIT: I'm not here to offer a readily usable solution of any kind. My goal was just to better define some concepts and present some issues, which might already have been presented, but maybe not in this shape, form, or context.

The Urbit system uses an extremely simple virtual machine as its core in order to remove semantic ambiguity, but the aim is more about security.

http://doc.urbit.org/

Declarative languages, maybe? Like Prolog.

People certainly have motivation to create programming languages as high-level as possible, as using such languages reduces development costs. So there are languages which are more or less directly optimized for high-levelness, like Python.

On the other hand, programming language design is limited by the fact that programs written in these languages must actually work on real computers. Also, most effort in creating programming languages is directed to imperative ones, as they are usually more convenient for programming purposes.

But still, programming languages seem to be humanity's best effort at creating rigorous enough languages. There are other approaches, like logical artificial languages (e.g. Loglan), but I think they are still too imprecise for FAI purposes.

I agree; however, I still think that because programming languages were developed and optimized for a different purpose than defining utility functions for AIs, there might be other languages somewhat better suited for the job.

If this line of thinking proves to be a dead end, I would remove some parts of the article and focus on the "changing values" aspect, as this is an issue I don't remember seeing in the FAI debate.

You're asking the wrong question - partly because of confusion over the term 'utility function'.

We want the AI to embody human values through a utility function that is a reasonable approximation to the hypothetical ideal human group utility function that some large organization of humans (or just - humanity) would encompass if they had unbounded amounts of time to reach consensus on the actions the AI takes.

That ideal utility function is - for practical purposes - impossible to define directly or hand engineer - it's far too complex.

To illustrate why, consider the much simpler problem of a narrow AI that just recognizes images - a computer vision system. The vision AI takes an image as an input and then produces an action output. The ideal utility function over (input, output) pairs is again defined by the action a committee of humans would take given enough time. We don't actually hand engineer the decision utility function for vision: again, it's too complex. Instead the best approach is to define the vision system's utility function indirectly, based on labeled examples. Defining the system's goals that way leads to a tractable inference problem with a well-defined optimization criterion.
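A very small sketch of "defining the goal through labeled examples" might look like the following. It is not a real vision system: the four-pixel "images", the labels, and the choice of a perceptron as the learner are all assumptions made for illustration. The point is only that the decision rule is never written by hand; it is inferred from the examples.

```python
# Hypothetical labeled examples: 4-pixel brightness lists, 1 = bright, 0 = dark.
examples = [
    ([0.9, 0.8, 0.9, 1.0], 1),
    ([0.8, 0.9, 0.7, 0.9], 1),
    ([0.1, 0.2, 0.0, 0.1], 0),
    ([0.2, 0.1, 0.2, 0.0], 0),
]

weights = [0.0, 0.0, 0.0, 0.0]
bias = 0.0


def predict(pixels):
    """Classify an image with the current learned weights."""
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1 if score > 0 else 0


# Perceptron training: nudge the weights whenever a labeled example is missed.
for _ in range(10):
    for pixels, label in examples:
        error = label - predict(pixels)
        if error != 0:
            weights = [w + error * p for w, p in zip(weights, pixels)]
            bias += error

print(predict([0.7, 0.8, 0.9, 0.8]))
print(predict([0.1, 0.1, 0.1, 0.1]))
```

The optimization criterion here is simply "agree with the human-provided labels", and the learned rule then generalizes to inputs that were never labeled.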

The same general approach can scale up to more complex AGI systems. To avoid the need for huge hand labeled training datasets, we can use techniques such as inverse reinforcement learning where we first use an inference procedure to recover estimations of human utility functions. Then we can use these recovered utility functions in a general reinforcement learning framework as replacement for a hardwired reward function (as in AIXI).

So, in short, the goals of any complex AGI are unlikely to be explicitly written down in any language - at least not directly. Using the techniques described above, the goals/values come from training data collected from human decisions. The challenge then becomes building a training program that can significantly cover the space of human ethics/morality. Eventually we will be able to do that using virtual reality environments, but there may be even easier techniques involving clever uses of brain imaging.

I can agree with some of your points, but interestingly, many commenters prefer a very rigorously defined utility function, written in the lowest-level language possible, to your heuristically developed one, because they argue that its exact functionality has to be provable.

The types of decision utility functions that we can define precisely for an AI are exactly the kind that we absolutely do not want - namely the class of model-free reward functions. That works for training an agent to play Atari games based on a score function provided by the simulated environment, but it just doesn't scale to the real world, which doesn't come with a convenient predefined utility function.

For AGI, we need a model-based utility function, which maps internal world states to human-relevant utility values. As the utility function is then dependent on the AGI's internal predictive world model, you would then need to rigorously define the AGI's entire world model. That appears to be a rather hopelessly naive dead end. I'm not aware of any progress or research that indicates that approach is viable. Are you?

Instead all current research progress trends strongly indicate that the first practical AGI designs will be based heavily on inferring human values indirectly. Proving safety for alternate designs - even if possible - has little value if those results do not apply to the designs which will actually win the race to superintelligence.

Also - there is a whole math research track in machine learning concerned with provable bounds on loss and prediction accuracy - so it's simply not true that using machine learning techniques to infer human utility functions necessitates 'heuristics' ungrounded in any formal analysis.

Language development seems to me an integral part of building an AGI. The AGI has to reason over concepts and needs to represent them in some way.

Languages like CycL or Common Logic are likely to be nearer to what you are seeking than natlangs.

My wife tells me there are four love languages and I'm not very good at speaking hers.

This so needs to be quad lingual.

According to the standard idiom there are five love languages (I might have gotten the order wrong):

1) Words of Appreciation
2) Acts of service
3) Physical Touch
4) Quality time