This is my first post on this forum, and I am going to address a confusion that I believe some people have, one that may hinder beginners from contributing helpful ideas to the AI safety debate.

Namely, the confusion is the idea that we can choose an AGI's utility function, and that, as a consequence, the biggest problem in AI alignment is figuring out what utility function an AGI should have.

Because I have not seen anyone argue for this explicitly, I am instead going to quote some documents that beginners are expected to read, and argue why someone reading these quotes might become confused as a result.

 

Is it easy to give an intelligent machine a goal?

Here is a passage from MIRI's Four Background Claims:

Regardless of their intelligence level, and regardless of your intentions, computers do exactly what you programmed them to do. If you program an extremely intelligent machine to execute plans that it predicts lead to futures where cancer is cured, then it may be that the shortest path it can find to a cancer-free future entails kidnapping humans for experimentation (and resisting your attempts to alter it, as those would slow it down).

Computers do indeed do exactly as you programmed them to do, but only in the very narrow sense of executing the exact machine instructions you give.

There are clearly no instructions for executing "plans that the machine predicts will lead to futures where some statement X is true" (where X may be "cancer is cured", or something else).

Unless we have fundamental advances in alignment, it will likely not be possible to give orders to intelligent machines like that. I will go even further and question whether, among order-following AGIs, the ones that follow orders literally and dangerously are any easier to build. As far as I can tell, both kinds may be equally difficult to come up with.

If this is so, then such cautionary tales sound a little off. Yes, they might help us recognize how hard it is for us to know precisely what we want. And yes, they might serve to impress people with easy-to-imagine examples of AGIs causing a catastrophe.

But they may also enshrine a mindset in which the AI alignment problem is mostly about finding the 'right' utility function for an AGI to have (one that takes into account corrigibility, impact metrics and so on), when we actually have no evidence this will help.

Here is a similar example from Nick Bostrom's Superintelligence:

There is nothing paradoxical about an AI whose sole final goal is to count the grains of sand on Boracay, or to calculate the decimal expansion of pi, or to maximize the total number of paperclips that will exist in its future light cone. In fact, it would be easier to create an AI with simple goals like these than to build one that had a human-like set of values and dispositions.

Again, I understand the claim that it might be easier to create an AGI with one of these "simple" goals, but it is not at all obvious to me. How exactly is it easier to create a very intelligent machine that really wants to calculate the decimal expansion of pi?

What would be easy is to create an algorithm that calculates the digits of pi without being intelligent or goal-oriented, or perhaps an optimization process that has the calculation of the decimal expansion of pi as its base objective. But do we get any kind of control over what the mesa-optimizer itself really wants?

For example, in reinforcement learning we can assign a "goal" to an agent in a simulated environment. This, together with a learning rule, acts as an optimizer that searches for agents that reach our goals. However, the agents themselves are often either dumb (following fixed policies) or end up optimizing for proxies of the goals we set.
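To make the distinction concrete, here is a minimal sketch (the environment and all names are made up for illustration): a tiny Q-learning loop in which the reward function belongs to the training procedure, not to the learned policy.

```python
import random

N = 5                # 1-D gridworld with states 0..N-1; the "goal" is state N-1
ACTIONS = [-1, +1]   # step left or right

def reward(state):
    # The base objective, as seen only by the training loop: +1 at the goal state.
    return 1.0 if state == N - 1 else 0.0

# The "agent" is just a table of action values shaped by the training signal.
q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

for episode in range(2000):
    s = 0
    for _ in range(20):
        # Epsilon-greedy action selection.
        if random.random() < 0.1:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s_next = min(max(s + a, 0), N - 1)
        r = reward(s_next)
        # Q-learning update: the only place where the base objective ever enters.
        q[(s, a)] += 0.1 * (r + 0.9 * max(q[(s_next, b)] for b in ACTIONS) - q[(s, a)])
        s = s_next

# Nothing in `q` "wants" to reach the goal; it is a lookup table selected by the
# outer optimization process. Whether a more capable learned agent would
# internalize the base objective is exactly the mesa-optimization worry.
```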

Designing an agent with a specific utility function may be even harder if efficient training requires some sort of intrinsic motivation system to enable deeper exploration. Such systems provide extra reward information, which may be especially useful when the external signals are too limited to allow efficient learning on their own. These mechanisms may facilitate learning in the desired direction, but at the cost of creating goals independent of the external reinforcement, and thus not aligned with the base objective.
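As a sketch of how such a mechanism is usually bolted on (the count-based bonus below is just one common illustrative choice, not a claim about how any particular system works): the training signal becomes the sum of the external reward and an intrinsic bonus, so part of what gets reinforced has nothing to do with the base objective.

```python
from collections import Counter

visit_counts = Counter()

def shaped_reward(state, external_reward, bonus_scale=0.1):
    """External reward plus a count-based novelty bonus (illustrative)."""
    visit_counts[state] += 1
    # Bonus is larger for rarely visited states, encouraging exploration.
    intrinsic = bonus_scale / (visit_counts[state] ** 0.5)
    return external_reward + intrinsic
```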

 

Where will an AGI's utility function come from?

Most people think intuitively that an AGI will be fully rational, without biases such as the ones we ourselves have.

According to the von Neumann–Morgenstern utility theorem, a rational decision procedure is equivalent to maximization of the expected value of a utility function. This result, which relies only on very weak assumptions, implies that goal-oriented superintelligent machines will likely maximize utility functions. But if we, the AGI's creators, cannot easily decide what the AGI will want, where will this utility function come from?
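For reference, the theorem says roughly the following: if an agent's preferences over lotteries satisfy completeness, transitivity, continuity, and independence, then there exists a utility function $u$ such that the agent prefers lottery $A=(p_1,a_1;\dots;p_n,a_n)$ to lottery $B=(q_1,b_1;\dots;q_m,b_m)$ exactly when

$$\mathbb{E}[u(A)] = \sum_{i} p_i\, u(a_i) \;>\; \sum_{j} q_j\, u(b_j) = \mathbb{E}[u(B)].$$

The theorem guarantees that a coherent agent behaves as if it maximized some $u$; it says nothing about where $u$ comes from or how we could choose it.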

I argued here that AGIs will actually be created with several biases, and that they will only gradually remove them (become rational) after several cycles of self-modification.

If this is true, then AGIs will likely be created initially with no consistent utility function at all. Rather, they might have a conflicting set of goals and desires that will only eventually become crystallized or enshrined in a utility function.

We might therefore try to focus on what conflicting goals and desires the initial version of a friendly AGI is likely to have, and how to make sure properties important to our well-being get preserved when such a system modifies itself. We might be able to do so without necessarily going as far as trying to predict what a friendly utility function will be.

Comments (18)

On the revised post:

If we have an accurate and interpretable model of the system we are trying to control, then I think we have a fairly good idea about how to make utility maximizers; use the model to figure out the consequences of your actions, describe utility as a function of the consequences, and then pick actions that lead to high utility.
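A minimal sketch of that recipe, assuming a deterministic model, a small discrete action set, and a short horizon (the names here are placeholders, not anyone's actual proposal):

```python
def plan(state, model, utility, actions, horizon=3):
    """Pick the action whose predicted consequences score highest under `utility`.

    `model(state, action)` returns the predicted next state and `utility(state)`
    scores a state. Exhaustive depth-limited search, so only viable for toy settings.
    """
    def value(s, depth):
        if depth == 0:
            return utility(s)
        return max(value(model(s, a), depth - 1) for a in actions)

    return max(actions, key=lambda a: value(model(state, a), horizon - 1))
```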

Of course this doesn't work for advanced optimization in practice, for many reasons: difficulty in getting a model, difficulty in making it interpretable, difficulty in optimizing over a model. But it appears to me that many of the limitations to this or things-substantially-similar-to-this are getting addressed by capabilities research or John Wentworth. Presumably you disagree about this claim, but it's not really clear what aspect of this claim you disagree with, as you don't really go into detail about what you see as the constraints that aren't getting solved.

You really think the difficulty of making an AGI with a fully-human-interpretable world-model "is getting addressed"? (Granted, more than zero progress is being made, but not enough to make me optimistic that it's gonna happen in time for AGI.)

Fully human-interpretable, no, but the interpretation you particularly need for making utility maximizers is to be able to take some small set of human commonsense variables and identify or construct those variables within the AI's world-model. I think this will plausibly take specialized work for each variable you want to add, but I think it can be done (and in particular will get easier as capabilities increase, and as we get better understandings of abstraction).

I don't think we will be able to do it fully automatically, or that this will support all architectures, but it does seem like there are many specific approaches for making it doable. I can't go into huge detail atm as I am on my phone, but I can say more later if you have any questions.

Oh OK, that sounds vaguely similar to the kinds of approaches to AGI safety that I’m thinking about, e.g. here (which you’ve seen—we had a chat in the comments) or upcoming Post #13 in my sequence :) BTW I’d love to call & chat at some point if you have time & interest.

Oh OK, that sounds vaguely similar to the kinds of approaches to AGI safety that I’m thinking about

Cool, yeah, I can also say that my views are partly inspired by your writings. 👍

BTW I’d love to call & chat at some point if you have time & interest.

I'd definitely be interested, can you send me a PM about your availability? I have a fairly flexible schedule, though I live in Europe, so there may be some time zone issues.

Contemporary RL agents can't have goals like counting grains of sand unless there is some measurement specified (e.g. a sensor or a property of a simulation). Specifying goals like that (in a way that works in the real world and is not vulnerable to reward hacking) would require some sort of linguistic translation interface. But then such an interface could be used to specify goals like "count grains of sand without causing any overly harmful side effects", or just "do what is good". Maybe these goals are less likely to be properly translated on account of being vague or philosophical, but it's pretty unclear at which point the difficulties would show up.

I would expect goals to be specified in code, using variables in the AI's worldmodel that have specifically been engineered to have good representations. For instance if the worldmodel is a handcoded physics simulation, there will likely be a data structure that contains information about the number of grains of sand.
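A toy illustration of what that could look like (the data structure and names are hypothetical, just to show a goal written against an explicit worldmodel variable):

```python
from dataclasses import dataclass

@dataclass
class BeachState:
    sand_grain_count: int   # engineered to exist explicitly in the worldmodel
    tide_level: float

def counting_utility(state: BeachState) -> float:
    # The goal refers directly to a named variable in the handcoded worldmodel;
    # no linguistic translation interface is involved.
    return float(state.sand_grain_count)
```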

Of course in practice, we'd want most of the worldmodel to be learned. But this doesn't mean we can't make various choices to make the worldmodel have the variables of interest. (Well, sometimes, depends on the variable; sand seems easier than goodness.)

How would you learn a world model that had sand in it? Plausibly you could find something analogous to the sand of the original physics simulation (i.e. it has a similar transition function etc), but wouldn't that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?

My immediate thought would be to structurally impose a Poincaré-symmetric geometry on the model. This would of course lock out the vast majority of possible architectures, but that seems like an acceptable sacrifice. Locking out most models makes the remaining models more interpretable.

Given this model structure, it would be possible to isolate what stuff is at a given location in the model. It seems like this should make it relatively feasible to science out what variables in the model correspond to sand?

There may be numerous problems with this proposal, e.g. simulating a reductionistic world is totally computationally intractable. And for that matter, this approach hasn't even been tried yet, so maybe there's an unforeseen problem that would break it (I can't test it because the capabilities aren't there yet). I keep an eye out for how things are looking from the capabilities researchers, but they keep surprising me with how little bias you need to get nice abstractions, so it seems to me that this isn't a taut constraint.

wouldn't that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?

Yeah, I mean ultimately anything we do is going to be garbage in, garbage out. Our only hope is to use as weak assumptions as possible while still being usable, to make systems fail fast and safely, and to make them robustly corrigible in case of failure.

Oh, and as I understand John Wentworth's research program, he is basically studying how to robustly and generally solve this problem, so we're less reliant on heuristics. I endorse that as a key component, hence why I mentioned John Wentworth in my original response.

You should look up the machine learning subfield of reinforcement learning.

I disagree with your implication and agree with OP: Inner alignment is not yet solved, therefore we don't know how to make an AGI that is "trying" to do something in particular, and by extension, we don't know how to make an AGI with a particular utility function. I was actually going to comment that OP is not only correct but uncontroversial, at least among experts. (That doesn't mean it's a pointless post, good pedagogy is always welcome.) So I'm surprised and confused by your comment.

That is fair enough, I was just confused and thought OP had not heard about it because there wasn't a hint of it in the post.

I am aware of Reinforcement Learning (I am actually sitting right next to Sutton's book on the field, which I have fully read), but I think you are right that my point is not very clear.

The way I see it, RL goals are really only the goals of the base optimizer. The agents themselves either are not intelligent (they follow simple procedural 'policies') or are mesa-optimizers that may learn to pursue something else entirely (proxies, etc.). I updated the text; let me know if it makes more sense now.

"Computers do indeed do exactly as you programmed them to do, but only on the very narrow sense of executing the exact machine instructions you give.

There are clearly no instructions for executing "plans that the machine predicts will lead to futures where some statement X is true" (where X may be "cancer is cured", or something else)."

What we think is that we might someday build an AI advanced enough that it can, by itself, predict plans for given goal x, and execute them. Is this that otherworldly? Given current progress, I don't think so.

As for reinforcement learning, even if it now seems impossible to build AGIs with specified utility functions in that paradigm, nothing gives us any assurance that it will be the paradigm used to build the first AGI.

Even if the first AGI is indeed unable to be given utility functions, at least it will be able to be told to do something by some other means...? So there will always be a control problem, which is really the crux of the matter. (Unless the first AGI can't be told to do anything at all, but then we would already have lost the control problem.)

What we think is that we might someday build an AI advanced enough that it can, by itself, predict plans for given goal x, and execute them. Is this that otherworldly? Given current progress, I don't think so.

 

I don't think so either. AGIs will likely be capable of understanding what we mean by X and making plans for exactly that, if they want to help. The problem is that the AGIs may have other goals in mind by that time.

As for reinforcement learning, even if it now seems impossible to build AGIs with specified utility functions in that paradigm, nothing gives us any assurance that it will be the paradigm used to build the first AGI.

Sure, it may be possible that some other paradigm allows us more control over the utility functions. User tailcalled mentioned John Wentworth's research (which I will proceed to study, as I haven't yet done so in depth).

(Unless the first AGI can't be told to do anything at all, but then we would already have lost the control problem.)

I'm afraid that this may be quite a likely outcome if we don't make much progress in alignment research.

Regarding what the AGI will want then, I expect it to depend a lot on the training regime and on its internal motivation modules (somewhat analogous to the subcortical areas of the brain). My threat model is quite similar to the one defended by Steven Byrnes in articles such as this one.

In particular, I think the AI developers will likely give the AGI "creativity modules" responsible for generating intrinsic reward whenever it finds interesting patterns or abilities. This will help the AGI stay motivated and keep learning to solve harder and harder problems when outside reward is sparse, which I predict will be extremely useful for making the AGI more capable. But I expect the internalization of such intrinsic rewards to end up generating utility functions that are nearly unbounded in the value assigned to knowledge and computational power, and quite possibly hostile to us.

I don't think all is lost though. Our brains provide an example of a relatively well-aligned intelligence: our own higher reasoning in the telencephalon seems reasonably well aligned with the evolutionarily ancient, primitive subcortical modules (not so much with evolution's base objective of reproduction, though). I'm not sure how much work it took evolution to align these two modules. I've heard at least one person argue that maybe higher intelligence didn't evolve earlier because of the difficulty of aligning it. If so, that would be pretty bad.

Also, I'm somewhat more optimistic than others about the prospect of creating myopic AGIs that strongly crave short-term rewards that we do control. I think it might be possible (with a lot of effort) to keep such an AGI confined to a box even if it is more intelligent than humans in general, and that such an AGI may help us with the overall control problem.

"I'm afraid that this may be quite a likely outcome if we don't make much progress in alignment research."

Ok, I understand your position better now. That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with. Sorry for not quite getting there at first.

That sounds reasonable to me.

Personally I (also?) think that the right "values" and the right training are more important. After all, as Stuart Russell would say, building an advanced agent as a utility maximizer would always produce chaos anyway, since it would tend to push the variables its utility function does not constrain to absurd extremes.

That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with.

 

I think you summarized it quite well, thanks! The idea written like that is clearer than what I wrote, so I'll probably edit the article to include this claim explicitly. This really is what motivated me to write this post in the first place.

Personally I (also?) think that the right "values" and the right training are more important.

You can include the "also"; I agree with you.

Given the current state of confusion regarding this matter, I think we should focus on how values might be shaped by the architecture and training regime, and try to make progress on that even if we don't know exactly what human values are or what utility functions they represent.