This is my first post on this forum, and I am going to address a confusion that I believe some people have, one that may hinder beginners from contributing helpful ideas to the AI safety debate.

Namely, the confusion is the idea that we can choose an AGI's utility function, and that, as a consequence, the biggest problem in AI alignment is figuring out what utility function an AGI should have.

Because I have not seen anyone argue for this explicitly, I am instead going to quote some documents that beginners are expected to read, and argue why someone reading these quotes might become confused as a result.

 

Is it easy to give an intelligent machine a goal?

Here is a passage from MIRI's Four Background Claims:

Regardless of their intelligence level, and regardless of your intentions, computers do exactly what you programmed them to do. If you program an extremely intelligent machine to execute plans that it predicts lead to futures where cancer is cured, then it may be that the shortest path it can find to a cancer-free future entails kidnapping humans for experimentation (and resisting your attempts to alter it, as those would slow it down).

Computers do indeed do exactly as you programmed them to do, but only in the very narrow sense of executing the exact machine instructions you give.

There are clearly no instructions for executing "plans that the machine predicts will lead to futures where some statement X is true" (where X may be "cancer is cured", or something else).

Unless we have fundamental advances in alignment, it will likely not be possible to give orders to intelligent machines like that. I will go even further and question whether, among order-following AGIs, the ones that follow orders literally and dangerously are any easier to build. As far as I can tell, both kinds may be equally difficult to come up with.

If this is so, then such cautionary tales sound a little off. Yes, they might help us recognize how hard it is for us to know precisely what we want. And yes, they might serve to impress people with easy-to-imagine examples of AGIs causing a catastrophe.

But they may also enshrine a mindset in which the AI alignment problem is mostly about finding the 'right' utility function for an AGI to have (one that takes into account corrigibility, impact metrics and so on), when we actually have no evidence this will help.

Here is a similar example from Nick Bostrom's Superintelligence:

There is nothing paradoxical about an AI whose sole final goal is to count the grains of sand on Boracay, or to calculate the decimal expansion of pi, or to maximize the total number of paperclips that will exist in its future light cone. In fact, it would be easier to create an AI with simple goals like these than to build one that had a human-like set of values and dispositions.

Again, I understand the claim that it might be easier to create an AGI with one of these "simple" goals, but it is not at all obvious to me. How exactly is it easier to create a very intelligent machine that really wants to calculate the decimal expansion of pi?

What would be easy is to create an algorithm that calculates the digits of pi without being intelligent or goal-oriented, or perhaps an optimization process that has the calculation of the decimal expansion of pi as its base objective. But do we get any kind of control over what the mesa-optimizer itself really wants?

For example, in reinforcement learning we can assign a "goal" to an agent in a simulated environment. This, together with a learning rule, acts as an optimizer that searches for agents that reach our goals. However, the agents themselves are often either dumb (following fixed policies) or end up optimizing for proxies of the goals we set.
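To make the distinction concrete, here is a minimal sketch (the environment and all names are made up for illustration): a tiny Q-learning loop in which the reward function belongs to the training procedure, not to the learned policy.

```python
import random

N = 5                # 1-D gridworld with states 0..N-1; the "goal" is state N-1
ACTIONS = [-1, +1]   # step left or right

def reward(state):
    # The base objective, as seen only by the training loop: +1 at the goal state.
    return 1.0 if state == N - 1 else 0.0

# The "agent" is just a table of action values shaped by the training signal.
q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

for episode in range(2000):
    s = 0
    for _ in range(20):
        # Epsilon-greedy action selection.
        if random.random() < 0.1:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s_next = min(max(s + a, 0), N - 1)
        r = reward(s_next)
        # Q-learning update: the only place where the base objective ever enters.
        q[(s, a)] += 0.1 * (r + 0.9 * max(q[(s_next, b)] for b in ACTIONS) - q[(s, a)])
        s = s_next

# Nothing in `q` "wants" to reach the goal; it is a lookup table selected by the
# outer optimization process. Whether a more capable learned agent would
# internalize the base objective is exactly the mesa-optimization worry.
```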

Designing an agent with a specific utility function may be even harder if efficient training requires some sort of intrinsic motivation system to enable deeper exploration. Such systems provide extra reward information, which may be especially useful when the external signals are too limited to allow efficient learning on their own. These mechanisms may facilitate learning in the desired direction, but at the cost of creating goals independent of the external reinforcement, and thus not aligned with the base objective.
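As a sketch of how such a mechanism is usually bolted on (the count-based bonus below is just one common illustrative choice, not a claim about how any particular system works): the training signal becomes the sum of the external reward and an intrinsic bonus, so part of what gets reinforced has nothing to do with the base objective.

```python
from collections import Counter

visit_counts = Counter()

def shaped_reward(state, external_reward, bonus_scale=0.1):
    """External reward plus a count-based novelty bonus (illustrative)."""
    visit_counts[state] += 1
    # Bonus is larger for rarely visited states, encouraging exploration.
    intrinsic = bonus_scale / (visit_counts[state] ** 0.5)
    return external_reward + intrinsic
```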

 

Where will an AGI's utility function come from?

Most people think intuitively that an AGI will be fully rational, without biases such as the ones we ourselves have.

According to the von Neumann–Morgenstern utility theorem, a rational decision procedure is equivalent to maximization of the expected value of a utility function. This result, which relies only on very weak assumptions, implies that goal-oriented superintelligent machines will likely maximize utility functions. But if we, the AGI's creators, cannot easily decide what the AGI will want, where will this utility function come from?
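For reference, the theorem says roughly the following: if an agent's preferences over lotteries satisfy completeness, transitivity, continuity, and independence, then there exists a utility function $u$ such that the agent prefers lottery $A=(p_1,a_1;\dots;p_n,a_n)$ to lottery $B=(q_1,b_1;\dots;q_m,b_m)$ exactly when

$$\mathbb{E}[u(A)] = \sum_{i} p_i\, u(a_i) \;>\; \sum_{j} q_j\, u(b_j) = \mathbb{E}[u(B)].$$

The theorem guarantees that a coherent agent behaves as if it maximized some $u$; it says nothing about where $u$ comes from or how we could choose it.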

I argued here that AGIs will actually be created with several biases, and that they will only gradually remove them (become rational) after several cycles of self-modification.

If this is true, then AGIs will likely be created initially with no consistent utility function at all. Rather, they might have a conflicting set of goals and desires that will only eventually become crystallized or enshrined in a utility function.

We might therefore try to focus on what conflicting goals and desires the initial version of a friendly AGI is likely to have, and how to make sure properties important to our well-being get preserved when such a system modifies itself. We might be able to do so without necessarily going as far as trying to predict what a friendly utility function will be.

Comments (18)

On the revised post:

If we have an accurate and interpretable model of the system we are trying to control, then I think we have a fairly good idea about how to make utility maximizers; use the model to figure out the consequences of your actions, describe utility as a function of the consequences, and then pick actions that lead to high utility.
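A minimal sketch of that recipe, assuming a deterministic model, a small discrete action set, and a short horizon (the names here are placeholders, not anyone's actual proposal):

```python
def plan(state, model, utility, actions, horizon=3):
    """Pick the action whose predicted consequences score highest under `utility`.

    `model(state, action)` returns the predicted next state and `utility(state)`
    scores a state. Exhaustive depth-limited search, so only viable for toy settings.
    """
    def value(s, depth):
        if depth == 0:
            return utility(s)
        return max(value(model(s, a), depth - 1) for a in actions)

    return max(actions, key=lambda a: value(model(state, a), horizon - 1))
```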

Of course this doesn't work for advanced optimization in practice, for many reasons: difficulty in getting a model, difficulty in making it interpretable, difficulty in optimizing over a model. But it appears to me that many of the limitations to this or things-substantially-similar-to-this are getting addressed by capabilities research or John Wentworth. Presumably you disagree about this claim, but it's not really clear what aspect of this claim you disagree with, as you don't really go into detail about what you see as the constraints that aren't getting solved.

You really think the difficulty of making an AGI with a fully-human-interpretable world-model "is getting addressed"? (Granted, more than zero progress is being made, but not enough to make me optimistic that it's gonna happen in time for AGI.)

Fully human-interpretable, no, but the interpretation you particularly need for making utility maximizers is to be able to take some small set of human commonsense variables and identify or construct those variables within the AI's world-model. I think this will plausibly take specialized work for each variable you want to add, but I think it can be done (and in particular will get easier as capabilities increase, and as we get better understandings of abstraction).

I don't think we will be able to do it fully automatically, or that this will support all architectures, but it does seem like there are many specific approaches for making it doable. I can't go into huge detail atm as I am on my phone, but I can say more later if you have any questions.

Oh OK, that sounds vaguely similar to the kinds of approaches to AGI safety that I’m thinking about, e.g. here (which you’ve seen—we had a chat in the comments) or upcoming Post #13 in my sequence :) BTW I’d love to call & chat at some point if you have time & interest.

Oh OK, that sounds vaguely similar to the kinds of approaches to AGI safety that I’m thinking about

Cool, yeah, I can also say that my views are partly inspired by your writings. 👍

BTW I’d love to call & chat at some point if you have time & interest.

I'd definitely be interested, can you send me a PM about your availability? I have a fairly flexible schedule, though I live in Europe, so there may be some time zone issues.

Contemporary RL agents can't have goals like counting grains of sand unless there is some measurement specified (e.g. a sensor or a property of a simulation). Specifying goals like that (in a way that works in the real world and is not vulnerable to reward hacking) would require some sort of linguistic translation interface. But then such an interface could be used to specify goals like "count grains of sand without causing any overly harmful side effects", or just "do what is good". Maybe these goals are less likely to be properly translated on account of being vague or philosophical, but it's pretty unclear at which point the difficulties would show up.

I would expect goals to be specified in code, using variables in the AI's worldmodel that have specifically been engineered to have good representations. For instance if the worldmodel is a handcoded physics simulation, there will likely be a data structure that contains information about the number of grains of sand.
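A toy illustration of what that could look like (the data structure and names are hypothetical, just to show a goal written against an explicit worldmodel variable):

```python
from dataclasses import dataclass

@dataclass
class BeachState:
    sand_grain_count: int   # engineered to exist explicitly in the worldmodel
    tide_level: float

def counting_utility(state: BeachState) -> float:
    # The goal refers directly to a named variable in the handcoded worldmodel;
    # no linguistic translation interface is involved.
    return float(state.sand_grain_count)
```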

Of course in practice, we'd want most of the worldmodel to be learned. But this doesn't mean we can't make various choices to make the worldmodel have the variables of interest. (Well, sometimes, depends on the variable; sand seems easier than goodness.)

How would you learn a world model that had sand in it? Plausibly you could find something analogous to the sand of the original physics simulation (i.e. it has a similar transition function etc), but wouldn't that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?

My immediate thought would be to structurally impose a Poincaré-symmetric geometry on the model. This would of course lock out the vast majority of possible architectures, but that seems like an acceptable sacrifice. Locking out most models makes the remaining models more interpretable.

Given this model structure, it would be possible to isolate what stuff is at a given location in the model. It seems like this should make it relatively feasible to science out what variables in the model correspond to sand?

There may be numerous problems with this proposal, e.g. simulating a reductionistic world is totally computationally intractable. And for that matter, this approach hasn't even been tried yet, so maybe there's an unforeseen problem that would break it (I can't test it because the capabilities aren't there yet). I keep an eye out for how things are looking from the capabilities researchers, but they keep surprising me with how little bias you need to get nice abstractions, so it seems to me that this isn't a taut constraint.

wouldn't that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?

Yeah, I mean ultimately anything we do is going to be garbage in, garbage out. Our only hope is to use as weak assumptions as possible while still being usable, to make systems fail fast and safely, and to make them robustly corrigible in case of failure.

Oh, and as I understand John Wentworth's research program, he is basically studying how to robustly and generally solve this problem, so we're less reliant on heuristics. I endorse that as a key component, hence why I mentioned John Wentworth in my original response.

You should look up the machine learning subfield of reinforcement learning.

I disagree with your implication and agree with OP: Inner alignment is not yet solved, therefore we don't know how to make an AGI that is "trying" to do something in particular, and by extension, we don't know how to make an AGI with a particular utility function. I was actually going to comment that OP is not only correct but uncontroversial, at least among experts. (That doesn't mean it's a pointless post, good pedagogy is always welcome.) So I'm surprised and confused by your comment.

That is fair enough, I was just confused and thought OP had not heard about it because there wasn't a hint of it in the post.

I am aware of Reinforcement Learning (I am actually sitting right next to Sutton's book on the field, which I have fully read), but I think you are right that my point is not very clear.

The way I see it, RL goals are really only the goals of the base optimizer. The agents themselves either are not intelligent (they follow simple procedural 'policies') or are mesa-optimizers that may learn to pursue something else entirely (proxies, etc.). I updated the text; let me know if it makes more sense now.

"Computers do indeed do exactly as you programmed them to do, but only on the very narrow sense of executing the exact machine instructions you give.

There are clearly no instructions for executing "plans that the machine predicts will lead to futures where some statement X is true" (where X may be "cancer is cured", or something else)."

What we think is that we might someday build an AI advanced enough that it can, by itself, predict plans for given goal x, and execute them. Is this that otherworldly? Given current progress, I don't think so.

As for reinforcement learning, even if it now seems impossible to build AGIs with specified utility functions in that paradigm, nothing gives us any assurance that it will be the paradigm used to build the first AGI.

Even if the first AGI is indeed unable to be given utility functions, at least it will be able to be told to do something by some other means...? So there will always be a control problem, which is really the crux of the matter. (Unless the first AGI can't be told to do anything at all, but then we would already have lost the control problem.)

What we think is that we might someday build an AI advanced enough that it can, by itself, predict plans for given goal x, and execute them. Is this that otherworldly? Given current progress, I don't think so.

 

I don't think so either. AGIs will likely be capable of understanding what we mean by X and making plans for exactly that, if they want to help. The problem is that the AGIs may have other goals in mind by that time.

As for reinforcement learning, even if it now seems impossible to build AGIs with specified utility functions in that paradigm, nothing gives us any assurance that it will be the paradigm used to build the first AGI.

Sure, it may be possible that some other paradigm allows us more control over the utility functions. User tailcalled mentioned John Wentworth's research (which I will proceed to study, as I haven't yet done so in depth).

(Unless the first AGI can't be told to do anything at all, but then we would already have lost the control problem.)

I'm afraid that this may be quite a likely outcome if we don't make much progress in alignment research.

Regarding what the AGI will want then, I expect it to depend a lot on the training regime and on its internal motivation modules (somewhat analogous to the subcortical areas of the brain). My threat model is quite similar to the one defended by Steven Byrnes in articles such as this one.

In particular, I think the AI developers will likely give the AGI "creativity modules" responsible for generating intrinsic reward whenever it finds interesting patterns or abilities. This will help the AGI stay motivated and keep learning to solve harder and harder problems when outside reward is sparse, which I predict will be extremely useful for making the AGI more capable. But I expect the internalization of such intrinsic rewards to end up generating utility functions that are nearly unbounded in the value assigned to knowledge and computational power, and quite possibly hostile to us.

I don't think all is lost though. Our brains provide an example of a relatively well-aligned intelligence: our own higher reasoning in the telencephalon seems reasonably well aligned with the evolutionarily ancient, primitive subcortical modules (not so much with evolution's base objective of reproduction, though). I'm not sure how much work it took evolution to align these two modules. I've heard at least one person argue that maybe higher intelligence didn't evolve earlier because of the difficulty of aligning it. If so, that would be pretty bad.

Also, I'm somewhat more optimistic than others about the prospect of creating myopic AGIs that strongly crave short-term rewards that we do control. I think it might be possible (with a lot of effort) to keep such an AGI confined to a box even if it is more intelligent than humans in general, and that such an AGI may help us with the overall control problem.

"I'm afraid that this may be quite a likely outcome if we don't make much progress in alignment research."

Ok, I understand your position better now. That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with. Sorry for not quite getting there at first.

That sounds reasonable to me.

Personally I (also?) think that the right "values" and the right training are more important. After all, as Stuart Russell would say, building an advanced agent as a utility maximizer would always produce chaos anyway, since it would tend to push the variables its utility function does not constrain to absurd extremes.

That is, that we shouldn't worry so much about what to tell the genie in the lamp, because we probably won't even have a say to begin with.

 

I think you summarized it quite well, thanks! The idea written like that is clearer than what I wrote, so I'll probably edit the article to include this claim explicitly. This really is what motivated me to write this post in the first place.

Personally I (also?) think that the right "values" and the right training are more important.

You can include the "also"; I agree with you.

Given the current state of confusion regarding this matter, I think we should focus on how values might be shaped by the architecture and training regime, and try to make progress on that even if we don't know exactly what human values are or what utility functions they represent.