I can see how my last comment may have made it seem like I thought some terminal goals should be protected just because they are terminal goals. However, when I said that Gandhi's anti-murder goal and the egoist's self-indulgence goal might have distinct features that not all terminal goals share, I only meant that we need a definition of terminal goals broad enough to capture all varieties of them. I didn't mean to imply anything about the relevance of any potential differences between types of terminal goals. I would not assume that whatever distinguishes an egoist's goal of self-indulgence from an AI's goal of destroying buildings implies that the egoist should protect his terminal goal even if the AI need not. In fact, I doubt it does.
Imagine there are two people. One is named Ally. She's an altruist with a terminal goal of treating all interests exactly as her own. The other is named Egon. He is an egoist with a terminal goal of satisfying only his own interests. Also in the mix is an AI with a terminal goal to destroy buildings. Ally and Egon may have a different sort of relationship to their terminal goals than the AI has to its terminal goal, but if you said, "Ally and Egon should both protect their respective terminal goals," I would need an explanation for this, and I doubt I would agree with whatever that explanation is.
Do you think that something being a terminal goal is in itself a reason to keep that goal? And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?
Ha, yes, fair enough
"When an entity 'cares about X like ghandi cares about avoiding murder' or 'cares about X like a pure egoist cares about his own pleasure' I would call that 'having X as terminal goal.'"
I think I would agree with this, unless you would also claim that "caring about X like a pure egoist cares about his own pleasure" is the only way of having a terminal goal. I would define a terminal goal more broadly as a non-instrumental goal: a goal pursued for its own sake, not for the sake of anything else. The way a pure egoist cares about his own pleasure might have particular features that some non-instrumental goals lack, but I would still say those latter non-instrumental goals are terminal goals.
"Is the claim that you think there is a constraint on X where X needs to be justified on moral realism grounds and is thus guaranteed to not be in conflict with human values?"
No, the paper does not assume moral realism. The point about moral realism in the paper is just this: an agent believing that bringing about X is wrong might have a reason not to change their goals in a way that will cause them to later do X, but the instrumental convergence thesis doesn't assume moral realism, so arguments in favor of goal preservation can't assume moral realism either.
I agree that even if moral realism is true, a pure egoist might want to stay a pure egoist.
"I don't think it is obviously true that the space of things you can care about like Ghandi cares about murder is very large. I think arguments that oppose the orthogonality thesis are almost always about this kind of "caring about X" rather than about the more shallow kind of goals you are talking about. I don't buy these arguments but I think this is where the reasonable disagreement is and redefining "terminal goal" to mean sth weaker than "cares about X like Ghandi cares about murder" is not helpful."
This part makes me think you are adopting a more restrictive notion of terminal goals than I would. What's wrong with non-instrumental goals as the definition of a terminal goal? One reason for adopting the broader definition is that we don't know what a superintelligence will be like, so we don't want to assume it will care about things in a human-like way.
"Any insight into how to build AIs that don't care about anything in the same way that Gandhi cares about murder?"
I haven't thought about how to create a system that has what you call "shallow" goals. It just seems to me that non-instrumental goals can, in principle, take this "shallow" form, especially for agents who (by stipulation) might not have hedonic sensations.
Maybe this is still B, in which case I might have interpreted it more strictly than you intended.
They could be using their current goal to evaluate the future, but include in that future the fact that they won't have that goal. This doesn't require excluding the goal from their analysis altogether. It's just that they judge that the failure of this goal is irrelevant in a future in which they don't have the goal.
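In case it helps, here is a toy sketch of that evaluation pattern (my own illustration, with hypothetical names, not anything from the paper): the agent scores a candidate future using its current goal, but registers that the goal itself is absent in that future, so the goal's going unsatisfied there counts for nothing rather than as a failure.

```python
def evaluate_future(current_goal, future):
    """Score a candidate future from the standpoint of the agent's current goal."""
    if current_goal not in future["goals_held"]:
        # The agent foresees that it will no longer hold this goal, so the goal's
        # non-satisfaction in that future is treated as irrelevant, not as a failure.
        return 0
    # If the goal is still held there, its satisfaction or frustration counts as usual.
    return 1 if future["goal_satisfied"] else -1

# Two candidate futures: preserve the goal (and satisfy it) vs. abandon it.
preserve = {"goals_held": {"destroy_buildings"}, "goal_satisfied": True}
abandon = {"goals_held": set(), "goal_satisfied": False}

print(evaluate_future("destroy_buildings", preserve))  # 1
print(evaluate_future("destroy_buildings", abandon))   # 0, not -1
```

The contrast is just in how the abandonment future is scored: not as a failure (0 rather than -1), since the goal isn't held there.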
It just occurred to me that since you implied that ends-rationality would make goal abandonment less likely, you might be using it differently than I am, to refer to terminal goals. The paper assumes an AI will have terminal goals, just as humans do, and that these terminal goals are what can be abandoned. Ends-rationality provides one route to abandoning terminal goals; the paper's argument is that goal abandonment is also possible without this route.
The paper argues that the number of failures in 2 (goal abandonment) is also 0, because once she abandons the goal, it is no longer her goal. She fails by "the goal" but never fails by "her goal." Cake isn't the best case for illustrating this; the argument is in 3.4 and 3.5.
The instrumental convergence thesis doesn't depend on being applied to a digital agent. It's supposed to apply to all rational agents. So, for this paper, there's no reason to assume the goal takes the form of code written into a system.
There may be a way to lock an AI agent into a certain pattern of behaviour or a goal that it can't revise, by writing code in the right way. But if an AI keeps its goal because it can't change its goal, that has nothing to do with the instrumental convergence thesis.
If an agent can change its goal through self-modification, the instrumental convergence thesis could be relevant. In that case, though, I'd argue the agent does not behave in an instrumentally irrational way if it modifies itself to abandon its goal.
The paper doesn't take a stance on whether humans are ends-rational. If we are, this could sometimes lead us to question our goals and abandon them. For instance, a human might have a terminal goal to have consistent values, then later decide consistency doesn't matter in itself and abandon that terminal goal and adopt inconsistent values. The paper assumes a superintelligence won't be ends-rational since the orthogonality thesis is typically paired with the instrumental convergence thesis, and since it's trivial to show that ends-rationality could lead to goal change.
In this paper, a relevant difference between humans and an AI is that an AI might not have well-being. Imagine there is one human left on earth. The human has a goal to have consistent values, then abandons that goal and adopts inconsistent values. The paper's argument is the human hasn't behaved in an instrumentally irrational way. The same would be true for an AI that abandons a goal to have consistent values.
This potential difference between humans and AIs (humans having well-being and AIs lacking it) becomes relevant when goal preservation or goal abandonment affects well-being. If having consistent values improves the hypothetical human's well-being, and the human abandons the goal of having consistent values and then adopts inconsistent values, the human's well-being has declined. With respect to prudential value, the human has made a mistake.
If an AI does not have well-being, abandoning a goal can't lead to a well-being-reducing mistake, so it lacks this separate reason to goal preserve. An AI might have well-being, in which case it might have well-being-based reasons to goal preserve or goal abandon. The argument in this paper assumes a hypothetical superintelligence without well-being, since the instrumental convergence thesis is meant to apply to those too.
I don't assume A or B. The argument is not about what maximally satisfies an agent. Goal abandonment need not satisfy anything. The point is just that goal abandonment does not dissatisfy anything.
"It's plausible that AIs will have self-preserving preferences (e.g. like E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don't have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V even slightly wrong, a powerful AI might conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover."
This strikes me as plausible. The paper has a narrow target. It's arguing against the instrumental convergence argument for goal preservation. It argues that we shouldn't expect an AI to preserve its goal on the basis of instrumental rationality alone. However, instrumental goal preservation could be false, yet there could be other reasons to believe a superintelligence would preserve its goals. You're making that kind of case here without appealing to instrumental convergence.
The drawback to this sort of argument is that it has a narrower scope and relies on more assumptions than Omohundro and Bostrom might prefer. The purpose of the instrumental convergence thesis is to tell us something about any likely superintelligence, even one radically different from anything we know, including today's AIs. The argument here is a strong one, but only if we think a superintelligence will not be a totally alien creature. Maybe it won't be, but again, the instrumental convergence thesis isn't supposed to rest on that assumption.
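For what it's worth, the contrast between the quoted preference structure and the kind of agent the paper considers can be written out roughly as follows. The first line just restates the expression from your comment; the second is my own rough gloss, not notation from the paper.

```latex
% Quoted self-preserving preferences: every future state s_t is scored by the
% agent's current value function V_{t_0}, so a future in which V changes still
% loses points whenever V_{t_0} goes unsatisfied, which makes preserving V_{t_0}
% instrumentally useful.
U_{\text{self-preserving}}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} V_{t_0}(s_t)\right]

% Rough gloss on the alternative: each future state is scored by whatever value
% function the agent holds at that time, so abandoning V_{t_0} carries no penalty
% of this kind.
U_{\text{alternative}}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t} V_{t}(s_t)\right]
```

An agent with the first structure has the self-preservation pressure you describe; the paper's narrower target is the claim that instrumental rationality alone pushes an agent into something like that first structure.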