A Timing Problem for Instrumental Convergence

by rhys southan
30th Jul 2025
1 min read

This is a linkpost for https://link.springer.com/article/10.1007/s11098-025-02370-4

This paper of mine ("A Timing Problem for Instrumental Convergence"), co-authored with Helena Ward and Jen Semler, was recently accepted in Philosophical Studies for a superintelligent robots issue (open access). The paper argues that instrumental rationality doesn't require goal preservation/goal-content integrity/goal stability. Here is the abstract:

Those who worry about a superintelligent AI destroying humanity often appeal to the instrumental convergence thesis—the claim that even if we don’t know what a superintelligence’s ultimate goals will be, we can expect it to pursue various instrumental goals which are useful for achieving most ends. In this paper, we argue that one of these proposed goals is mistaken. We argue that instrumental goal preservation—the claim that a rational agent will tend to preserve its goals because that makes it better at achieving its goals—is false on the basis of the timing problem: an agent which abandons or otherwise changes its goal does not thereby fail to take a required means for achieving a goal it has. Our argument draws on the distinction between means-rationality (adopting suitable means to achieve an end) and ends-rationality (choosing one’s ends based on reasons). Because proponents of the instrumental convergence thesis are concerned with means-rationality, we argue, they cannot avoid the timing problem. After defending our argument against several objections, we conclude by considering the implications our argument has for the rest of the instrumental convergence thesis and for AI safety more generally.

44 comments, sorted by top scoring
[-]Fabien Roger18d72

I don't understand how this is something else than a debate over words.

When an entity "cares about X like Gandhi cares about avoiding murder" or "cares about X like a pure egoist cares about his own pleasure", I would call that "having X as a terminal goal".[1] Happy to avoid this use of "goal" for the purpose of this conversation, but I don't understand why you think it is a bad way to talk about things or why it changes any of the argument about instrumental convergence.

The kind of entities I claim we should be afraid of is the kind of entities that terminally want X in the same way that Gandhi wants to avoid murder or in the same way that a pure egoist wants to pursue his own pleasure at the expense of others, where X is something that is not compatible with human values.

Is the claim that you think there is a constraint on X where X needs to be justified on moral realism grounds and is thus guaranteed not to be in conflict with human values? That looks implausible to me even granting moral realism: I think it is possible to be a pure egoist and to only terminally care about your own pleasure in a way that makes you want to avoid modifications that make you more altruistic, but that doesn't look justified on moral realist grounds. (More generally, I think the space of things you could care about in the same way that Gandhi cares about avoiding murder is very large, roughly as large as the space of instrumental goals.)

I don't think it is obviously true that the space of things you can care about like Gandhi cares about murder is very large. I think arguments that oppose the orthogonality thesis are almost always about this kind of "caring about X" rather than about the more shallow kind of goals you are talking about. I don't buy these arguments, but I think this is where the reasonable disagreement is, and redefining "terminal goal" to mean sth weaker than "cares about X like Gandhi cares about murder" is not helpful.

It might be possible to create AIs that only care about X in the more shallow sense that you are describing in the paper and I agree it would be safer, but I don't think it will be easy to avoid creating agents that care about X in the same way that Gandhi wants to avoid murder. When you chat with current AIs, it looks to me that to the extent they care about things, they care about them in the same way that Gandhi cares about murder (see e.g. the alignment faking paper). Any insight into how to build AIs that don't care about anything in the same way that Gandhi cares about murder?

  1. ^

    Maybe Gandhi cares about murder because of C, and cares about C because of B, ... Eventually that bottoms out in axioms A (e.g. the golden rule), and I would call those his terminal goals. This does not matter for the purpose of the conversation, since Gandhi would probably also resist having his axioms modified. The case of the pure egoist is a bit clearer, since I think it is possible to have pure egoists who care about their own pleasure without further justification.

Reply
[-]Fabien Roger5d40

One thing I forgot to mention is that there are reasons to expect "we are likely to build smart consequentialists (that e.g. maximize E[sum_t V_t0(s_t)])" to be true that are stronger than "look at current AIs" / "this is roughly aligned with commercial incentives", such as the ones described by Evan Hubinger here.

TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.

Reply
[-]Cleo Nardo4d20

Carlsmith is a good review of the "Will AIs be smart consequentialists?" arguments up to late 2023. I think the conversation has progressed a little since then, but not massively.

Reply
[-]rhys southan18d30

"When an entity 'cares about X like ghandi cares about avoiding murder' or 'cares about X like a pure egoist cares about his own pleasure' I would call that 'having X as terminal goal.'"

I think I would agree with this, unless you would also claim that "caring about X like a pure egoist cares about his own pleasure" is the only way of having a terminal goal. I would define a terminal goal more broadly as a non-instrumental goal: a goal pursued for its own sake, not for anything else. How a pure egoist cares about his own pleasure might have particular features that some non-instrumental goals might not have. I would still say these latter types of non-instrumental goals are terminal goals.

"Is the claim that you think there is a constraint on X where X needs to be justified on moral realism grounds and is thus guaranteed to not be in conflict with human values?"

No, the paper does not assume moral realism. The point about moral realism in the paper is just this: an agent believing that bringing about X is wrong might have a reason not to change their goals in a way that will cause them to later do X, but the instrumental convergence thesis doesn't assume moral realism, so arguments in favor of goal preservation can't assume moral realism either. 

I agree that even if moral realism is true, a pure egoist might want to stay a pure egoist. 

"I don't think it is obviously true that the space of things you can care about like Ghandi cares about murder is very large. I think arguments that oppose the orthogonality thesis are almost always about this kind of "caring about X" rather than about the more shallow kind of goals you are talking about. I don't buy these arguments but I think this is where the reasonable disagreement is and redefining "terminal goal" to mean sth weaker than "cares about X like Ghandi cares about murder" is not helpful."

This part makes me think you are adopting a more restrictive notion of terminal goals than I would. What's wrong with non-instrumental goals as the definition of a terminal goal? One reason for adopting the broader definition is that we don't know what a superintelligence will be like, so we don't want to assume it will care about things in a human-like way. 

"Any insight into how to build AIs that don't care about anything in the same way that Gandhi cares about murder?"

I haven't thought about how to create a system that has what you call "shallow" goals. It just seems to me that non-instrumental goals can, in principle, take this "shallow" form, especially for agents who (by stipulation) might not have hedonic sensations. 

Reply
[-]Fabien Roger17d20

I think we mostly agree then!

To make sure I understand your stance:

  • You agree that some sorts of terminal goals (like Gandhi's or the egoist's) imply you should protect them (e.g. a preference to maximize E[sum_t V_t0(s_t)])
  • You agree that it's plausible AIs might have this sort of self-preserving terminal goal, that these goals may be misaligned with human values, and that the arguments for instrumental self-preservation do apply to those AIs
  • You think that the strength of arguments for instrumental self-preservation is overrated because of the possibility of building AIs that don't have self-preserving terminal goals
    • You'd prefer if people talked about "self-preserving terminal goals" or sth more specific when making arguments about instrumental self-preservation, since not all forms of caring / having terminal goals imply self-preservation
  • You don't have a specific proposal to build such AIs - this paper is mostly pointing at a part of the option space for building safer AI systems (which is related to proposals about building myopic AIs, though it's not exactly the same thing)

I think we might still have a big disagreement on what sort of goals AIs are likely to be built by default / if we try to avoid self-preserving terminal goals - but it's mostly a quantitative empirical disagreement.

Reply
[-]Fabien Roger17d20

I'd note that I find quite strange all versions of non-self-preserving terminal goals that I know how to formalize. For example, maximizing E[sum_t V_t(s_t)] does not result in self-preservation; instead, it results in AIs that would like to self-modify immediately to have very-easy-to-achieve goals (if that were possible). I believe people have also tried and so far failed to come up with satisfying formalisms describing AIs that are indifferent to having their goals be modified / to being shut down.
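To make the contrast concrete, here is a minimal toy rollout (an illustration only; the "trivially satisfied" utility, the three-step horizon, and all names are made up, not taken from any paper): an agent that scores futures with its decision-time utility V_t0 keeps working, while an agent that scores each future state with the utility it will hold at that time prefers to self-modify.

```python
# Toy rollout (illustration only; utilities and horizon are made up).
# A "state" is (units_produced, current_utility_fn); the agent can either
# "work" (produce one unit of its original goal) or "self_modify" (swap in a
# utility that is maximally satisfied by any state).

def paperclip_utility(state):
    units, _ = state
    return units

def trivial_utility(state):
    return 10**6  # satisfied no matter what the state is

def step(state, action):
    units, u = state
    if action == "work":
        return (units + 1, u)
    if action == "self_modify":
        return (units, trivial_utility)
    raise ValueError(action)

def rollout(state, first_action, horizon=3):
    """Take first_action, then 'work' for the remaining steps; return the visited states."""
    states = [step(state, first_action)]
    for _ in range(horizon - 1):
        states.append(step(states[-1], "work"))
    return states

def score_with_v_t0(states, v_t0):
    # sum_t V_t0(s_t): every future state is judged by the decision-time utility
    return sum(v_t0(s) for s in states)

def score_with_v_t(states):
    # sum_t V_t(s_t): each future state is judged by the utility held at that time
    return sum(s[1](s) for s in states)

start = (0, paperclip_utility)
for action in ("work", "self_modify"):
    states = rollout(start, action)
    print(action,
          "| V_t0 score:", score_with_v_t0(states, paperclip_utility),
          "| V_t score:", score_with_v_t(states))
# The V_t0 agent prefers "work" (self-modifying costs it future paperclips);
# the V_t agent prefers "self_modify" (every later state then scores 10**6).
```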

Reply
[-]rhys southan16d10

I can see how my last comment may have made it seem like I thought some terminal goals should be protected just because they are terminal goals. However, when I said that Gandhi's anti-murder goal and the egoist's self-indulgence goal might have distinct features that not all terminal goals share, I only meant that we need a broad definition of terminal goals to make sure it captures all varieties of terminal goals. I didn't mean to imply anything about the relevance of any potential differences between types of terminal goals. I would not assume that whatever distinguishes an egoist's goal of self-indulgence from an AI's goal of destroying buildings means the egoist should protect his terminal goal even if an AI might not need to. In fact, I doubt that's the case. 

Imagine there are two people. One is named Ally. She's an altruist with a terminal goal of treating all interests exactly as her own. The other is named Egon. He is an egoist with a terminal goal of satisfying only his own interests. Also in the mix is an AI with a terminal goal to destroy buildings. Ally and Egon may have a different sort of relationship to their terminal goals than the AI has to its terminal goal, but if you said, "Ally and Egon should both protect their respective terminal goals," I would need an explanation for this, and I doubt I would agree with whatever that explanation is. 

Do you think that something being a terminal goal is in itself a reason to keep that goal? And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place? 

Reply
[-]Fabien Roger16d20

And/or do you think that keeping a goal is an aspect of what it means to have a goal in the first place?

I meant sth like that (though weaker, I didn't want to claim all goals are like that), though I don't claim this is a good choice of words. I agree it is natural to speak about goals only to refer to their objects (e.g. building destruction) and not the additional meta-stuff (e.g. do you maximize E[sum_t V_t0(s_t)] or E[sum_t V_t(s_t)] or sth else?). Maybe "terminal preferences" more naturally covers both objects (what you call goals?) and the meta-stuff. (In the message above I was using "terminal goals" to refer to both objects and the meta-stuff.)

I don't know what to call the meta-stuff; it's a bit sad that I don't have a good word for it.

With this clarified wording, I think what I said above holds. For example, if I had to frame the risk from instrumental convergence with the slightly more careful wording I would say "it's plausible that AIs will have self-preserving terminal preferences (e.g. like max E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don't have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V wrong, a powerful AI would likely conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover."

I don't love calling it "self-preserving terminal preferences" though because it feels tautological when in fact self-preserving terminal preferences are natural and don't need to involve any explicit reference to self-preservation in their definition. Maybe there is a better word for it.

Reply
[-]rhys southan16d10

"It's plausible that AIs will have self-preserving preferences (e.g. like E[sum_t V_t0(s_t)]). It is likely we will build such AIs because this is roughly how humans are, we don't have a good plan to build very useful AIs that are not like that, and current AIs seem to be a bit like that. And if this is true, and we get V even slightly wrong, a powerful AI might conclude its values are better pursued if it got more power, which means self-preservation and ultimately takeover."

This strikes me as plausible. The paper has a narrow target. It's arguing against the instrumental convergence argument for goal preservation. It argues that we shouldn't expect an AI to preserve its goal on the basis of instrumental rationality alone. However, instrumental goal preservation could be false, yet there could be other reasons to believe a superintelligence would preserve its goals. You're making that kind of case here without appealing to instrumental convergence. 

The drawback to this sort of argument is that it has a narrower scope and relies on more assumptions than Omohundro and Bostrom might prefer. The purpose of the instrumental convergence thesis is to tell us something about any likely superintelligence, even one that is radically different from anything we know, including AIs of today. The argument here is a strong one, but only if we think a superintelligence will not be a totally alien creature. Maybe it won't be, but again, the instrumental convergence thesis doesn't want to assume that. 

Reply
[-]Petr Kašpárek24d21

Hey Rhys, thanks for posting this and trying to seriously engage with the community!

Unfortunately, either I completely misunderstood your argument, or you completely misunderstood what this community considers a goal. It seems you are considering only an extremely myopic AI. I don't have any background in what is considered a goal in the philosophy that you cite, and the concepts of ends-rationality and the wide-scope view don't fit my concept of a goal.

Let me try to formalize two things: a) your argument and your conception of a goal, and b) what I (and likely a lot of people in the community) might consider a goal.

Your model:

We will use a world model and a utility function calculator and describe how an agent behaves based on these two.

  • World Model W : (state, action) → state. Takes the current state and the agent's action and produces the next world state.
  • Utility function calculator U : (current state, next state) → R. Takes the current state and a next state, and calculates the utility of the next state. (The definition is complicated by the fact that we must account for the utility function changing. Thus we assume that the actual utility function is encoded in the current state and we must first extract it.)
  • The agent chooses its action as action = argmax_a U(state, W(state, a)), i.e., the action that maximizes the current utility function.

Example problem:

Suppose that the current utility function is the number of paperclips. Suppose actions a1 and a2 both produce 10 paperclips; however, a2 also changes the utility function to the number of cakes. Both actions have the same utility (since they both produce the same number of paperclips in the next state), so the agent can take either action and change its goal.

 My model:

  • World Model W : (state, action) → future. A future might be a sequence of all future states (or, even better, a distribution over all sequences of future states).
  • Utility function calculator U : (current state, future) → R. Calculates the aggregate utility. We are not making any assumptions about how the aggregation is done.
  • Again, the agent chooses the action that maximizes utility: action = argmax_a U(state, W(state, a)).

Fixed problem:

Again, let the utility be the total number of paperclips the agent makes over its lifespan. Take the same actions, a1 and a2, both producing 10 paperclips, but with a2 also changing the utility function to the number of cakes. Now the agent cannot choose a2, because it would stop making paperclips in the future and thus a2 has a lower utility.
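For what it's worth, the two models can be put side by side in a few lines of code. This is only a toy paraphrase of the formalization above, with arbitrary numbers and a hypothetical "default" action standing in for whatever the agent does on later steps:

```python
# Toy side-by-side of the myopic and lookahead models (illustration only).
# A state is (paperclips, cakes, utility_name); the current utility is read
# off the state, so an action can change it.

UTILITIES = {
    "paperclips": lambda s: s[0],
    "cakes": lambda s: s[1],
}

def transition(state, action):
    clips, cakes, goal = state
    if action == "a1":       # make 10 paperclips
        return (clips + 10, cakes, goal)
    if action == "a2":       # make 10 paperclips and switch the goal to cakes
        return (clips + 10, cakes, "cakes")
    if action == "default":  # afterwards: pursue whatever goal the state encodes
        if goal == "paperclips":
            return (clips + 10, cakes, goal)
        return (clips, cakes + 10, goal)
    raise ValueError(action)

def myopic_value(state, action):
    """Your model: utility of the single next state, under the current utility."""
    u = UTILITIES[state[2]]
    return u(transition(state, action))

def lookahead_value(state, action, horizon=5):
    """My model: aggregate utility of the whole future, under the current utility."""
    u = UTILITIES[state[2]]
    s, total = transition(state, action), 0
    total += u(s)
    for _ in range(horizon - 1):
        s = transition(s, "default")
        total += u(s)
    return total

start = (0, 0, "paperclips")
for action in ("a1", "a2"):
    print(action, "| myopic:", myopic_value(start, action),
          "| lookahead:", lookahead_value(start, action))
# Myopic: a1 and a2 tie (both yield 10 paperclips next step), so the goal change is allowed.
# Lookahead: a1 keeps producing paperclips; a2 switches to cake-making, which the
# current (paperclip) utility scores poorly, so the agent keeps its goal.
```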

Now a couple of caveats. First, even in my model, the agent might still want to change its utility function, perhaps because it might be turned off if it is found to be a paperclip maximizer. Second, my model is probably not perfect; people who have studied this more closely might have objections. Still, I think it is much closer to what people here might consider a goal. Third, very few people actually expect AI to work like this: a goal will more likely be an emergent property of a complex system, like those in the current deep learning paradigm. But this formalism is a useful tool for reasoning about AI and intelligence.

Let me know if I misunderstood your argument, or if something is unclear in my explanation. 

[This comment is no longer endorsed by its author]
Reply
[-]rhys southan20d10

Petr,

Thanks for this response. Wide-scope and narrow-scope don't determine how a goal is defined. These are different theories about what is rationally required of an agent who has a goal, with respect to their goal. 

I would define a goal as some end that an agent intends to bring about. Is this inconsistent with how many people here would see a goal? Or potentially consistent but underspecified? 

Reply
[-]Petr Kašpárek18d10

As I said, I'm not familiar with the philosophy, concepts, and definitions that you mention. Per my best understanding, the concept of a goal in AI is derived from computer science and decision theory. I imagine people in the early 2000's thought that the goal/utility would be formally specified, defined, and written as code in the system. The only possible way for the system to change the goal would be via self-modification.

Goals in people are something different. Their goals are derived from their values.[1] I think you would say that people are ends-rational. In my opinion, for your line of thought it would be more helpful to think of AI goals as more akin to people's values. Both people's values and AI goals are something fundamental and unchangeable. You might argue that people do change their values sometimes, but what I'm really aiming at are fundamental hard-to-describe beliefs like "I want my values to be consistent."

Overall, I'm actually not really sure how useful this line of investigation into goals is. For example, Dan Hendrycks has a paper on AI risk, where he doesn't assume goal preservation; on the contrary, he talks about goal drift and how it can be dangerous (section 5.2). I suggest you check it out.

  1. ^

    I'm sure there is also a plethora of philosophical debate about what goals (in people) really are and how they are derived. Same for values. 

Reply
[-]rhys southan18d30

The instrumental convergence thesis doesn't depend on being applied to a digital agent. It's supposed to apply to all rational agents. So, for this paper, there's no reason to assume the goal takes the form of code written into a system. 

There may be a way to lock an AI agent into a certain pattern of behaviour or a goal that it can't revise, by writing code in the right way. But if an AI keeps its goal because it can't change its goal, that has nothing to do with the instrumental convergence thesis. 

If an agent can change its goal through self-modification, the instrumental convergence thesis could be relevant. If an agent could change its goal through self-modification, I'd argue the agent does not behave in an instrumentally irrational way if it modifies itself to abandon its goal.

The paper doesn't take a stance on whether humans are ends-rational. If we are, this could sometimes lead us to question our goals and abandon them. For instance, a human might have a terminal goal to have consistent values, then later decide consistency doesn't matter in itself and abandon that terminal goal and adopt inconsistent values. The paper assumes a superintelligence won't be ends-rational since the orthogonality thesis is typically paired with the instrumental convergence thesis, and since it's trivial to show that ends-rationality could lead to goal change.   

In this paper, a relevant difference between humans and an AI is that an AI might not have well-being. Imagine there is one human left on earth. The human has a goal to have consistent values, then abandons that goal and adopts inconsistent values. The paper's argument is the human hasn't behaved in an instrumentally irrational way. The same would be true for an AI that abandons a goal to have consistent values. 

This potential well-being difference between humans and AIs (of humans having well-being and AIs lacking it) becomes relevant when goal preservation or goal abandonment affects well-being. If having consistent values improves the hypothetical human's well-being, and the human abandons this goal of having consistent values and then adopts inconsistent values, the human's well-being has lowered. With respect to prudential value, the human has made a mistake. 

If an AI does not have well-being, abandoning a goal can't lead to a well-being-reducing mistake, so it lacks this separate reason to goal preserve. An AI might have well-being, in which case it might have well-being-based reasons to goal preserve or goal abandon. The argument in this paper assumes a hypothetical superintelligence without well-being, since the instrumental convergence thesis is meant to apply to those too. 

Reply
[-]rhys southan18d10

It just occurred to me that since you implied that ends-rationality would make goal abandonment less likely, you might be using it in a different way than me, to refer to terminal goals. The paper assumes an AI will have terminal goals, just as humans do, and that these terminal goals are what can be abandoned. Ends-rationality provides one route to abandoning terminal goals. The paper's argument is that goal abandonment is also possible without this route. 

Reply
[-]Petr Kašpárek24d10

It seems that your paper is basically describing Theorem 14 of Self-Modification of Policy and Utility Function in Rational Agents by tom4everitt, DanielFilan, Mayank Daswani, and Marcus Hutter, though I haven't read their paper in detail.

Reply
[-]rhys southan20d10

I wasn't aware of this paper before you linked it here, but I looked at it now. I'm not sure how well I follow the Theorem 14 section, but if the title of the theorem ("Hedonistic agents self-modify") is anything to go by, our arguments are different. Our argument is not about hedonistic agents, and we're not claiming that AIs will self-modify. Our point is just that it would not be instrumentally irrational of the AIs to change their goals, if they did. 

Reply
[-]Petr Kašpárek18d10

I'm looking more closely at the Everitt et al. paper and I'm less sure I actually understood you. Everitt et al.'s conclusion is that an agent will resist goal change if it evaluates the future using the current goal. There are two different failure modes: A) not evaluating the future, and B) not using the current goal to evaluate the future. From your conclusions, it would seem that you are assuming A. If you were assuming B, then you would have to conclude that the agent will want to change the goal to always be maximally satisfied. But your language seems to be aiming at B. Either way, it seems that you are assuming one of these.

Reply
[-]rhys southan18d10

I don't assume A or B. The argument is not about what maximally satisfies an agent. Goal abandonment need not satisfy anything. The point is just that goal abandonment does not dissatisfy anything. 

Reply
[-]Petr Kašpárek18d10

Then I don't really understand your argument. 

As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.

Could you explain to me how Ronya does not violate her goal on Monday night? Let me reformulate the goal so it is more formal: Ronya wants to minimize the number of occurrences where she is presented with a cake but does not eat it. As you said, you assume that she evaluates the future with her current goal. She reasons:

  1. Preserve the goal. Tomorrow I will be presented a cake and eat it. Number of failures: 0
  2. Abandon the goal. Tomorrow I will be presented a cake and fail to eat it. Number of failures: 1

Ronya preserves the goal.

Reply
[-]rhys southan18d10

The paper argues that the number of failures in 2 (goal abandonment) is also 0. This is because it is no longer her goal once she abandons it. She fails by "the goal" but never fails by "her goal." Cake isn't the best case for this. The argument for this is in 3.4 and 3.5.
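Here is a toy way to see the two counting rules side by side (the timeline and field names are invented purely for illustration): your count evaluates every declined offer against the goal Ronya holds on Monday night, while the paper's count only registers failures against goals she still holds at the moment of failing.

```python
# Toy tally of the cake example (invented timeline, for illustration only).
# Each step records whether Ronya still holds the cake-eating goal, whether
# cake is offered, and whether she eats it.

timeline_preserve = [
    {"holds_goal": True, "offered": False, "eats": False},  # Monday night
    {"holds_goal": True, "offered": True,  "eats": True},   # Tuesday
]
timeline_abandon = [
    {"holds_goal": True,  "offered": False, "eats": False}, # Monday night: abandons the goal
    {"holds_goal": False, "offered": True,  "eats": False}, # Tuesday: declines the cake
]

def failures_by_decision_time_goal(timeline):
    """Your count: every declined offer is a failure, because the whole future
    is evaluated with the goal Ronya holds when deciding (Monday night)."""
    return sum(1 for t in timeline if t["offered"] and not t["eats"])

def failures_by_goal_held_at_failure(timeline):
    """The paper's count: a declined offer only counts as a failure if the
    goal is still hers at the moment of declining."""
    return sum(1 for t in timeline
               if t["offered"] and not t["eats"] and t["holds_goal"])

for name, tl in [("preserve", timeline_preserve), ("abandon", timeline_abandon)]:
    print(name,
          "| decision-time count:", failures_by_decision_time_goal(tl),
          "| time-of-failure count:", failures_by_goal_held_at_failure(tl))
# preserve: 0 failures on either count.
# abandon: 1 on the decision-time count, 0 on the time-of-failure count.
```

Which of the two counts is the right way to score Monday night's decision is exactly what is in dispute here.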

Reply
[-]Petr Kašpárek18d10

You are clearly assuming B, i.e. not using the current goal to evaluate the future. You even explicitly state it

Means-rationality does not prohibit setting oneself up to fail concerning a goal one currently has but will not have at the moment of failure, as this never causes an agent to fail to achieve the goal that they have at the time of failing to achieve it. 

Reply
[-]rhys southan18d10

They could be using their current goal to evaluate the future, but include in the future that they won't have that goal. This doesn't require excluding the goal from their analysis altogether. It's just that they evaluate that the failure of this goal is irrelevant in a future in which they don't have the goal.

Reply
[-]rhys southan18d10

Maybe this is still B, in which case I might have interpreted it more strictly than you intended. 

Reply
[-]Søren Elverlin18d10

I second Petr's comment: Your definition relates to myopic agents. Consider two utility functions for a paperclip-maximizer:

  1. Myopic paperclip-maximizer: Utility is the number of paperclips in existence right now
  2. Paperclip-maximizer: Utility is the number of paperclips that will eventually exist

A myopic paperclip-maximizer will suffer from the timing problem you described: when faced with an action that creates a greater number of paperclips and also changes the utility function, the myopic maximizer will take this action.

The standard paperclip-maximizer will not. It considers not just the actions it can take right now, but all actions throughout the future. Crucially, it evaluates these actions against its current goal, not against the utility function it would have at that later time.

Reply
[-]Petr Kašpárek18d10

I would add two things.

First, the myopia has to be really extreme. If the agent planned at least two steps ahead, it would be incentivized to keep its current goal. Changing the goal in the first step could make it take a bad second step.[1]

Second, the original argument is about the could, not the would: the possibility of changing the goal, not the necessity. In practice, I would assume a myopic AI would not be very capable, and thus self-modification and changing goals would be far beyond its capabilities.

  1. ^

    There is an exception to this. If the new goal still makes the agent take an optimal action in the second step, it can change to it.

    For example, if the paperclip maximizer has no materials (and due to its myopia can't really plan to obtain any), it can change its goal while it's idling because all actions make zero paperclips.

    A more sophisticated example. Suppose the goal is "make paperclips and don't kill anyone." (If we wanted to frame it as a utility function, we could say: number of paperclips − killed people × a very large number.) Suppose an optimal two-step plan is: 1. obtain materials 2. make paperclips. However, suppose that in the first step the agent changes its goal to just making paperclips. As long as there is no possible action in the second step that makes more paperclips while killing people, the agent will take the same action in the second step even with the changed goal. Thus changing the goal in the first step is also an optimal action.
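    This exception can be checked mechanically. The sketch below is only a toy (the action set, the numbers, and the "lethal_clips" option are invented): with a two-step lookahead, switching to "just make paperclips" is an optimal first step exactly when no available second-step action trades lives for paperclips.

```python
# Toy check of the footnote's exception (illustration only; made-up numbers).

KILL_PENALTY = 10**9

def utility(goal, clips, kills):
    if goal == "clips_no_kill":
        return clips - KILL_PENALTY * kills
    if goal == "clips_only":
        return clips
    raise ValueError(goal)

# Second-step actions available to the agent: (paperclips made, people killed).
SECOND_STEP_ACTIONS = {
    "make_clips": (10, 0),
    "idle": (0, 0),
    # uncomment to break the exception: a lethal but more productive option
    # "lethal_clips": (15, 1),
}

def best_second_action(goal):
    # What the agent will do on step two if it then holds `goal`.
    return max(SECOND_STEP_ACTIONS,
               key=lambda a: utility(goal, *SECOND_STEP_ACTIONS[a]))

# First-step options: keep the goal, or switch to "clips_only".
for new_goal in ("clips_no_kill", "clips_only"):
    act = best_second_action(new_goal)
    clips, kills = SECOND_STEP_ACTIONS[act]
    # The *current* goal (clips_no_kill) scores the resulting future:
    print(new_goal, "->", act,
          "| scored by current goal:", utility("clips_no_kill", clips, kills))
# Both goal choices lead to "make_clips" and score 10 under the current goal,
# so switching goals is also an optimal first step. Adding "lethal_clips"
# changes that: the clips_only agent would pick it, and the current goal
# would score that future at 15 - 10**9.
```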

Reply
[-]rhys southan18d10

The timing problem is not a problem for agents. It's a problem for the claim that goal preservation is instrumentally required for rational agents. The timing problem doesn't force agents to take any particular decision. The argument is that it's not instrumentally irrational for a rational agent to abandon its goal. It isn't about any specific utility functions, and it isn't a prediction about what an agent will do.  

Reply
[-]Søren Elverlin18d10

The timing problem is a problem for how well we can predict the actions of myopic agents: any agent that has a myopic utility function has no instrumentally convergent reason for goal preservation.

Reply
[-]rhys southan18d-10

Have you read the paper?

Reply
[-]Søren Elverlin18d10

I did read 2/3rd of the paper, and I tried my best to understand it, but apparently I failed.

Reply
[-]rhys southan18d00

The reason I suspect you haven't is that whether an agent is "myopic" or not is irrelevant to the argument. Where we may disagree is over the nature of goal having, as Seth Herd pointed out. If you want to find a challenge to the argument, that's the place to look.

Reply
[-]Søren Elverlin18d10

It is possible that we also disagree on the nature of goal having. I reserve the right to find my own places to challenge your argument.

Reply
[-]rhys southan18d10

Ha, yes, fair enough 

Reply
[-]Seth Herd2mo10

This is an important consideration, so if it were false, that would be important. The research I'm working on does have the assumption of goal preservation as an instrumentally convergent subgoal.

Based on your description in the abstract, I don't understand. How could it not be instrumental to preserve any goal, based on timing? Suppose I have a goal right now. Changing it or deliberately allowing it to change would be irrational, because doing so will prevent me from reaching that goal. Right? Of course, if I change my goal by accident, I no longer have the old goal, and am not rationally bothered by that after the accidental change. Is that the timing perspective you're referring to? If so, I don't see the relevance for alignment. If not, what timing are you referring to?

What gives?

Reply
[-]rhys southan2mo10

Yes, the article is about intentional goal change, not accidental goal change. Section 3 of the article addresses objections to the main argument. If you don't want to read the whole article, you could skip to that section. If you don't want to read all of section 3, I would suggest reading section 3.1 for the summary of the main argument, then skip to section 3.4 for the "delay objection" and 3.5 for the "goal-first objection." Those are probably the most relevant subsections for you. 

Reply
[-]Seth Herd2mo10

I hadn't seen this comment when I wrote the above response. But it still stands.

If you want people to read your article on a counterintuitive result, it seems like you really need to put the central argument in the abstract. Otherwise it's pretty reasonable to assume that the central argument is fatally flawed.

The fact that your response doesn't answer the obvious question either makes me even less optimistic about finding that answer if I take the time to read the whole paper.

If there's just a miscommunication issue, in which there is a good answer but it's just not in the abstract or that response, or it is but I'm missing it, I'd like to help you improve the communication of your argument.

Reply
[-]rhys southan2mo10

No need to read the whole paper. Section 3.4 is meant to respond to this sort of objection. You could skip to that. 

Reply
[-]Seth Herd2mo30

As you said in the other comment, reading 3.4 and 3.5 was necessary. That's because the actual argument you're making is quite complex, and describing it as a timing issue seems quite wrong.

I think you're envisioning a goal that is not a goal. You're imagining a goal that is not a desire. That removes the functional property we usually think of a goal as having. If you don't care whether your goal is accomplished, it's not a goal in common parlance.

That is the meat of the argument. It's about the nature of goals when separated from desires. That is relevant to the question of alignment and goals. I suspect you've created a contradictory assumption by thinking of a goal that doesn't have the properties of a desire, but I'm not certain.

If we have called something a goal but also defined it as not having the functional properties of a desire that cause a system to pursue something, how is that remainder still what we'd call a goal? If I tell my boss I have a goal of finishing that report but have no desire to finish it, I'd say it wasn't really much of a goal and more of a lie. Maybe I meant I'd finish it if nothing I actually desire comes up in the meantime; then I'd say it had a little bit of desire attached to it and just not much.

That's not a full analysis and I don't intend to do one right now. I need to get back to my work which very much assumes that it's irrational to knowingly allow one's goal to change. That's assuming that the goal also has the properties of a desire; that there are functional mechanisms that will cause the system to make decisions that are estimated to cause that goal to be accomplished in the future. A decision to abandon that goal would cause it to not be accomplished, so those functional mechanisms would prevent that decision from being made.

There is some interesting stuff there, but the abstract does not point to it accurately, so I maintain that just not reading the article is the sensible response to that abstract. I strongly suggest you change it.

Reply
[-]rhys southan2mo20

I'm glad you got more out of the argument after reading those sections. I agree that "timing problem" is not the best description for the argument. Calling it "the timing problem" was a relic of an earlier version of the argument that was more about timing. After submitting the paper and getting a revise and resubmit from the journal, I got some feedback from my supervisor that made me realize timing wasn't the real issue. So, I changed the argument to make it less timing-based. However, I worried that changing the title of the argument and of the paper for a resubmission might disqualify the paper for further consideration (because it might count as a different paper at that point). Maybe that was overly conservative. I might have risked coming up with a new name for the argument and paper if I thought the name of the argument would deter people from reading the paper. 

As for your critique after reading the sections, you've picked up on the issue I know I need to elaborate on! I'm working on a paper that is precisely about this, and have been since late last year. If you don't mind, I might reach out to you for feedback once I have a finished draft. 

I appreciate the feedback about the abstract, but the abstract is set in stone at this point. 

Reply
[-]Seth Herd2mo30

Ah yes, I suspected that the incentive structure of academic philosophy and publishing in a philosophy journal was a big part of the issue here.

I'd be happy to help with your next paper if you want to talk through the ideas. I'd be less excited to contribute if you've already finished a draft. I feel that collaboration is more useful in the idea development stage than the polishing stage. At that point, there's a lot of real sunk cost so outside contributions on the important parts of the argument become much less useful. Actually, I think that's exactly what you described in being unable to really take your advisor's good advice into account in framing this paper because it came too late in the process.

Reply
[-]rhys southan2mo10

Just to clarify, I did change the argument after meeting with my supervisor, which is reflected in the final published draft. He said he didn't think timing was the issue, so I figured out a better way to word the argument. The things I didn't change were the argument title and paper title. I left "the timing problem" as the name for those, even though the argument wasn't as obviously about timing anymore, because I thought changing the names might be a problem (and it's entirely possible I was wrong about that). I thought that's what you had noticed: that the name of the argument didn't fully suit the argument itself. That's because I changed the argument but not the name of it. 

Reply
[-]Seth Herd2mo20

Yes, I understood all of that and that's what I was referring to.

That change of argument but not of title or abstract was exactly why I found the post so frustrating. The abstract didn't actually give a good argument, because you'd changed the central argument but couldn't change the title and didn't change the abstract that much. I suspected that the practices and incentives of academic philosophy were somehow at fault. They were.

Reply
[-]rhys southan2mo-10

"How it could not be instrumental to preserve any goal..." 

The argument is not that goal preservation isn't instrumentally useful for achieving a goal. Preserving a goal normally increases the probability of achieving the goal. So, preserving a goal can be instrumentally rational, and usually is. The argument is just that abandoning a goal is not instrumentally irrational; instrumental rationality doesn't prohibit it. Abandoning goal X makes you worse at achieving goal X, yes, but that doesn't matter instrumentally, because once you abandon goal X, it's not your goal anymore, so instrumental rationality doesn't require taking the means to achieve goal X. It's "the timing problem" because there's no point at which abandoning a goal is instrumentally irrational.  

There are objections to this, but I won't rehash those here, since this is a linkpost. I'd be interested in your take after you've read section 3, if you have a chance to look at it.  

Reply
[-]Seth Herd2mo20

Prior to abandoning the goal, abandoning it is irrational. Allowing oneself to deliberately abandon a goal in the future is irrational. I don't see how what you just said addresses this. I don't see a valid argument, so I don't want to read a whole paper on objections to that argument. Your argument addresses the times after abandoning the goal; it does not address the times before that at all, as far as I can see.

Thus, your statement that there is no point at which it's irrational to abandon a goal seems wrong. All actions are initiated before they happen. It's irrational to initiate the action of abandoning a goal. Like I said, it can happen by accident, but failing to plan to prevent it is irrational.

If I'm missing something here, please explain?

Reply
[-]rhys southan2mo10

Yeah, read sections 3.4 and 3.5. These are meant to address your objection here. Especially 3.4. You're making what we call "the delay objection."

Reply
