Then I don't really understand your argument.
As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.
Could you explain to me how Ronya does not violate her goal on Monday night? Let me reformulate the goal so it is more formal: Ronya wants to minimize the number of occasions on which she is presented cake but does not eat it. As you said, you assume that she evaluates the future with her current goal. She reasons: if I abandon the goal tonight, then tomorrow I will decline the cake, which is one such occasion; if I preserve the goal, I will eat the cake, which is zero. Evaluated with the current goal, preserving it is strictly better.
Ronya preserves the goal.
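Here is a minimal sketch of that deliberation in code, with the bookkeeping made explicit (the outcomes and the scoring function are my own toy formalization, not something from your post):

```python
# Toy model of Monday night's deliberation. Ronya's CURRENT goal: minimize the
# number of occasions on which she is presented cake but does not eat it.

# Predicted Tuesday outcome for each option Ronya has on Monday night.
predicted_future = {
    "preserve goal": "eats the cake",
    "abandon goal": "declines the cake",
}

def violations_under_current_goal(outcome: str) -> int:
    """Score a predicted future with the goal Ronya holds right now."""
    return 1 if outcome == "declines the cake" else 0

for option, outcome in predicted_future.items():
    print(f"{option}: {violations_under_current_goal(outcome)} violation(s)")
# preserve goal: 0 violation(s)
# abandon goal: 1 violation(s)
```

Scored with the goal she currently holds, abandoning it is strictly worse, so I don't see how she ends up abandoning it.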
I'm looking more closely at the Everitt et al. paper and I'm less sure I actually understood you. Everitt et al.'s conclusion is that an agent will resist goal change if it evaluates the future using the current goal. These are two different failure modes: A) not evaluating the future at all, and B) not using the current goal to evaluate the future. From your conclusions, it would seem that you are assuming A. If you were assuming B, then you would have to conclude that the agent will want to change its goal to one that is always maximally satisfied. But your language seems to be aiming at B. Either way, it seems that you are assuming one of these.
As I said, I'm not familiar with the philosophy, concepts, and definitions that you mention. To the best of my understanding, the concept of a goal in AI is derived from computer science and decision theory. I imagine people in the early 2000s thought that the goal/utility would be formally specified, defined, and written as code in the system. The only possible way for the system to change the goal would be via self-modification.
Goals in people are something different. Their goals are derived from their values.[1] I think you would say that people are ends-rational. In my opinion, in your line of thought it would be more helpful to think of AI goals as more akin to people's values. Both people's values and AI goals are something fundamental and unchangeable. You might argue that people do change their values sometimes, but what I'm really aiming at are fundamental hard-to-describe beliefs like "I want my values to be consistent."
Overall, I'm actually not really sure how useful this line of investigation into goals is. For example, Dan Hendrycks has a paper on AI risk, where he doesn't assume goal preservation; on the contrary, he talks about goal drift and how it can be dangerous (section 5.2). I suggest you check it out.
I'm sure there is also a plethora of philosophical debate about what goals (in people) really are and how they are derived. Same for values.
I would add two things.
First, the myopia has to be really extreme. If the agent planned at least two steps ahead, it would be incentivized to keep its current goal. Changing the goal in the first step could make it take a bad second step.[1]
Second, the original argument is about the could, not the would: the possibility of changing the goal, not the necessity. In practice, I would assume a myopic AI would not be very capable, and thus self-modification and changing its goal would be far beyond its abilities.
There is an exception to this. If the new goal still makes the agent take an optimal action in the second step, it can change to it.
For example, if the paperclip maximizer has no materials (and due to its myopia can't really plan to obtain any), it can change its goal while it's idling because all actions make zero paperclips.
A more sophisticated example: suppose the goal is "make paperclips and don't kill anyone." (If we wanted to frame it as a utility function, we could say: utility = (number of paperclips) - (number of people killed) × (a very large number).) Suppose an optimal two-step plan is: 1. obtain materials, 2. make paperclips. However, suppose that in the first step the agent changes its goal to just making paperclips. As long as there is no possible action in the second step that makes more paperclips while killing people, the agent will take the same action in the second step even with the changed goal. Thus changing the goal in the first step is also an optimal action.
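Here is a toy two-step version of that exception, just to make it concrete (the specific actions and numbers are invented for illustration):

```python
# Candidate second-step actions: (paperclips made, people killed).
# Crucially, there is NO action that makes more paperclips by killing someone.
second_step_actions = {
    "make paperclips from the obtained materials": (10, 0),
    "idle": (0, 0),
}

VERY_LARGE_NUMBER = 10**9

def u_original(paperclips: int, killed: int) -> int:
    """Original goal: make paperclips and don't kill anyone."""
    return paperclips - killed * VERY_LARGE_NUMBER

def u_changed(paperclips: int, killed: int) -> int:
    """Changed goal: just make paperclips."""
    return paperclips

def best_second_step(utility) -> str:
    return max(second_step_actions, key=lambda a: utility(*second_step_actions[a]))

# The chosen second step is identical under either utility function...
assert best_second_step(u_original) == best_second_step(u_changed)
# ...so changing the goal in the first step costs nothing and is also optimal.
```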
It seems that your paper is basically describing Theorem 14 of Self-Modification of Policy and Utility Function in Rational Agents by tom4everitt, DanielFilan, Mayank Daswani, and Marcus Hutter, though I haven't read their paper in detail.
Hey Rhys, thanks for posting this and trying to seriously engage with the community!
Unfortunately, either I completely misunderstood your argument, or you completely misunderstood what this community considers a goal. It seems you are considering only an extremely myopic AI. I don't have any background knowledge in what is considered a goal in the philosophy that you cite. The concepts of ends-rationality and wide-scope view don't make any sense in my concept of a goal.
Let me try to formalize two things: a) your argument and your conception of a goal, and b) what I (and likely a lot of people in the community) might consider a goal.
Your model:
We will use a world model and a utility function calculator and describe how an agent behaves based on these two.
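As I understand it, the agent scores each available action by feeding the world model's one-step prediction into its current utility function and picks a maximizer. Roughly, with notation that is mine rather than yours ($s_t$ is the current state, $\mathrm{WM}(s_t, a)$ the predicted next state after action $a$, and $U_t$ the utility function the agent holds at time $t$):

$$a^* \in \arg\max_{a} \; U_t\big(\mathrm{WM}(s_t, a)\big)$$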
Example problem:
Suppose that the current utility function is the number of paperclips. Suppose actions $a_1$ and $a_2$ both produce 10 paperclips; however, $a_2$ also changes the utility function to the number of cakes. Both actions have the same utility (since they both produce the same number of paperclips in the next state). Thus the agent can take either action and change its goal.
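Spelled out in that notation:

$$U_t\big(\mathrm{WM}(s_t, a_1)\big) = 10 = U_t\big(\mathrm{WM}(s_t, a_2)\big)$$

so the myopic agent is indifferent between the two actions and is free to take the goal-changing one.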
My model:
The agent still uses a world model and its current utility function, but it evaluates entire future trajectories with that current utility function, not just the immediate next state.
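Roughly, in the same made-up notation: the agent picks a plan $\pi$ maximizing the total utility of the whole predicted trajectory, with the current $U_t$ applied to every future state $s_k^{\pi}$, even if some action along the way rewrites the utility function the agent will carry:

$$\pi^* \in \arg\max_{\pi} \; \sum_{k \ge t} U_t\big(s_k^{\pi}\big)$$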
Fixed problem:
Again, let the utility be the total number of paperclips the agent makes over its lifespan. Take the same actions, $a_1$ and $a_2$, both producing 10 paperclips, but with $a_2$ changing the utility function to the number of cakes. Now the agent will not choose $a_2$, because it would then stop making paperclips in the future, so $a_2$ has lower utility.
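Concretely, assuming just for illustration that the agent makes 10 paperclips per step for as long as it still holds the paperclip utility function, and none once it has become a cake-maximizer:

$$U_t(a_1\text{-trajectory}) = 10 + 10 + 10 + \dots \;>\; U_t(a_2\text{-trajectory}) = 10 + 0 + 0 + \dots$$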
Now a few caveats. First, even in my model, the agent might still want to change its utility function, perhaps because it might be turned off if it is found to be a paperclip maximizer. Second, my model is probably not perfect; people who have studied this more closely might have objections. Still, I think it is much closer to what people here consider a goal. Third, very few people actually expect AI to work like this: a goal will more likely be an emergent property of a complex system, like those in the current deep learning paradigm. But the formalism is a useful tool for reasoning about AI and intelligence.
Let me know if I misunderstood your argument, or if something is unclear in my explanation.
You are clearly assuming B, i.e. not using the current goal to evaluate the future. You even explicitly state it.