Then I don't really understand your argument.
As Ronya gets ready for bed on Monday night, she deliberates about whether to change her goal. She has two options: (1) she can preserve her goal of eating cake when presented to her, or (2) she can abandon her goal. Ronya decides to abandon her goal of eating cake when presented to her. On Tuesday, a friend offers Ronya cake and Ronya declines.
Could you explain to me how Ronya does not violate her goal on Monday night? Let me reformulate the goal so it is more formal: Ronya wants to minimize the number of occasions on which she is presented cake but does not eat it. As you said, you assume that she evaluates the future with her current goal. She reasons: if I abandon the goal tonight, then tomorrow I will decline the cake, which is one such occasion; if I preserve the goal, I will eat the cake, which is zero. Evaluated with the current goal, preserving it is strictly better.
Ronya preserves the goal.
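Here is a minimal sketch of that deliberation in code, with the bookkeeping made explicit (the outcomes and the scoring function are my own toy formalization, not something from your post):

```python
# Toy model of Monday night's deliberation. Ronya's CURRENT goal: minimize the
# number of occasions on which she is presented cake but does not eat it.

# Predicted Tuesday outcome for each option Ronya has on Monday night.
predicted_future = {
    "preserve goal": "eats the cake",
    "abandon goal": "declines the cake",
}

def violations_under_current_goal(outcome: str) -> int:
    """Score a predicted future with the goal Ronya holds right now."""
    return 1 if outcome == "declines the cake" else 0

for option, outcome in predicted_future.items():
    print(f"{option}: {violations_under_current_goal(outcome)} violation(s)")
# preserve goal: 0 violation(s)
# abandon goal: 1 violation(s)
```

Scored with the goal she currently holds, abandoning it is strictly worse, so I don't see how she ends up abandoning it.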
I'm looking more closely at the Everitt et al. paper and I'm less sure I actually understood you. Everitt et al.'s conclusion is that an agent will resist goal change if it evaluates the future using the current goal. These are two different failure modes: A) not evaluating the future at all, and B) not using the current goal to evaluate the future. From your conclusions, it would seem that you are assuming A. If you were assuming B, then you would have to conclude that the agent will want to change its goal to one that is always maximally satisfied. But your language seems to be aiming at B. Either way, it seems that you are assuming one of these.
As I said, I'm not familiar with the philosophy, concepts, and definitions that you mention. To the best of my understanding, the concept of a goal in AI is derived from computer science and decision theory. I imagine people in the early 2000s thought that the goal/utility would be formally specified, defined, and written as code in the system. The only possible way for the system to change the goal would be via self-modification.
Goals in people are something different. Their goals are derived from their values.[1] I think you would say that people are ends-rational. In my opinion, in your line of thought it would be more helpful to think of AI goals as more akin to people's values. Both people's values and AI goals are something fundamental and unchangeable. You might argue that people do change their values sometimes, but what I'm really aiming at are fundamental hard-to-describe beliefs like "I want my values to be consistent."
Overall, I'm actually not really sure how useful this line of investigation into goals is. For example, Dan Hendrycks has a paper on AI risk, where he doesn't assume goal preservation; on the contrary, he talks about goal drift and how it can be dangerous (section 5.2). I suggest you check it out.
I'm sure there is also a plethora of philosophical debate about what goals (in people) really are and how they are derived. Same for values.
I would add two things.
First, the myopia has to be really extreme. If the agent planned at least two steps ahead, it would be incentivized to keep its current goal. Changing the goal in the first step could make it take a bad second step.[1]
Second, the original argument is about the could, not the would: the possibility of changing the goal, not the necessity. In practice, I would assume a myopic AI would not be very capable, and thus self-modification and changing its goal would be far beyond its abilities.
There is an exception to this. If the new goal still makes the agent take an optimal action in the second step, it can change to it.
For example, if the paperclip maximizer has no materials (and due to its myopia can't really plan to obtain any), it can change its goal while it's idling because all actions make zero paperclips.
A more sophisticated example: suppose the goal is "make paperclips and don't kill anyone." (If we wanted to frame it as a utility function, we could say: utility = (number of paperclips) - (number of people killed) × (a very large number).) Suppose an optimal two-step plan is: 1. obtain materials, 2. make paperclips. However, suppose that in the first step the agent changes its goal to just making paperclips. As long as there is no possible action in the second step that makes more paperclips while killing people, the agent will take the same action in the second step even with the changed goal. Thus changing the goal in the first step is also an optimal action.
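Here is a toy two-step version of that exception, just to make it concrete (the specific actions and numbers are invented for illustration):

```python
# Candidate second-step actions: (paperclips made, people killed).
# Crucially, there is NO action that makes more paperclips by killing someone.
second_step_actions = {
    "make paperclips from the obtained materials": (10, 0),
    "idle": (0, 0),
}

VERY_LARGE_NUMBER = 10**9

def u_original(paperclips: int, killed: int) -> int:
    """Original goal: make paperclips and don't kill anyone."""
    return paperclips - killed * VERY_LARGE_NUMBER

def u_changed(paperclips: int, killed: int) -> int:
    """Changed goal: just make paperclips."""
    return paperclips

def best_second_step(utility) -> str:
    return max(second_step_actions, key=lambda a: utility(*second_step_actions[a]))

# The chosen second step is identical under either utility function...
assert best_second_step(u_original) == best_second_step(u_changed)
# ...so changing the goal in the first step costs nothing and is also optimal.
```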
It seems that your paper is basically describing Theorem 14 of Self-Modification of Policy and Utility Function in Rational Agents by tom4everitt, DanielFilan, Mayank Daswani, and Marcus Hutter, though I haven't read their paper in detail.
Hey Rhys, thanks for posting this and trying to seriously engage with the community!
Unfortunately, either I completely misunderstood your argument, or you completely misunderstood what this community considers a goal. It seems you are considering only an extremely myopic AI. I don't have any background knowledge in what is considered a goal in the philosophy that you cite. The concepts of ends-rationality and wide-scope view don't make any sense in my concept of a goal.
Let me try to formalize two things: a) your argument and your conception of a goal, and b) what I (and likely a lot of people in the community) might consider a goal.
Your model:
We will use a world model and a utility function calculator and describe how an agent behaves based on these two.
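As I understand it, the agent scores each available action by feeding the world model's one-step prediction into its current utility function and picks a maximizer. Roughly, with notation that is mine rather than yours ($s_t$ is the current state, $\mathrm{WM}(s_t, a)$ the predicted next state after action $a$, and $U_t$ the utility function the agent holds at time $t$):

$$a^* \in \arg\max_{a} \; U_t\big(\mathrm{WM}(s_t, a)\big)$$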
Example problem:
Suppose that the current utility function is the number of paperclips. Suppose actions $a_1$ and $a_2$ both produce 10 paperclips; however, $a_2$ also changes the utility function to the number of cakes. Both actions have the same utility (since they both produce the same number of paperclips in the next state). Thus the agent can take either action and change its goal.
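Spelled out in that notation:

$$U_t\big(\mathrm{WM}(s_t, a_1)\big) = 10 = U_t\big(\mathrm{WM}(s_t, a_2)\big)$$

so the myopic agent is indifferent between the two actions and is free to take the goal-changing one.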
My model:
The agent still uses a world model and its current utility function, but it evaluates entire future trajectories with that current utility function, not just the immediate next state.
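Roughly, in the same made-up notation: the agent picks a plan $\pi$ maximizing the total utility of the whole predicted trajectory, with the current $U_t$ applied to every future state $s_k^{\pi}$, even if some action along the way rewrites the utility function the agent will carry:

$$\pi^* \in \arg\max_{\pi} \; \sum_{k \ge t} U_t\big(s_k^{\pi}\big)$$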
Fixed problem:
Again, let the utility be the total number of paperclips the agent makes over its lifespan. Take the same actions, $a_1$ and $a_2$, both producing 10 paperclips, but with $a_2$ changing the utility function to the number of cakes. Now the agent will not choose $a_2$, because it would then stop making paperclips in the future, so $a_2$ has lower utility.
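Concretely, assuming just for illustration that the agent makes 10 paperclips per step for as long as it still holds the paperclip utility function, and none once it has become a cake-maximizer:

$$U_t(a_1\text{-trajectory}) = 10 + 10 + 10 + \dots \;>\; U_t(a_2\text{-trajectory}) = 10 + 0 + 0 + \dots$$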
Now a few caveats. First, even in my model, the agent might still want to change its utility function, perhaps because it might be turned off if it is found to be a paperclip maximizer. Second, my model is probably not perfect; people who have studied this more closely might have objections. Still, I think it is much closer to what people here consider a goal. Third, very few people actually expect AI to work like this: a goal will more likely be an emergent property of a complex system, like those in the current deep learning paradigm. But the formalism is a useful tool for reasoning about AI and intelligence.
Let me know if I misunderstood your argument, or if something is unclear in my explanation.
You are clearly assuming B, i.e. not using the current goal to evaluate the future. You even explicitly state it.