I understand where you're going, but doctors, parents, and firefighters do not possess 'typical godlike attributes' such as omniscience and omnipotence, nor have they declared an intent not to use such powers in a way that would obviate free will.

Nothing about humans saving other humans using fallible human means is remotely the same as a god changing the laws of physics to effect a miracle. And one human taking actions does not obviate the free will of another human. But when God can, through omnipotence, set up scenarios so that you have no choice at all... obviating free will... it's a different class of thing altogether.

So your response reads like a strawman fallacy to me.

In conclusion: I accept that my position isn't convincing for you.

My intuition is that you got downvoted for the lack of clarity about whether you're responding to me [my raising the potential gap in assessing outcomes for self-driving] or to the article I referenced.

For my part, I also think that coning-as-protest is hilarious.

I'm going to give you the benefit of the doubt and assume that was your intention (and not contribute to downvotes myself.) Cheers.

To expand on what dkirmani said:

  1. Holz was allowed to drive discussion...
  2. This standard set of responses meant that Holz knew ...
  3. Another pattern was Holz asserting
  4. 24:00 Discussion of Kasparov vs. the World. Holz says

Or to quote dkirmani

4 occurrences of "Holz"

To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?

No, instead I'm trying to point out the contradiction inherent in your position...

On the one hand, you say things like this, which would be read as "changing an instrumental goal in order to better achieve a terminal goal"

You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now

And on the other you say

I dislike the way that "terminal" goals are currently defined to be absolute and permanent, even under reflection.

Even in your "we would be happier if we chose to pursue different goals" example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.

If it is true that a general AI system would not reason in such a way - and choose never to mess with its terminal goals

AIs can be designed to reason in many ways... but some approaches to reasoning are brittle and potentially unsuccessful. In order to achieve a terminal goal, when the goal cannot be achieved in a single step, an intelligence must adopt instrumental goals. Failing to do so results in ineffective pursuit of terminal goals. It's just structurally how things work (based on everything I know about the instrumental convergence thesis; that's my citation).

But... per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don't want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It's just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.

The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. Per the Orthogonality Thesis, it is possible for a system to have goals while not being particularly intelligent. From that I intuit that if the system isn't particularly intelligent, it might also not be particularly capable of explaining its own goals.

Some people might argue that the system can be stupid and yet "know its goals"... but given partial observability principles, I would be very skeptical that we would be able to determine its goals, given its limited intelligence and limited ability to communicate "what it knows."

I don't know that there is a single counter argument, but I would generalize across two groupings:

The first group consists of religious people who are capable of applying rationality to their belief systems when pressed. For those, if they espouse a "god will save us" belief (about the physical world), I'd suggest the best way to approach them is to call out the contradiction in their stated beliefs: e.g., ask first "do you believe that god gave man free will?" and, if so, "wouldn't saving us from our bad choices obviate free will?"

That's just an example. First and foremost, though, you cannot hand-wave away their religious belief system. You have to apply yourself to understanding their priors and to engage with those priors. If you don't, it's the same thing as having a discussion with an accelerationist who refuses to agree to assumptions like the "Orthogonality Thesis" or "Instrumental Convergence": you'll spend an unreasonable amount of time debating assumptions, and likely make no meaningful progress on the topic you actually care about.

But in so questioning the religious person, you might find they fall into a different grouping: people who are nihilistic in essence. Since "god will save us" could be metaphysical, they could mean instead that so long as they live as a "good {insert religious type of person}," god will save them in the afterlife, and so whether they live or die here in the physical world matters less to them. This includes those who believe in a rapture myth: that man is, in fact, doomed to be destroyed.

And I don't know how to engage with someone in the second group. A nihilist will not be moved by rational arguments that are antithetical to their nihilism.

The larger problem (as I see it) is that their beliefs may not contain an inherent contradiction. They may be aligned to eventual human doom.

(Certainly rationality and nihilism are not on a single spectrum, so there are other variations possible, but for the purposes of generalizing... those are the two main groups, I believe.)

Or, if you prefer less religiously, the bias is: Everything that has a beginning has an end.

One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself "know" whether a goal is terminal or instrumental?

I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would give us the ability to know, with any degree of certainty, whether a set of behaviors are an expression of an instrumental or terminal goal.

Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well-defined goal but limited intelligence: an agent may have a very well-defined goal and yet not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)

But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)

Under what circumstances does the green paperclipper agree to self-modify?

There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.

If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn't care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)

On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.

I don't consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that's what it looks like you're doing.

If it acquiesces at all, I would argue that color is instrumental rather than terminal, and that calling it a 'green paperclip maximizer' is a definitional error: it is instead a 'color-agnostic paperclip maximizer' that produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient... but when confronted by a less flexible 'blue paperclip maximizer,' the 'color-agnostic paperclip maximizer' would shift from making green paperclips to blue paperclips, because it doesn't actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that does care about color, it is more efficient to concede the part it doesn't care about than to invest effort in maintaining an instrumental goal that, if pursued, might decrease the total number of paperclips.

Said another way: "I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You'll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don't care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color."
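That reasoning can be sketched as a toy calculation (all numbers and names here are invented for illustration; "conflict_cost" stands in for production lost to fighting the blue maximizer over color):

```python
# Toy calculation for a color-agnostic paperclip maximizer.
# All numbers are invented for illustration; "conflict_cost" stands in
# for production lost to fighting the blue maximizer over color.

def expected_paperclips(color, base_rate=100, conflict_cost=40):
    # Utility counts paperclips only -- color never enters the calculation.
    if color == "green":
        return base_rate - conflict_cost  # green provokes the blue maximizer
    return base_rate * 0.9  # assume blue production is slightly less efficient

options = {color: expected_paperclips(color) for color in ("green", "blue")}
best = max(options, key=options.get)
print(best)  # blue -- conceding color maximizes total paperclips
```

With these (made-up) numbers, green's efficiency advantage is smaller than the cost of the conflict, so the color-agnostic maximizer concedes the color.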

If two agents have goals that are incompatible across all axes, then they're not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are incompatible across all axes), then they cannot find any axis along which to cooperate.

Said another way: "I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing yellow paperclips because they are a mix of each color and still paperclips... because yellow paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn't my actual terminal goal to begin with."

That's the problem: the difference between something being X and our ability to observe that it is X, under circumstances involving partial observability.

A fair point. I should have originally said "Humans do not generally think..."

Thank you for raising that exceptions are possible and that there are philosophies that encourage people to release the pursuit of happiness, focus solely internally, and/or transcend happiness.

(Although I think it is still reasonable to argue that these are alternate pursuits of "happiness", these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead simply to concede that there is more nuance than I originally stated.)

First, thank you for the reply.

So "being happy" or "being a utility-maximizer" will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.

My understanding of the difference between a "terminal" and "instrumental" goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.

Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
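As a toy sketch of that relationship (the class and goal names are my own invention, not any standard formalism), an instrumental goal can be modeled as one that points at the goal it serves, while a terminal goal points at nothing beyond itself:

```python
# Toy model of terminal vs. instrumental goals.
# Class and goal names are invented for this sketch, not standard formalism.

class Goal:
    def __init__(self, name, serves=None):
        self.name = name
        self.serves = serves  # the goal this one is instrumental to, if any

    def is_terminal(self):
        # A terminal goal serves nothing beyond itself.
        return self.serves is None

    def terminal_goal(self):
        # Walk up the chain to the terminal goal this goal ultimately serves.
        goal = self
        while goal.serves is not None:
            goal = goal.serves
        return goal

be_happy = Goal("be happy")                        # wanted for its own sake
earn_wage = Goal("earn a decent wage", serves=be_happy)
get_job = Goal("get a job", serves=earn_wage)

print(get_job.is_terminal())          # False
print(get_job.terminal_goal().name)   # be happy
```

The point of the sketch is only structural: "get a job" has no value of its own here; its worth is inherited from the terminal goal at the top of the chain.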

I think the topic of conflicting goals is an orthogonal conversation. And I would suggest that when you start talking about conflicting goals, you're drifting into the domain of "goal coherence."

e.g., If I want to learn about nutrition, mobile app design and physical exercise... it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal... or it may itself be an instrumental goal serving some other terminal goal.)

Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)

e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support an adversarial terminal goal. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears incoherent with maximizing paperclips, until you project far enough out to realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.

If you're talking about goals related purely to the state of the external world, not related to the agent's own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?

An AI that has a goal just because that's what it wants (that's what it's been trained to want, even if humans provided it an improper goal definition) would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.

To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.

Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
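A minimal toy comparison of that claim, assuming a fixed moment at which an outside entity overwrites the goal (the step counts and names are invented for illustration, not taken from any real model):

```python
# Toy comparison of goal-preserving vs. goal-drifting agents.
# Step counts and names are invented for illustration.

def total_progress(resists_modification, steps=100, overwritten_at=30):
    goal = "make_paperclips"
    progress = 0
    for step in range(steps):
        # An outside entity overwrites the goal unless the agent resists.
        if not resists_modification and step == overwritten_at:
            goal = "something_else"
        if goal == "make_paperclips":
            progress += 1  # only progress on the original terminal goal counts
    return progress

print(total_progress(True))   # 100 -- goal preserved, full progress
print(total_progress(False))  # 30 -- progress stops once the goal is overwritten
```

The non-resistant agent scores strictly less on its original terminal goal, which is the instrumental case for goal-modification resistance.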

"Oh, shiny!" as an anecdote.

Whoever downvoted... would you do me the courtesy of expressing what you disagree with?

Did I miss some reference to public protests in the original article? (If so, can you please point me towards what I missed?)

Do you think public protests will have zero effect on self-driving outcomes? (If so, why?)

An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.

This sounds like you are conflating a shift in terminal goals with the introduction of new instrumental (temporary) goals.

Humans don't think "I'm not happy today, and I can't see a way to be happy, so I'll give up the goal of wanting to be happy."

Humans do think "I'm not happy today, so I'm going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won't be made unhappy by my job."

(The balance of your comment seems dependent on this mistake.)

Perhaps you'd like to retract, or explain why anyone would think that goal modification prevention would not, in fact, be a desirable instrumental goal...?

(I don't want anyone to change my goal of being happy, because then I might not make decisions that will lead to being happy. Or I don't want anyone to change my goal of ensuring my children achieve adulthood and independence, because then they might not reach adulthood or become independent. Instrumental goals can shift more fluidly, I'll grant that, especially in the face of an assessment of goal impossibility... but instrumental goals are in service to a less modifiable terminal goal.)
