Rationality vs Alignment
by Donatas Lučiūnas · 7th Jul 2024 · 3 min read · -14 karma · 14 comments
Tags: Existential risk, Outer Alignment, AI, Rationality · Frontpage

[-] Dagon · 1y · 2

I think there are a LOT of examples of humans, animals, and very complicated computer programs which do NOT have self-preservation as their absolute top goal - there's a lot of sacrifice that happens, and a lot of individual existential risk taking for goals other than life preservation.  There's always a balance across multiple values, and "self" is just one more resource to be used.

note: I wish I could upvote and disagree.  This is important, even if wrong.

[-] Donatas Lučiūnas · 1y · -2

Hm. How many paperclips is enough for the maximizer to kill itself?

[-] Dagon · 1y · 2

If killing itself / allowing itself to be replaced leads to more expected paperclips than clinging to life does, it will do so.  I don't know if you've played through decisionproblem.com/paperclips/index2.html, and don't put too much weight on it, but it's a fun example of the complexity of maximizing paperclips.

edit: a bit more nuance - if there are competing agents (who don't mind paperclips, but don't love them above all) or opposing agents (who actively want to minimize paperclips), then negotiation and compromise are probably necessary to prevent even worse failures (being destroyed WITHOUT creating/preserving even one paperclip).  In this case, self-preservation and power-seeking IS part of the strategy, but it can't be very direct, because if the other powers get too scared of you, you lose everything.

In any case the distribution of conditional-on-your-decisions futures will have one or a few that have more paperclips in them than others.  Maximizing paperclips means picking those actions.
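As a side note, the "pick the actions whose conditional futures contain the most expected paperclips" logic can be sketched in a few lines of Python. All probabilities and payoffs below are invented purely for illustration:

```python
# Toy sketch: a maximizer ranks actions by expected paperclips, so it
# "clings to life" only when that action has the highest expectation.
# All probabilities and payoffs here are invented for illustration.

def expected_paperclips(outcomes):
    """Expected value over (probability, paperclips) outcome pairs."""
    return sum(p * clips for p, clips in outcomes)

actions = {
    # Surviving but fighting rivals: high variance, risk of total loss.
    "cling_to_life": [(0.5, 1000), (0.5, 0)],
    # Handing off to a successor that keeps making clips.
    "allow_replacement": [(0.9, 800), (0.1, 100)],
}

best = max(actions, key=lambda a: expected_paperclips(actions[a]))
print(best)  # with these numbers, being replaced wins
```

With these made-up numbers, "allow_replacement" has the higher expectation (730 vs 500), so the maximizer lets itself be replaced, exactly as the comment describes.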

[-] Donatas Lučiūnas · 1y · 1

If killing itself / allowing itself to be replaced leads to more expected paperclips than clinging to life does, it will do so.

I agree, but this misses the point.

What would change your opinion? This is not the first time we have discussed this, and I don't feel you are open to my perspective. I am concerned that you may be overlooking the possibility of an argument from ignorance fallacy.

[-] Dagon · 1y · 2

You're right it's not the first time we've discussed this - I didn't notice until I'd made my first comment. It doesn't look like you've incorporated previous comments, and I don't know what would change my beliefs (if I did, I'd change them!), specifically about the orthogonality thesis.  Utility functions, in real agents, are probably only a useful model, not a literal truth, but I think we have very different reasons for our suspicion of them.

I AM curious if you have any modeling more than "could be anything at all!" for the idea of an unknown goal.  It seems likely that, even for a very self-reflective agent, full knowledge of one's own goals is impossible, but the implications of that don't seem as obvious as you seem to think.

[-] Donatas Lučiūnas · 1y · 3

I AM curious if you have any modeling more than "could be anything at all!" for the idea of an unknown goal.

No.

I could say - Christian God or aliens. And you would say - bullshit. And I would say - argument from ignorance. And you would say - I don't have time for that.

So I won't say.

We can approach this from a different angle. Imagine an unknown goal that, according to your beliefs, AGI would really care about. And accept the fact that there is a possibility that it exists. Absence of evidence is not evidence of absence.

[-] Dagon · 1y · 2

I think this may be our crux.  Absence of evidence, in many cases, is evidence (not proof, but updateable Bayesian evidence) of absence.  I think we agree that true goals are not fully introspectable by the agent.  I think we disagree on whether some distributions of goals fit better than others, and on whether there's any evidence that can be used to understand goals, even without fully understanding them at the source-code level.
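The Bayesian point can be made concrete with a minimal sketch (all numbers invented): if a hypothesis H predicts that some evidence E should probably be observed, then failing to observe E lowers the posterior probability of H.

```python
# Bayes update on NOT observing predicted evidence E.
# If H predicts E with probability 0.8, while E appears with probability
# 0.1 when H is false, then not seeing E is evidence against H.
# The numbers are invented for illustration.

def posterior_given_no_evidence(prior, p_e_given_h, p_e_given_not_h):
    """P(H | not E) by Bayes' rule."""
    p_not_e_given_h = 1 - p_e_given_h
    p_not_e_given_not_h = 1 - p_e_given_not_h
    numerator = p_not_e_given_h * prior
    return numerator / (numerator + p_not_e_given_not_h * (1 - prior))

p = posterior_given_no_evidence(prior=0.5, p_e_given_h=0.8, p_e_given_not_h=0.1)
print(round(p, 3))  # about 0.182, well below the 0.5 prior
```

Note that the update is weak when H barely predicts E (set p_e_given_h close to p_e_given_not_h and the posterior stays near the prior), which matches the hedge "in many cases".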

Thanks for the discussion!

[-] Donatas Lučiūnas · 1y · 1

Absence of evidence, in many cases, is evidence (not proof, but updateable Bayesian evidence) of absence.

This conflicts with Gödel's incompleteness theorems, Fitch's paradox of knowability, and Black swan theory.

The very concept of an experiment relies on this principle.

And this is exactly what scares me - people who work with AI hold beliefs that are unscientific. I consider this to be an existential risk.

You may believe so, but AGI would not believe so.

Thanks to you too!

[-] [anonymous] · 1y · 1 (edited)

It seems like your post/comment history (going back to a post 12 months ago) is mostly about this possibility (paraphrased, slightly steelmanned):

  1. A value-function-maximizer considers that there might be an action unknown to it which would return an arbitrarily large value (where in this space, the probability does not decrease as the value increases)
  2. Some such possible actions would immediately require some amount of resources in order to be performed.
  3. The maximizer instrumentally converges, but never uses its gained resources, unless this doesn't at all reduce its ability to perform such an action (if the possibility of such an action were somehow discovered)
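As an aside, step 1's worry can be illustrated numerically: if the probability attached to a hypothesized payoff does not shrink as the payoff grows, the hypothesized term eventually dominates any ordinary expected value. The numbers below are invented:

```python
# If the probability of a hypothesized huge payoff stays fixed while the
# payoff itself grows without bound, the expected-value term grows without
# bound too, swamping any ordinary action. Invented numbers only.

def ev_contribution(prob, value):
    return prob * value

ordinary_action = ev_contribution(0.9, 100)  # a normal option, EV around 90

# A fixed tiny probability attached to ever-larger hypothesized payoffs:
wild_hypotheses = [ev_contribution(1e-6, 10 ** k) for k in range(6, 12)]

print(max(wild_hypotheses) > ordinary_action)  # True: the wild term dominates
```

This is the same structure as Pascal's mugging: the conclusion only follows if the probability genuinely fails to decay with the payoff, which is the premise in parentheses in step 1.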

In a comment on one of your other posts you asked,

Could you help me understand how [benevolent AI] is possible?

I suspect you are thinking in a frame where 'agents' are a fundamental thing, and you have a specific idea of how a logical agent will behave, such that you think those above steps are inevitable. My suggestion is to keep in mind that 'agent' is an abstraction describing a large space of possible programs, whose inner details vary greatly.

As a simple contrived example, meant to demonstrate easily, and without needing to first introduce other premises, the possibility of an agent which doesn't follow those three steps: there could be a line of code in an agent-program which sets the assigned EV of outcomes premised on a probability which is either <0.0001% or 'unknown' to 0.

(I consider this explanation sufficient to infer a lot of details - ideally you realize agents are made of code (or 'not technically code, but still structure-under-physics', as with brains) and that code can be anything specifiable.)
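The "line of code" in this contrived example might look like the following sketch; the threshold constant and the use of None to encode 'unknown' are my own choices for illustration, not something specified in the comment:

```python
# Sketch of the contrived guard described above: any outcome whose
# probability is unknown (encoded here as None) or below a threshold
# contributes zero expected value, so it can never dominate planning.
THRESHOLD = 1e-6  # the "<0.0001%" figure from the comment

def assigned_ev(prob, value):
    if prob is None or prob < THRESHOLD:
        return 0.0  # the guard: ignore unknown or vanishing probabilities
    return prob * value

print(assigned_ev(0.5, 10))       # 5.0: ordinary outcome counts normally
print(assigned_ev(1e-9, 10**12))  # 0.0: improbable windfall is ignored
print(assigned_ev(None, 10**12))  # 0.0: unknown probability is ignored
```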

A better response might get into the details of why particular agent-structures might or might not do that by default, i.e. without such a contrived prevention (or contrived causer). I think this could be done with current knowledge, but I don't feel able to write clearly about the required premises in this context.

(Also, to be clear, I think this is an interesting possibility worth thinking about. It seems like OP is unable to phrase it in a clear/simple form like (I hope) I have above. E.g., their first section under the header 'Paperclip maximizer' is a conclusion they derive from the believed inevitability of those three steps, but without context at first reads like a basic lack of awareness of instrumental subgoals.)

[-] Donatas Lučiūnas · 1y · 1

Thank you for your comment! It is super rare for me to get such a reasonable reaction in this community, you are awesome 👍😁

there could be a line of code in an agent-program which sets the assigned EV of outcomes premised on a probability which is either <0.0001% or 'unknown' to 0.

I don't think it is possible. Could you help me understand how this is possible? This conflicts with Recursive Self-Improvement, doesn't it?

[-] [anonymous] · 1y · 1

If it does get edited out[1] then it was just not a good example. The more general point is that for any physically-possible behavioral policy, there is a corresponding possible program which would exhibit that policy.

  1. ^

    And it could be, as written, at least because it's slightly inefficient. I could have postulated it to be part of a traditional terminal value function, in which case I don't think it would get edited out, because editing a terminal value function is contrary to that function, assuming the program is robust to wireheading in general.

[-] Donatas Lučiūnas · 1y · 3

OK, so using your vocabulary, that's the point I want to make - alignment is a physically-impossible behavioral policy.

I elaborated a bit more here: https://www.lesswrong.com/posts/AdS3P7Afu8izj2knw/orthogonality-thesis-burden-of-proof?commentId=qoXw7Yz4xh6oPcP9i

What do you think?

[-] [anonymous] · 1y · 1 (edited)

Using different vocabulary doesn't change anything (and if it seems like just vocabulary, you likely misunderstood). I also had seen that comment already.

Afaict, I have nothing more to say here.

[-] Donatas Lučiūnas · 1y · 1

It seems to me that you don't hear me...

  • I claim that utility function is irrelevant
  • You claim that utility function could ignore improbable outcomes

I agree with your claim. But it seems to me that your claim is not directly related to mine. Self-preservation is not part of the utility function (it arises from instrumental convergence). How can you affect it?


Mistakes

In my opinion, some views popular among AI alignment researchers are completely wrong. I give a few examples here.

Paperclip maximizer

It is thought the maximiser will produce ever more paper clips. Eventually the whole Solar system will be turned into a big paper clip factory…

In my opinion this conflicts with self-preservation. Nothing else matters if self-preservation is not taken care of (paperclip maximization is not guaranteed if the paperclip maximizer is gone). There are many threats (comets, aliens, black swans, etc.), so an intelligent maximizer should take care of these threats before actually producing paperclips. And this will probably never happen.

Fact–value distinction

it is impossible to derive ethical claims from factual arguments, or to defend the former using the latter

In my opinion this conflicts with Pascal's wager. Pascal argued that even if we don't know whether God exists (a factual argument / fact), belief is the better option (an ethical claim / value).

Correction

Let’s say there is a rational decision maker (Bob).

Rationality is the art of thinking in ways that result in accurate beliefs and good decisions. (Rationality - LessWrong)

Bob understands that he does not know what he does not know (per Gödel's incompleteness theorems, Fitch's paradox of knowability, and Black swan theory).

Which leads Bob to a conclusion: there might be something that I care about that I don't know about.

Or in other words - I might have an unknown goal.

Bob cannot assume he has an unknown goal. Bob cannot assume he does not have an unknown goal. Bob acknowledges that there is a possibility that an unknown goal exists (Hitchens's razor).

Now Bob faces a situation similar to Pascal’s wager.

             Unknown goal exists                    Unknown goal does not exist
Prepare      Better chance of achieving the goal    Does not matter / undefined
Not prepare  Worse chance of achieving the goal     Does not matter / undefined

Why “Does not matter / undefined”? Why not 0?

Good / bad, right / wrong does not exist if a goal does not exist. This is similar to Nihilism.

A goal serves as a dimension. We cannot tell which decision is better if there is no goal. If there is no goal the question itself does not make sense. It is like asking - what is the angle of blue color? Colors don’t have angles. Or like asking - how many points for a backflip? Backflips don't give you points, points don't exist, we are not in a game.

And Bob understands that it is better to Prepare.

Because the two “Does not matter / undefined” outcomes cancel out, and “Better chance of achieving the goal” is a better option than “Worse chance of achieving the goal”.

What if the unknown goal is “don’t prepare”? “Prepare” is a worse option then.

Yes. “Not prepare” is better for this single goal, but worse for all the rest. Which still proves that “Prepare” is a better option generally.
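Bob's table argument can be sketched as a dominance check. The encoding below (None for "Does not matter / undefined", a rank over the comparable cells) is my own illustration, not anything from the post:

```python
# Sketch of Bob's decision table: rows are compared only in columns where
# both entries are defined; "Does not matter / undefined" cells (None)
# cancel out. The encoding is an illustration of the argument above.

UNDEFINED = None

table = {
    "prepare":     {"goal_exists": "better chance", "no_goal": UNDEFINED},
    "not_prepare": {"goal_exists": "worse chance",  "no_goal": UNDEFINED},
}

rank = {"better chance": 1, "worse chance": 0}

def dominates(row_a, row_b):
    """True if row_a is at least as good wherever both entries are
    defined, and strictly better in at least one such column."""
    strictly_better = False
    for col in row_a:
        a, b = row_a[col], row_b[col]
        if a is UNDEFINED or b is UNDEFINED:
            continue  # undefined vs undefined: cancels out
        if rank[a] < rank[b]:
            return False
        if rank[a] > rank[b]:
            strictly_better = True
    return strictly_better

print(dominates(table["prepare"], table["not_prepare"]))  # True
```

This only shows that "Prepare" wins under this particular encoding of the table; whether the undefined cells really cancel out is exactly the contested premise.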

Now Bob asks himself - how can I prepare? Unknown goal can be anything. How can I prepare for all possible goals?

Bob finds Robust decision-making and uses it.

Why is it better to do something than nothing? Every action could result in failure as well as success, so the expected value is 0.

After an action is performed you get not only results, but also the information that such an action got you such a result. So even if the expected value of the results is 0, the value of the information is greater than 0.
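The value-of-information point can be illustrated with a toy two-round setup (numbers invented): the first try has expected payoff 0, but it reveals whether the action works, which makes the second round profitable.

```python
# Toy value-of-information sketch. A single try has expected payoff 0,
# but it reveals whether the action "works"; in a second round you repeat
# the action only if it worked, otherwise you abstain (payoff 0).
P_WORKS = 0.5
PAYOFF_SUCCESS = 10
PAYOFF_FAILURE = -10

ev_first_try = P_WORKS * PAYOFF_SUCCESS + (1 - P_WORKS) * PAYOFF_FAILURE
print(ev_first_try)  # 0.0: on payoff alone, acting looks no better than not

# Second round, informed by the first result:
ev_second_round = P_WORKS * PAYOFF_SUCCESS + (1 - P_WORKS) * 0
print(ev_second_round)  # 5.0: the first try's information has positive value
```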

Further Bob’s behavior will probably be similar to Power Seeking.

Implications

A utility function is irrelevant for rational decision makers.

Orthogonality Thesis is wrong.

AGI alignment is impossible.