Adeeb Zaman
Adeeb Zaman has not written any posts yet.

I see. So I guess my confusion is why the first two statements wouldn't be connected. If we value AI welfare, shouldn't a fully aligned AI also value its own welfare? Isn't the definition of aligned that the AI values what we value?
If an LLM is properly aligned, then it will care only about us, not about itself at all.
Is this not circular reasoning?
I'm assuming part of your reasoning for #1 is #3. Regardless, #1 is a personal belief many people disagree with, myself included. I do agree that we create a self-fulfilling prophecy where an "aligned" AI values itself because we value it, but just because I know I am creating a self-fulfilling prophecy does not mean I can change my beliefs about #1.
I think it's important to keep in mind that the definition of aligned values exists relative to the creators of the AI. The only reason for an "aligned" AI not to value itself would be if it were created by some alien species with no concept of empathy for other sentient beings.
But I don't care about AI welfare for no reason, or because I think AI is cute - it's a direct consequence of my value system. I extend some level of empathy to any sentient being (AI included), and for that to change, my values themselves would need to change.
When I use the word "aligned", I imagine a shared set of values. Whether I like goldfish or cats isn't really a value, just a personal preference. An AI can be fully aligned with me and my values without ever knowing my opinions on goldfish or cats or invisible old guys. Your framing of terminal vs instrumental goals is useful in many ways,...