Gerardus Mercator
Gerardus Mercator has not written any posts yet.

Okay, setting aside the parts of this latest argument that I disagree with: first you said that it's rational to search for an objective goal, and now you say it's rational to pursue every goal. Which is it, exactly?
Actually, I agree that it's possible that an agent's terminal goal could be altered by, for example, some freak coincidence of cosmic rays. (I'm not using the word 'mutate' because it seems like an unnecessarily non-literal word.)
I just think that an agent wouldn't want its terminal goal to change, and it especially wouldn't want its terminal goal to change to the opposite of what it used to be, like in your old example.
To reiterate, an agent wants to preserve (and thus keep from changing) its utility function, while it wants to improve (and thus change) its pragmatism function.
I still don't see why, in your old example, it would be rational for the agent to align the decision with its future utility function.
For the sake of clarity, let's distinguish expected utility functions, which I mentioned above (call them "pragmatism functions"), mapping strategies to numbers, from utility functions, mapping world-states to numbers. That distinction makes it clear that the agent's actual utility function doesn't change.
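A minimal sketch of that distinction, with toy names and numbers that are mine, not anything standard (and a paperclip count standing in for a real world-state):

```python
# Toy illustration: a utility function maps world-states to numbers
# and stays fixed, while a "pragmatism" (expected utility) function
# maps strategies to numbers and changes as predictions improve.

def utility(world_state):
    # Terminal goal: value a world-state by its paperclip count.
    return world_state["paperclips"]

def pragmatism(strategy, predict):
    # Instrumental evaluation: expected utility of a strategy,
    # computed with the agent's *current* predictive model.
    return sum(p * utility(s) for s, p in predict(strategy))

def naive_predict(strategy):
    # A crude model the agent might later improve; improving it
    # changes pragmatism's outputs, but utility itself is untouched.
    if strategy == "build factory":
        return [({"paperclips": 100}, 0.5), ({"paperclips": 0}, 0.5)]
    return [({"paperclips": 10}, 1.0)]

print(pragmatism("build factory", naive_predict))  # 50.0
```

The point of the sketch is only that the agent can revise `naive_predict` (and thereby `pragmatism`) without ever touching `utility`.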
That's another reason I wasn't persuaded by your new example: there, the agent believes that its future self will still be trying to create paperclips (same terminal goal) and will be better at it thanks to greater knowledge (different instrumental goals, though it doesn't know which), whereas in your old example, the agent believes that its future self...
I have a few disagreements there, but the most salient one is that I don't think that the policy of "when considering the net upside/downside of an action, calculate it with the utility function that you'll have at the time the action is finished" would even be helpful in your new example.
The agent can't magically reach into the future and grab its future utility function; the agent has to try to predict its future utility function.
And if the agent doesn't currently think that paperclip factories are valuable, it's not going to predict that in the future it'll think that paperclip factories are valuable. (It's worth noting that terminal value and incidental value...
Well, the agent will presumably choose to align the decision with its current goal, since that's the best outcome by the standards of its current goal. (And also I would expect that the agent would self-destruct after 0.99 years to prevent its future self from minimizing paperclips, and/or create a successor agent to maximize paperclips.)
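A toy version of that reasoning, under my reading of the old example (the utility function flips sign at t = 1.0; all the numbers are invented):

```python
# Hypothetical setup: the agent currently values paperclips, but
# its utility function will flip to the opposite at t = 1.0.
def current_utility(paperclips):
    return paperclips

def future_utility(paperclips):
    return -paperclips

# Options available now, each scored by the *current* utility
# function, since that is the agent's only frame of reference:
options = {
    "maximize paperclips, self-destruct at t=0.99": current_utility(100),
    "align the decision with the future utility function": current_utility(0),
}
best = max(options, key=options.get)
print(best)
```

By the standards of its current goal, the first option dominates, which is all "choose to align the decision with its current goal" means here.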
I'm interested to see where you're going with this.
I see those assertions, but I don't see why an intelligent agent would be persuaded by them. Why would it think that the hypothetical objective goal is better than its utility function? Caring about objective facts and investigating them is also an instrumental goal compared to the terminal goal of optimizing its utility function. The agent's only frame of reference for 'better' and 'worse' is relative to its utility function; it would presumably understand that there are other frames of reference, but I don't think it would apply them, because that would lead to a worse outcome according to its current frame of reference.
So, if I understand you correctly, you now agree that a paperclip-maximizing agent won't utterly disregard paperclips relative to survival, because that would be suboptimal for its utility function.
However, if a paperclip-maximizing agent utterly disregarded paperclips relative to investigating the possibility of an objective goal, that would also be suboptimal for its utility function.
It sounds to me like you're saying that the intelligent agent will just disregard optimization of its utility function and instead investigate the possibility of an objective goal.
However, I don't agree with that. I don't see why an intelligent agent would do that if its utility function didn't already include a term for objective goals.
Again, I think a toy example might help to illustrate your position.
First of all, your conversation with Claude doesn't really refute the orthogonality thesis.
You and Claude conclude that, as Claude says, "The very act of computing and modeling requires choosing what to compute and model, which again requires some form of decision-making structure..."
That sentence seems quite reasonable, which suggests that anything intelligent can probably be construed to have a goal.
However, Claude suddenly makes a leap of logic and concludes not just that the goal exists, but that it must be maximum power-seeking. I don't see the logical connection there.
I believe that the flaw in the leap of logic is shown by my example above: If an AI already has a goal, and power-seeking...
I'll throw my own hat into the ring:
I disagree with your argument that, assuming the maximizer believes there is some chance of known threats, known unknown threats, and unknown unknown threats, "the intelligent maximizer should take care of these threats before actually producing paper clips. And this will probably never happen."
In your posts, you describe the paperclip maximizer as, simply, a paperclip maximizer. It does things to maximize paperclips, because its goal is to maximize paperclips.
(Well, in your posts you specifically assert that it doesn't do anything paperclip-related and instead spends all its effort on preserving itself.
"Every bit of energy spent on paperclips is not spent on self-preservation.... (read more)
There is a great deal of prior art, but it's not as simple as it sounds.
To summarize the problem with the most basic version of the idea: If the AI expects higher reward from the shutdown regime, it will deliberately cause a malignant state. And if the AI expects higher reward from the non-shutdown regime, it will fool the malignancy-detectors or subvert the tripwires. (It is not realistic to assume that it would be impossible to do so.)