# 14


Suppose we are building an agent, and we have a particular utility function U over states of the universe that we want the agent to optimize for. So we program into this agent a function CalculateUtility that computes the value of U given its current knowledge. Then we can program it to make decisions by searching through its available actions for the one that maximizes its expectation of the result of running CalculateUtility. But wait, how will an agent with this programming behave?

Suppose the agent has the opportunity (option A) to arrange to falsely believe the universe is in a state that is worth utility uFA, but this action really leads to a different state worth utility uTA, and a competing opportunity (option B) to actually achieve a state of the universe that has utility uB, with uTA < uB < uFA. Then the agent will expect that if it takes option A, its CalculateUtility function will return uFA, and if it takes option B, its CalculateUtility function will return uB. Since uFA > uB, the agent takes option A and achieves a state of the universe with utility uTA, which is worse than the utility uB it could have achieved by taking option B. This agent is not a very effective optimization process.[1] It would rather falsely believe that it has achieved its goals than actually achieve its goals. This sort of problem[2] is known as wireheading.

Let us back up a step, and instead program our agent to make decisions by searching through its available actions for the one whose expected results maximize its current calculation of CalculateUtility. Then the agent would calculate that option A gives it expected utility uTA and option B gives it expected utility uB. Since uB > uTA, it chooses option B and actually optimizes the universe. That is much better.
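The two decision rules can be contrasted in a minimal Python sketch. Everything here is an illustrative assumption: the numeric utilities (chosen only so that uTA < uB < uFA), the state names, and the `Outcome` type are hypothetical, and outcomes are treated as deterministic so that each "expectation" collapses to a single value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    true_state: str      # the state the universe actually ends up in
    believed_state: str  # the state the agent will believe it is in afterwards


# Hypothetical utilities satisfying uTA < uB < uFA.
STATE_UTILITY = {"state_TA": 1.0, "state_B": 5.0, "state_FA": 10.0}


def calculate_utility(state: str) -> float:
    """Stand-in for CalculateUtility: the value of U for a given state."""
    return STATE_UTILITY[state]


OPTIONS = {
    # Option A: the agent ends up falsely believing it reached state_FA.
    "A": Outcome(true_state="state_TA", believed_state="state_FA"),
    # Option B: the agent really reaches state_B and knows it.
    "B": Outcome(true_state="state_B", believed_state="state_B"),
}


def choose_by_expected_perception(options):
    """First design: maximize what the future self's CalculateUtility will return."""
    return max(options, key=lambda name: calculate_utility(options[name].believed_state))


def choose_by_expected_utility(options):
    """Second design: apply the current CalculateUtility to the predicted true state."""
    return max(options, key=lambda name: calculate_utility(options[name].true_state))


wirehead_choice = choose_by_expected_perception(OPTIONS)
optimizer_choice = choose_by_expected_utility(OPTIONS)
print(wirehead_choice, optimizer_choice)
```

Under these assumed numbers the first rule selects option A and the second selects option B. The only difference between the two functions is which state gets passed to `calculate_utility`.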

So, if you care about states of the universe, and not just your personal experience of maximizing your utility function, you should make choices that maximize your expected utility, not choices that maximize your expectation of perceived utility.
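One way to write the distinction down, using notation that is not in the original post: let $s$ be the state of the universe that results from action $a$, and let $\hat{s}$ be the state the agent will believe it is in after taking $a$. The two rules are then

$$
a^{*}_{\text{perceived}} = \arg\max_{a} \, \mathbb{E}\left[ U(\hat{s}) \mid a \right],
\qquad
a^{*}_{\text{actual}} = \arg\max_{a} \, \mathbb{E}\left[ U(s) \mid a \right].
$$

For the two options above, $\mathbb{E}[U(\hat{s}) \mid A] = u_{FA}$ and $\mathbb{E}[U(\hat{s}) \mid B] = u_{B}$, while $\mathbb{E}[U(s) \mid A] = u_{TA}$ and $\mathbb{E}[U(s) \mid B] = u_{B}$. Since $u_{TA} < u_{B} < u_{FA}$, the first rule picks A and the second picks B.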

1. We might have expected this to work, because we built our agent to have beliefs that correspond to the actual state of the world.

2. A similar problem occurs if the agent has the opportunity to modify its CalculateUtility function so that it returns large values for states of the universe that would have occurred anyway (or for any state of the universe).
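The self-modification variant in note 2 can be sketched as follows. This is a toy model under stated assumptions: the action names, state names, utility values, and the `hacked_utility` function are all hypothetical, and each action is modeled as a pair of the resulting state and the utility function the agent holds afterwards.

```python
from typing import Callable, Dict, Tuple

UtilityFn = Callable[[str], float]


def current_utility(state: str) -> float:
    """The agent's present CalculateUtility (hypothetical values)."""
    return {"status_quo": 0.0, "improved": 5.0}[state]


def hacked_utility(state: str) -> float:
    """A rewritten CalculateUtility that returns a large value for any state."""
    return 1000.0


# Each action maps to (resulting state, utility function held afterwards).
ACTIONS: Dict[str, Tuple[str, UtilityFn]] = {
    "self_modify": ("status_quo", hacked_utility),
    "do_real_work": ("improved", current_utility),
}


def choose_with_future_function(actions: Dict[str, Tuple[str, UtilityFn]]) -> str:
    """Score each action with whatever function the agent would hold afterwards."""
    return max(actions, key=lambda a: actions[a][1](actions[a][0]))


def choose_with_current_function(
    actions: Dict[str, Tuple[str, UtilityFn]], utility: UtilityFn
) -> str:
    """Score each action's resulting state with the function the agent holds now."""
    return max(actions, key=lambda a: utility(actions[a][0]))


print(choose_with_future_function(ACTIONS))
print(choose_with_current_function(ACTIONS, current_utility))
```

In this sketch, scoring with the post-action function favors `self_modify`, while scoring with the current function favors `do_real_work`, because the current function sees no gain in a state that would have occurred anyway.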


Drawing a line between utility and perception of utility is difficult for humans, because happiness can be seen either as utility itself or as a sometimes-fallible utility detector. If you treat happiness as a terminal value, you're open to wireheading (or the best available equivalent, which today would be psychoactive chemicals). On the other hand, if you treat it as only a proxy for some other utility function, then you ought to treat most forms of entertainment, and non-reproductive sex, as giving only counterfeit utilons. There are workarounds, like accepting only utilons from sources that were available in the ancestral environment, or excluding mind-altering drugs as a special case, but they leave wirehead-enabling loopholes.

> Drawing a line between utility and perception of utility is difficult for humans, because happiness can be seen either as utility itself or as a sometimes-fallible utility detector.

If it were easy and automatic to make this distinction, I would not have felt compelled to call attention to it.

[-][anonymous]10y 2

> If you treat it as only a proxy for some other utility function, then you ought to treat most forms of entertainment, and non-reproductive sex, as giving only counterfeit utilons.

Not necessarily; it depends on what that other utility function is.

> This agent is not a very effective optimization process. It would rather falsely believe that it has achieved its goals than actually achieve its goals.

If it's an AI, and it has the predicate 'goal(foo(bar))', and the semantics of its knowledge representation are that the presence of 'foo(bar)' in its knowledge base means "I believe foo(bar)" (which is the usual way of doing it), then anything that writes 'foo(bar)' into its knowledge base achieves its goals.

The typical AI representational system has no way to distinguish a true fact from a believed fact. There's no reason to make such a distinction; it would be misleading.

You're going astray when you say,

> Suppose the agent has the opportunity (option A) to arrange to falsely believe the universe is in a state that is worth utility uFA but this action really leads to a different state worth utility uTA,

A rational agent can't detect the existence of option A. It would have to both infer that A leads to utility uTA, and at the same time infer that it leads to uFA.

> If it's an AI, and it has the predicate 'goal(foo(bar))', and the semantics of its knowledge representation are that the presence of 'foo(bar)' in its knowledge base means "I believe foo(bar)" (which is the usual way of doing it), then anything that writes 'foo(bar)' into its knowledge base achieves its goals.

Nope. One shouldn't conclude from Theorem(I'll answer "42") that the answer should be "42". There is a difference between believing you believe something, and believing it. Believing something is enough to believe you believe it, but not conversely. Only from outside the system can you make that step, looking at the system and pointing out that if it believes something, and it really did do everything correctly, then it must be true.

I am speaking of simple, straightforward, representational semantics of a logic, and the answer I gave is correct. You are talking about humans, and making a sophisticated philosophical argument, and trying to map it onto logic by analogy. Which is more reliable?

As was indicated by the link, I'm talking about Löb's theorem; the informal discussion about what people (or formal agents) should believe is merely one application/illustration of that idea.

> The typical AI representational system has no way to distinguish a true fact from a believed fact.

What I am arguing it should do is distinguish between believing a proposition and believing that some other AI believes a proposition, especially in the case where the other AI is its future self.

> A rational agent can't detect the existence of option A. It would have to both infer that A leads to utility uTA, and at the same time infer that it leads to uFA.

No. It would have to infer that A leads to utility uTA and that it leads to the AI in the future believing it has led to uFA.

> What I am arguing it should do is distinguish between believing a proposition and believing that some other AI believes a proposition, especially in the case where the other AI is its future self.

It's very important to be able to specify who believes a proposition. But I don't see how the AI can compute that it is going to believe a proposition, without believing that proposition. (We're not talking about propositions that the AI doesn't currently believe because the preconditions aren't yet satisfied; we're talking about an AI that is able to predict that it's going to be fooled into believing something false.)

> > A rational agent can't detect the existence of option A. It would have to both infer that A leads to utility uTA, and at the same time infer that it leads to uFA.
>
> No. It would have to infer that A leads to utility uTA and that it leads to the AI in the future believing it has led to uFA.

Please give an example in which an AI can both infer that A leads to utility uTA, and that the AI will believe it has led to uFA, that does not involve the AI detecting errors in its own reasoning and not correcting them.

This has been covered here before: see The Domain of Your Utility Function, as well as Morality as Fixed Computation (especially the bit about Type 1 vs. Type 2 calculators).

I think you are describing an important distinction. The main argument that convinces me that I actually have a utility function (i.e. a function whose expectation I am trying to maximize) is von Neumann-Morgenstern, since I do try to conform to their rationality axioms. This utility is a function defined on options, not on perceived outcomes, so from this perspective, utility is by definition something whose expectation you optimize, not something whose expected perception you optimize (unless your preferences happen to depend only on your future perceptions). If their axioms were rephrased entirely in terms of my future perceptions, I would be intentionally not following them in thought experiments involving amnesia, for example.

What does it even mean that the "universe is in a state that is worth utility uFA but really leads to a state worth utility uTA"? Utility functions, however worthless they really are, only make sense relative to some agent's opinions.

Do you mean the agent's utility function doesn't follow his programmer's utility function; the agent's utility function is inconsistent; the agent's utility function is OK but his analysis of the world is inconsistent, so he gets confused; we figured out the One True Utility Function but decided not to program it into the agent; or what?

The agent will falsely believe the universe is in one state, with a certain utility, but in reality the universe is in a different state, with a different utility.

I have reworded that sentence to hopefully make this clearer.