Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I showed in a previous post that impact penalties were time-inconsistent. But why is this? There are two obvious possibilities:

  1. The impact penalty is inconsistent because it includes an optimisation process over the possible policies of the agent (eg when defining the $Q$-values in attainable utility preservation; see the sketch after this list).
  2. The impact penalty is inconsistent because of how it's defined at each step (eg because the stepwise inaction baseline is reset every turn).
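
For reference, the attainable utility preservation penalty looks roughly like this (a simplified sketch: the auxiliary reward functions $R_i$, the noop action $a_\varnothing$, and the exact scaling are abstracted away):

  • $\text{Penalty}(s_t, a_t) = \sum_i \big|\, Q_{R_i}(s_t, a_t) - Q_{R_i}(s_t, a_\varnothing) \,\big|$,
  • $Q_{R_i}(s, a) = \max_\pi \mathbb{E}\big[\, \textstyle\sum_{k \geq 0} \gamma^k R_i(s_{t+k}) \,\big|\, s_t = s, a_t = a, \pi \,\big]$.

The $\max_\pi$ is the optimisation over the agent's own policies that the first option is pointing at.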

It turns out the first answer is the correct one. And indeed, we get:

  • If the impact penalty is not defined in terms of optimising over the agent's actions or policies, then it is kinda time-consistent.

What is the "kinda" doing there? Well, as we'll see, there is a subtle semantics vs syntax issue going on.

Time-consistent rewards

In attainable utility preservation, and other impact penalties, the reward is ultimately a function of the current state $s_t$ and a counterfactual state $s'_t$.

For the initial state baseline and the initial-state inaction baseline, the counterfactual state $s'_t$ is determined independently of anything the agent has actually done. So these baselines are given by a function $f$:

  • $s'_t = f(\mu, \mathcal{A}, t)$.

Here, $\mu$ is the environment and $\mathcal{A}$ is the set of actions available to the agent. Since $\mu$ is fixed, we can re-write this as:

  • $s'_t = f(\mathcal{A}, t)$.

Now, if the impact measure $\mathrm{IM}$ is a function of $s_t$ and $s'_t$ only, then it is... a reward function, with $R(s_t) = \mathrm{IM}\big(s_t, f(\mathcal{A}, t)\big)$. Thus, since this is just a reward function, the agent is time-consistent.
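
As a minimal sketch of this case (the environment interface `env.step`, the noop action `env.noop`, and the function `impact_measure` are all hypothetical placeholders), the baseline states can be precomputed before the agent acts, so the penalty collapses to an ordinary, time-indexed reward function of $s_t$:

```python
# Sketch: with the initial-state / initial inaction baselines, the counterfactual
# states are fixed by a noop rollout from s0, independently of what the agent does.

def inaction_rollout(env, s0, horizon):
    """Roll the environment forward from s0 under the noop action."""
    states = [s0]
    s = s0
    for _ in range(horizon):
        s = env.step(s, env.noop)  # hypothetical environment interface
        states.append(s)
    return states

def penalty_as_reward(env, s0, horizon, impact_measure):
    """Turn the impact penalty into a plain reward function R(s_t, t)."""
    baseline = inaction_rollout(env, s0, horizon)  # computed once, at t = 0

    def reward(s_t, t):
        # baseline[t] does not depend on the agent's actions or action set,
        # so this really is just a (time-indexed) reward function.
        return -impact_measure(s_t, baseline[t])

    return reward
```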

Now let's look at the stepwise inaction baseline. In this case, $s'_t$ is determined by an inaction rollout from the prior state $s_{t-1}$. So the impact measure is actually a function of:

  • $s_{t-1}$, $s_t$, and $\mathcal{A}$, via $\mathrm{IM}\big(s_t, g(s_{t-1}, \mathcal{A}, t)\big)$, where $g$ is the inaction rollout.

Again, if $\mathrm{IM}$ is in fact independent of $\mathcal{A}$, the set of the agent's actions (including for the rollouts from $s_{t-1}$), then this is a reward function: one that is a function of the previous state and the current state, but that's quite common for reward functions.

So again, the agent has no interest in constraining its own future actions.
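
The same point as a sketch for the stepwise inaction baseline (same hypothetical `env` interface and `impact_measure` placeholder as above): the counterfactual state is a noop rollout from the previous state, so the penalty is a reward function of $(s_{t-1}, s_t)$, as long as the impact measure itself never optimises over the agent's actions:

```python
def stepwise_penalty_as_reward(env, impact_measure, rollout_steps=1):
    """Impact penalty under the stepwise inaction baseline, as R(s_prev, s_t)."""

    def reward(s_prev, s_t):
        # Counterfactual: the state the world would be in had the agent done
        # nothing at the previous step (a short noop rollout from s_prev).
        s_cf = s_prev
        for _ in range(rollout_steps):
            s_cf = env.step(s_cf, env.noop)
        # Provided impact_measure never looks at the agent's action set, this
        # depends only on (previous state, current state): an ordinary reward.
        return -impact_measure(s_t, s_cf)

    return reward
```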

Semantics vs syntax

Back to "kinda". The problem is that we've been assuming that actions and states are very distinct objects. Suppose that, as in the previous post, an agent at time $t$ wants to prevent itself from taking action $a_S$ (go south) at time $t+1$. Let $\mathcal{A}$ be the agent's full set of actions, and $\mathcal{A}_{-S} = \mathcal{A} \setminus \{a_S\}$ the same set without $a_S$.

So now the agent might be time-inconsistent, since it's possible that:

  • $\mathrm{IM}\big(s_{t+1}, s'_{t+1}\big)$ evaluated using the full action set $\mathcal{A}$ $\;\neq\;$ $\mathrm{IM}\big(s_{t+1}, s'_{t+1}\big)$ evaluated using $\mathcal{A}_{-S}$.

But now, instead of denoting "can't go south" by reducing the action set, we could instead denote it by expanding the state set. So define $\bar{s}_{t+1}$ as the same state as $s_{t+1}$, except that in $\bar{s}_{t+1}$, taking the action $a_S$ is the same as taking the noop action $a_\varnothing$. Everything is (technically) independent of $\mathcal{A}$, so the agent is "time-consistent".

But, of course, the two setups, restricted action set or extended state set, are almost completely isomorphic, even though, according to our result above, the agent would be time-consistent in the second case. It would be time-consistent in that it would not want to change the actions of its future self; instead, it would just put its future self in a state where some actions were, in practice, unavailable.
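
To make the near-isomorphism concrete, here is a toy gridworld sketch (the state and action names are made up for illustration): restricting the action set and remapping "south" to "noop" inside an extended state produce the same reachable trajectories, but only the first visibly changes $\mathcal{A}$.

```python
# Two ways of encoding "the agent can no longer go south", in a toy gridworld
# where base states are (x, y) positions.

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0),
         "west": (-1, 0), "noop": (0, 0)}

def step(state, action):
    """Original transition function."""
    (x, y), (dx, dy) = state, MOVES[action]
    return (x + dx, y + dy)

# Option 1: restrict the action set (the action set A visibly shrinks).
actions_no_south = [a for a in MOVES if a != "south"]

# Option 2: extend the state with a "no_south" flag and remap the action
# inside the transition function (the action set A is formally unchanged).
def step_extended(state, action):
    base, flag = state
    if flag == "no_south" and action == "south":
        action = "noop"
    return (step(base, action), flag)

# Either way, "going south" from the flagged state achieves nothing:
print(step_extended(((0, 0), "no_south"), "south"))  # ((0, 0), 'no_south')
print(step((0, 0), "noop"))                          # (0, 0)
```

By the criterion above, only the second version counts as "time-consistent", even though the agent ends up with the same effective options either way.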

So it seems that, unfortunately, it's not enough to be a reward-maximiser (or a utility maximiser) in order to be time-consistent in practice.

1 comment

1.

Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.


2.

It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.

So, assuming it's some external entity E that computes the impact penalty, it must have the (dis?)ability to account for the agent's modified action space when making this computation, for the agent to successfully reduce the penalty as in the earlier example.

Something agent A does must signal to E that A's action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, particularly, something of the form of: "I will never do anything that gives me more than the minimum penalty", or in terms of actions: "I will deterministically follow the policy that gives me the minimum penalty while achieving my goals." E, if it believed A, would not include high penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.