Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I showed in a previous post that impact penalties were time-inconsistent. But why is this? There are two obvious possibilities:

  1. The impact penalty is inconsistent because it includes an optimisation process over the possible policies of the agent (eg when defining the $Q$-values in attainable utility preservation; see the sketch after this list).
  2. The impact penalty is inconsistent because of how it's defined at each step (eg because the stepwise inaction baseline is reset every turn).
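
For reference, the attainable utility preservation penalty looks roughly like this (a simplified sketch: the auxiliary reward functions $R_i$, the noop action $a_\varnothing$, and the exact scaling are abstracted away):

  • $\text{Penalty}(s_t, a_t) = \sum_i \big|\, Q_{R_i}(s_t, a_t) - Q_{R_i}(s_t, a_\varnothing) \,\big|$,
  • $Q_{R_i}(s, a) = \max_\pi \mathbb{E}\big[\, \textstyle\sum_{k \geq 0} \gamma^k R_i(s_{t+k}) \,\big|\, s_t = s, a_t = a, \pi \,\big]$.

The $\max_\pi$ is the optimisation over the agent's own policies that the first option is pointing at.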

It turns out the first answer is the correct one. And indeed, we get:

  • If the impact penalty is not defined in terms of optimising over the agent's actions or policies, then it is kinda time-consistent.

What is the "kinda" doing there? Well, as we'll see, there is a subtle semantics vs syntax issue going on.

Time-consistent rewards

In attainable utility preservation, and other impact penalties, the reward is ultimately a function of the current state $s_t$ and a counterfactual state $s'_t$.

For the initial state baseline and the initial-state inaction baseline, the counterfactual state $s'_t$ is determined independently of anything the agent has actually done. So these baselines are given by a function $f$:

  • $s'_t = f(\mu, \mathcal{A}, t)$.

Here, $\mu$ is the environment and $\mathcal{A}$ is the set of actions available to the agent. Since $\mu$ is fixed, we can re-write this as:

  • $s'_t = f(\mathcal{A}, t)$.

Now, if the impact measure $\mathrm{IM}$ is a function of $s_t$ and $s'_t$ only, then it is... a reward function, with $R(s_t) = \mathrm{IM}\big(s_t, f(\mathcal{A}, t)\big)$. Thus, since this is just a reward function, the agent is time-consistent.
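
As a minimal sketch of this case (the environment interface `env.step`, the noop action `env.noop`, and the function `impact_measure` are all hypothetical placeholders), the baseline states can be precomputed before the agent acts, so the penalty collapses to an ordinary, time-indexed reward function of $s_t$:

```python
# Sketch: with the initial-state / initial inaction baselines, the counterfactual
# states are fixed by a noop rollout from s0, independently of what the agent does.

def inaction_rollout(env, s0, horizon):
    """Roll the environment forward from s0 under the noop action."""
    states = [s0]
    s = s0
    for _ in range(horizon):
        s = env.step(s, env.noop)  # hypothetical environment interface
        states.append(s)
    return states

def penalty_as_reward(env, s0, horizon, impact_measure):
    """Turn the impact penalty into a plain reward function R(s_t, t)."""
    baseline = inaction_rollout(env, s0, horizon)  # computed once, at t = 0

    def reward(s_t, t):
        # baseline[t] does not depend on the agent's actions or action set,
        # so this really is just a (time-indexed) reward function.
        return -impact_measure(s_t, baseline[t])

    return reward
```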

Now let's look at the stepwise inaction baseline. In this case, $s'_t$ is determined by an inaction rollout from the prior state $s_{t-1}$. So the impact measure is actually a function of:

  • $s_{t-1}$, $s_t$, and $\mathcal{A}$, via $\mathrm{IM}\big(s_t, g(s_{t-1}, \mathcal{A}, t)\big)$, where $g$ is the inaction rollout.

Again, if $\mathrm{IM}$ is in fact independent of $\mathcal{A}$, the set of the agent's actions (including for the rollouts from $s_{t-1}$), then this is a reward function: one that is a function of the previous state and the current state, but that's quite common for reward functions.

So again, the agent has no interest in constraining its own future actions.
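
The same point as a sketch for the stepwise inaction baseline (same hypothetical `env` interface and `impact_measure` placeholder as above): the counterfactual state is a noop rollout from the previous state, so the penalty is a reward function of $(s_{t-1}, s_t)$, as long as the impact measure itself never optimises over the agent's actions:

```python
def stepwise_penalty_as_reward(env, impact_measure, rollout_steps=1):
    """Impact penalty under the stepwise inaction baseline, as R(s_prev, s_t)."""

    def reward(s_prev, s_t):
        # Counterfactual: the state the world would be in had the agent done
        # nothing at the previous step (a short noop rollout from s_prev).
        s_cf = s_prev
        for _ in range(rollout_steps):
            s_cf = env.step(s_cf, env.noop)
        # Provided impact_measure never looks at the agent's action set, this
        # depends only on (previous state, current state): an ordinary reward.
        return -impact_measure(s_t, s_cf)

    return reward
```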

Semantics vs syntax

Back to "kinda". The problem is that we've been assuming that actions and states are very distinct objects. Suppose that, as in the previous post, an agent at time $t$ wants to prevent itself from taking action $a_S$ (go south) at time $t+1$. Let $\mathcal{A}$ be the agent's full set of actions, and $\mathcal{A}_{-S} = \mathcal{A} \setminus \{a_S\}$ the same set without $a_S$.

So now the agent might be time-inconsistent, since it's possible that:

  • $\mathrm{IM}\big(s_{t+1}, s'_{t+1}\big)$ evaluated using the full action set $\mathcal{A}$ $\;\neq\;$ $\mathrm{IM}\big(s_{t+1}, s'_{t+1}\big)$ evaluated using $\mathcal{A}_{-S}$.

But now, instead of denoting "can't go south" by reducing the action set, we could instead denote it by expanding the state set. So define $\bar{s}_{t+1}$ as the same state as $s_{t+1}$, except that in $\bar{s}_{t+1}$, taking the action $a_S$ is the same as taking the noop action $a_\varnothing$. Everything is (technically) independent of $\mathcal{A}$, so the agent is "time-consistent".

But, of course, the two setups, restricted action set or extended state set, are almost completely isomorphic, even though, according to our result above, the agent would be time-consistent in the second case. It would be time-consistent in that it would not want to change the actions of its future self; instead, it would just put its future self in a state where some actions were, in practice, unavailable.
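
To make the near-isomorphism concrete, here is a toy gridworld sketch (the state and action names are made up for illustration): restricting the action set and remapping "south" to "noop" inside an extended state produce the same reachable trajectories, but only the first visibly changes $\mathcal{A}$.

```python
# Two ways of encoding "the agent can no longer go south", in a toy gridworld
# where base states are (x, y) positions.

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0),
         "west": (-1, 0), "noop": (0, 0)}

def step(state, action):
    """Original transition function."""
    (x, y), (dx, dy) = state, MOVES[action]
    return (x + dx, y + dy)

# Option 1: restrict the action set (the action set A visibly shrinks).
actions_no_south = [a for a in MOVES if a != "south"]

# Option 2: extend the state with a "no_south" flag and remap the action
# inside the transition function (the action set A is formally unchanged).
def step_extended(state, action):
    base, flag = state
    if flag == "no_south" and action == "south":
        action = "noop"
    return (step(base, action), flag)

# Either way, "going south" from the flagged state achieves nothing:
print(step_extended(((0, 0), "no_south"), "south"))  # ((0, 0), 'no_south')
print(step((0, 0), "noop"))                          # (0, 0)
```

By the criterion above, only the second version counts as "time-consistent", even though the agent ends up with the same effective options either way.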

So it seems that, unfortunately, it's not enough to be a reward-maximiser (or a utility maximiser) in order to be time-consistent in practice.

1 comment

1.

Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.


2.

It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.

So, assuming it's some external entity E that computes the impact penalty, it must have the (dis?)ability to account for the agent's modified action space when making this computation, for the agent to successfully reduce the penalty as in the earlier example.

Something agent A does must signal to E that A's action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, particularly, something of the form of: "I will never do anything that gives me more than the minimum penalty", or in terms of actions: "I will deterministically follow the policy that gives me the minimum penalty while achieving my goals." E, if it believed A, would not include high penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.