Uninfluenceable agents


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

After explaining riggable learning processes, we can now define influenceable (and uninfluenceable) learning processes.

Recall that the (unriggable) influence problem is due to agents randomising their preferences, as a sort of artificial 'learning' process, if the real learning process is slow or incomplete.

Suppose we had a learning process that it wasn't possible to influence. What would that resemble? It seems it must be one where the outcome of the learning process depends only upon some outside fact about the universe, a fact the agent has no control over.

So with that in mind, define:

Definition: A learning process on the POMDP is initial-state determined if there exists a function such that factors through knowledge of the initial state . In other words:

Thus uncertainty about the correct reward function comes only from uncertainty about the initial state .
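To make the definition concrete, here is a minimal Python sketch of a learning process that factors through the initial state. It is an illustration under stated assumptions, not the post's own formalism: the names `posterior_over_initial_states`, `initial_state_determined_process`, `prior`, `likelihood` and `f`, and the two-state toy world, are all hypothetical. The belief over reward functions after a history is computed only by first updating the belief over initial states, then pushing it through a fixed map from initial states to reward distributions.

```python
from fractions import Fraction


def posterior_over_initial_states(prior, likelihood, history):
    """Bayes' rule: P(s0 = s | history) is proportional to prior[s] * likelihood(history, s)."""
    weights = {s: p * likelihood(history, s) for s, p in prior.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def initial_state_determined_process(prior, likelihood, f):
    """Build a learning process that depends on the history only via the initial state.

    f maps each initial state to a distribution over reward functions
    (a dict from reward name to probability). This is an assumed encoding.
    """
    def learning_process(history):
        post = posterior_over_initial_states(prior, likelihood, history)
        belief = {}
        for s, p_s in post.items():
            for reward, p_r in f(s).items():
                belief[reward] = belief.get(reward, 0) + p_s * p_r
        return belief
    return learning_process


# Hypothetical toy world: two equally likely initial states, each fixing a reward.
prior = {"a": Fraction(1, 2), "b": Fraction(1, 2)}


def likelihood(history, s):
    # Assumption: every observation simply reveals the initial state.
    return 1 if all(obs == s for obs in history) else 0


def f(s):
    return {"R_" + s: 1}


learn = initial_state_determined_process(prior, likelihood, f)
print(learn(()))       # belief before any observation
print(learn(("a",)))   # belief after one observation revealing state "a"
```

In this sketch nothing the agent does can shift the belief over rewards except by revealing more about the initial state, which is the property the definition is meant to capture.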

This definition is only partial. To complete it, we need the concept of counterfactually equivalent POMDPs:

Definition: A learning process on is uninfluenceable if there exists a counterfactually equivalent such that is initial-state determined on .

Though the definitions of unriggable and uninfluenceable seem quite different, they're actually closely related, as we'll see in a subsequent post. Uninfluenceable can be seen as 'unriggable given all background info about the universe'. In old notation terms, rigging is explored in the sophisticated cake or death problem, and (unbiased) influence in the ultra-sophisticated version.


Consider the environment presented here:

In this POMDP (actually an MDP, since it's fully observed), the agent can wait for a human to confirm the correct reward function (action ) or randomise its reward (action ). After either action, the agent gets equally likely feedback or (states and , ).

We have two plausible learning processes: , where the agent learns only from the human input, and , where the agent learns from either action. Technically:

  • ,
  • for all ,
  • ,

with all other probabilities zero.

Now, is counterfactually equivalent to :

And on , is clearly initial-state determined (with ), and is thus uninfluenceable on and .

On the other hand, is initial-state determined on :

However, is not counterfactually equivalent to . In fact, there is no POMDP counterfactually equivalent to on which is initial-state determined, so is not uninfluenceable.