Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

After explaining riggable learning processes, we can now define influenceable (and uninfluenceable) learning processes.

Recall that the (unriggable) influence problem is due to agents randomising their preferences, as a sort of artificial "learning" process, if the real learning process is slow or incomplete.

Suppose we had a learning process that it wasn't possible to influence. What would that resemble? It seems like it must be something where the outcome of the learning process depends only upon some outside fact about the universe, a fact the agent has no control over.

So with that in mind, define:

Definition: A learning process on the POMDP is initial-state determined if there exists a function such that factors through knowledge of the initial state . In other words:

Thus uncertainty about the correct reward function comes only from uncertainty about the initial state .
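
As a sketch of what "factors through knowledge of the initial state" amounts to, here is one way to write the condition. The notation is assumed for illustration rather than taken from the post: $P$ is the learning process, $\mu$ the POMDP, $h$ a history of actions and observations, $s_0$ the initial state, $\mathcal{S}$ the state space, and $f$ the function from initial states to distributions over reward functions $R$.

```latex
P(R \mid h) \;=\; \sum_{s \in \mathcal{S}} \mu(s_0 = s \mid h)\, f(R \mid s)
```

Read this way, the history $h$ matters only through what it reveals about $s_0$: the agent's actions can change how much it learns about the initial state, but not which reward function that state corresponds to.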

This is a start, but the definition is incomplete. To finalise it, we need the concept of counterfactually equivalent POMDPs:

Definition: A learning process on is uninfluenceable if there exists a counterfactually equivalent such that is initial-state determined on .

Though the definitions of unriggable and uninfluenceable seem quite different, they're actually closely related, as we'll see in a subsequent post. Uninfluenceable can be seen as "unriggable in all background info about the universe". In old notation terms, rigging is explored in the sophisticated cake or death problem, (unbiased) influence in the ultra-sophisticated version.


Consider the environment presented here:

In this POMDP (actually MDP, since it's fully observed), the agent can wait for a human to confirm the correct reward function (action ) or randomise its reward (action ). After either action, the agent gets equally likely feedback or (states and , ).

We have two plausible learning processes: , where the agent learns only from the human input, and , where the agent learns from either action. Technically:

  • ,
  • for all ,
  • ,

with all other probabilities zero.

Now, is counterfactually equivalent to :

And on , is clearly initial-state determined (with ), and is thus uninfluenceable on and .

On the other hand, is initial-state determined on :

However, is not counterfactually equivalent to . In fact, there is no POMDP counterfactually equivalent to on which is initial-state determined, so is not uninfluenceable.
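
The argument above can be checked mechanically on a toy version of the example. The following is a minimal sketch under stated assumptions, not the post's own construction: the two hidden-state models, the function names, and the choice that the human-only process stays at a 50/50 prior after randomising are all hypothetical. Counterfactual equivalence is approximated here by comparing the joint distribution of what the agent would observe under each action.

```python
from fractions import Fraction
from itertools import product

ACTIONS = ("wait", "randomise")
HALF = Fraction(1, 2)

# A model maps each hidden initial state to (prior, {action: observation}).
def independent_model():
    # Assumed reading of the original environment: the human's answer h
    # and the random bit c are fixed at the start and are independent.
    return {(h, c): (Fraction(1, 4), {"wait": h, "randomise": c})
            for h, c in product((0, 1), repeat=2)}

def correlated_model():
    # Hypothetical alternative: a single hidden bit drives both feedbacks.
    return {s: (HALF, {"wait": s, "randomise": s}) for s in (0, 1)}

def point(i):
    # Distribution putting all mass on reward function R_i.
    return {0: Fraction(int(i == 0)), 1: Fraction(int(i == 1))}

def p_human_only(action, obs):
    # Learns only from the human; assumed to keep the prior otherwise.
    return point(obs) if action == "wait" else {0: HALF, 1: HALF}

def p_either(action, obs):
    # Learns from whatever feedback follows either action.
    return point(obs)

def initial_state_determined(model, learn, f):
    # Check learn(history) == sum_s Pr(s | history) * f(s) for each history.
    for action in ACTIONS:
        for obs in (0, 1):
            consistent = [(p, s) for s, (p, out) in model.items()
                          if out[action] == obs]
            total = sum(p for p, _ in consistent)
            if total == 0:
                continue  # history never occurs in this model
            mix = {r: sum(p * f(s)[r] for p, s in consistent) / total
                   for r in (0, 1)}
            if mix != learn(action, obs):
                return False
    return True

def counterfactual_joint(model):
    # Joint distribution of (observation if waiting, observation if randomising).
    joint = {}
    for prior, out in model.values():
        key = (out["wait"], out["randomise"])
        joint[key] = joint.get(key, Fraction(0)) + prior
    return joint
```

In this sketch, `p_human_only` passes the check on `independent_model()` with `f` reading off the human's bit, while `p_either` only passes on `correlated_model()`, whose counterfactual joint distribution differs from the independent one. That mirrors the shape of the argument: the second process is initial-state determined only on a model that changes the counterfactuals.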


My model of a person who is optimistic about value learning (e.g. Stuart Russell, Dylan Hadfield-Menell) says something like:

Well, of course the learning process P should be initial-state-determined! That's how all the value learning processes defined in the literature (IRL, CIRL) work. Why would you ever consider a learning process that doesn't treat the true human values as a fact already determined by the initial state? It seems like they have obvious problems (i.e. bias/influence). So I don't see the motivation for using this formalism instead of IRL/CIRL, in which (the fact that the learning process is initial state determined) is baked in.

To which my model of a more pessimistic position replies:

Human terminal values don't actually exist at the initial time. They're constructed through a reflection process that occurs over time. It's not like the fact that (my terminal values think X is good) already exists and I just have trouble acting rationally on this fact. Any model in which the terminal values are causally prior to behavior is going to be inaccurate, and will therefore learn the wrong values. So we have to see value learning as "interpretation" rather than "learning a historical fact", and somehow do this without running into problems with bias/influence.

My steelman of the more pessimistic position seems to partially match your post here; I just want to check that this is what you think the motivation for your current formalism is.

I think it's important to distinguish between ambitious and narrow value learning here. It does seem plausible that many/most narrow values do exist at the initial time step, so something like IRL should be able to recover them. On the other hand, preferences over long-term outcomes probably don't exist at the initial time in enough detail to act on.

IMO the main problem with ambitious value learning is that the only plausible way of doing it goes through a trusted reflection process (e.g. HCH, or having the AI doing philosophy using trusted methods). And if we trust the reflection process to construct preferences over long-term outcomes, we might as well use it to directly decide what actions to take, so ambitious value learning is FAI-complete. (In other words, there isn't a clear advantage to asking the reflection process "how valuable is X" instead of "which action should the AI take"; they seem about as difficult to answer correctly).

IMO, the main problem with narrow value learning is that there isn't a very good story for how an agent that is smarter than its overseers can pursue its overseers' instrumental values, given that its overseers' instrumental values are incoherent from its perspective; this seems related to the hard problem of corrigibility. One way to resolve this is to make sure the overseer is smarter than the value-learning agent at each step, in which case narrow value learning is an implementation strategy for ALBA (and the entire setup inherits ALBA's difficulties). Another way is to figure out how the AI can pursue the instrumental values of an agent weaker than itself.

I am curious whether you are thinking more of ambitious or narrow value learning when you write posts like this one.

I'm thinking counterfactually (that's a subsequent post, which replaces the "stratified learning" one), so the thing that distinguishes ambitious from narrow learning is that narrow learning is the same in many counterfactual situations, while ambitious learning is much more floppy/dependent on the details of the counterfactual.

OK, I didn't understand this comment at all but maybe I should wait until you post on counterfactuals.

I look at these issues later on in the paper. And there are suggestions (mostly informal) that do have problems with bias and influence. Basically, almost all learning processes that involve human interaction.

As for CIRL, I think it's bias free in principle, but not in practice, for reasons roughly analogous to yours.

Hmm. When you say "human terminal values don’t actually exist at the initial time," what do you mean by "exist"? IMO, they exist in the sense that they are implicit in the algorithm the human brain is executing. They are causally prior to behavior, in the sense that the algorithm is causally prior to the output of the algorithm.

That is, they are implicit rather than explicit because, indeed, we can in principle interpret the same algorithm as a consequentialist in different, mutually inconsistent, ways. However, not all interpretations are born equal: some will be more natural, some more contrived. I expect that some sort of Occam's razor should select the interpretations that we would accept as "correct": otherwise, why is the concept of values meaningful at all?

Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.

(This feels at least partially like an argument about definitions but clarifying the definitions would probably be useful)

I think I was previously confusing terminal values with ambitious values, and am now not confusing them.

Ambitious values are about things like how the universe should be in the long run, and are coherent (e.g. they're a utility function over physical universe states). Narrow values are about things like whether you're currently having a nice time and being in control of your AI systems, and are not coherent. Ambitious and narrow values can be instrumental or terminal.

The human cognitive algorithm is causally prior to behavior. It is also causally prior to human ambitious values. But human ambitious values are not causally prior to human behavior. Making human preferences coherent can only be done through a reflection process, so ambitious values come at the end of this process and can't go backwards in logical time to influence behavior.

I.e. algorithm → behavior, algorithm → ambitious values.

IRL says values → behavior, which is wrong in the case of ambitious values.

Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.

Caring about this reflection process seems like a narrow value.

See my comment here about why narrow value learning is hard.