Uninfluenceable learning agents

[-]jessicata9yΩ230

My model of a person who is optimistic about value learning (e.g. Stuart Russell, Dylan Hadfield-Menell) says something like:

Well, of course the learning process P should be initial-state-determined! That's how all the value learning processes defined in the literature (IRL, CIRL) work. Why would you ever consider a learning process that doesn't treat the true human values as a fact already determined by the initial state? It seems like they have obvious problems (i.e. bias/influence). So I don't see the motivation for using this formalism instead of IRL/CIRL, in which (the fact that the learning process is initial state determined) is baked in.

To which my model of a more pessimistic position replies:

Human terminal values don't actually exist at the initial time. They're constructed through a reflection process that occurs over time. It's not like the fact that (my terminal values think X is good) already exists and I just have trouble acting rationally on this fact. Any model in which the terminal values are causally prior to behavior is going to be inaccurate, and will therefore learn the wrong values. So we have to see value learning as "interpretation" rather than "learning a historical fact", and somehow do this without running into problems with bias/influence.

My steelman of the more pessimistic position seems to partially match your post here; I just want to check that this is what you think the motivation for your current formalism is.

[-]jessicata9yΩ120

I think it's important to distinguish between ambitious and narrow value learning here. It does seem plausible that many/most narrow values do exist at the initial time step, so something like IRL should be able to recover them. On the other hand, preferences over long-term outcomes probably don't exist at the initial time in enough detail to act on.

IMO the main problem with ambitious value learning is that the only plausible way of doing it goes through a trusted reflection process (e.g. HCH, or having the AI do philosophy using trusted methods). And if we trust the reflection process to construct preferences over long-term outcomes, we might as well use it to directly decide what actions to take, so ambitious value learning is FAI-complete. (In other words, there isn't a clear advantage to asking the reflection process "how valuable is X" instead of "which action should the AI take"; they seem about as difficult to answer correctly.)

IMO, the main problem with narrow value learning is that there isn't a very good story for how an agent that is smarter than its overseers can pursue its overseers' instrumental values, given that its overseers' instrumental values are incoherent from its perspective; this seems related to the hard problem of corrigibility. One way to resolve this is to make sure the overseer is smarter than the value-learning agent at each step, in which case narrow value learning is an implementation strategy for ALBA (and the entire setup inherits ALBA's difficulties). Another way is to figure out how the AI can pursue the instrumental values of an agent weaker than itself.

I am curious whether you are thinking more of ambitious or narrow value learning when you write posts like this one.

[-]Stuart_Armstrong9yΩ000

I'm thinking counterfactually (that's a subsequent post, which replaces the "stratified learning" one), so the thing that distinguishes ambitious from narrow learning is that narrow learning is the same in many counterfactual situations, while ambitious learning is much more floppy/dependent on the details of the counterfactual.

[-]jessicata9yΩ000

OK, I didn't understand this comment at all but maybe I should wait until you post on counterfactuals.

[-]Stuart_Armstrong9yΩ000

I look at these issues later on in the paper. And there are suggestions (mostly informal) that do have problems with bias and influence. Basically, almost all learning processes that involve human interaction.

As for CIRL, I think it's bias free in principle, but not in practice, for reasons roughly analogous to yours.

[-]Vanessa Kosoy9yΩ000

Hmm. When you say "human terminal values don’t actually exist at the initial time," what do you mean by "exist"? IMO, they exist in the sense that they are implicit in the algorithm the human brain is executing. They are causally prior to behavior, in the sense that the algorithm is causally prior to the output of the algorithm.

That is, they are implicit rather than explicit because, indeed, we can in principle interpret the same algorithm as a consequentialist in different, mutually inconsistent, ways. However, not all interpretations are born equal: some will be more natural, some more contrived. I expect that some sort of Occam's razor should select the interpretations that we would accept as "correct": otherwise, why is the concept of values meaningful at all?

Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present.

(This feels at least partially like an argument about definitions, but clarifying the definitions would probably be useful.)

[-]jessicata9yΩ000

I think I was previously confusing terminal values with ambitious values, and am now not confusing them.

Ambitious values are about things like how the universe should be in the long run, and are coherent (e.g. they're a utility function over physical universe states). Narrow values are about things like whether you're currently having a nice time and being in control of your AI systems, and are not coherent. Ambitious and narrow values can be instrumental or terminal.

The human cognitive algorithm is causally prior to behavior. It is also causally prior to human ambitious values. But human ambitious values are not causally prior to human behavior. Making human preferences coherent can only be done through a reflection process, so ambitious values come at the end of this process and can't go backwards in logical time to influence behavior.

I.e. algorithm $\to$ behavior, algorithm $\to$ ambitious values.

IRL says values $\to$ behavior, which is wrong in the case of ambitious values.
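To make the "values $\to$ behavior" assumption concrete, here is a minimal toy sketch of Bayesian reward inference under a Boltzmann-rational behavior model. It is an illustration only, not the IRL or CIRL algorithms from the literature: the candidate reward table, the function names, and the rationality parameter `beta` are all hypothetical. The point is just that the model treats a fixed reward function as the cause of every observed action, which is the assumption questioned above for ambitious values.

```python
import math

# Hypothetical toy setup: two candidate reward functions over three actions.
CANDIDATE_REWARDS = {
    "likes_a": {"a": 1.0, "b": 0.0, "c": 0.0},
    "likes_b": {"a": 0.0, "b": 1.0, "c": 0.0},
}


def action_likelihood(action, reward, beta=2.0):
    """P(action | reward) under a Boltzmann-rational model (values -> behavior)."""
    weights = {a: math.exp(beta * r) for a, r in reward.items()}
    return weights[action] / sum(weights.values())


def posterior_over_rewards(observed_actions, prior=None, beta=2.0):
    """Bayesian update over the candidate rewards given observed actions.

    The reward is assumed to be fixed before any behavior is observed,
    i.e. the learning target is treated as a fact about the initial state.
    """
    names = list(CANDIDATE_REWARDS)
    prior = prior or {name: 1.0 / len(names) for name in names}
    unnormalized = {}
    for name in names:
        p = prior[name]
        for action in observed_actions:
            p *= action_likelihood(action, CANDIDATE_REWARDS[name], beta)
        unnormalized[name] = p
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


posterior = posterior_over_rewards(["a", "a", "b"])
```

In this sketch, observing more "a" actions than "b" actions shifts the posterior toward `likes_a`. If the values are instead an output of the same algorithm that produces behavior, this generative model is misspecified, which is the objection in the comment.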

"Indeed, if these values only appear at the end of some long reflection process, then why should I care about the outcome of this process? Unless I already possess the value of caring about this outcome, in which case we again conclude that the values already effectively exist at present."

Caring about this reflection process seems like a narrow value.

See my comment here about why narrow value learning is hard.
