Reward function learning: the learning process

> Thus the observation function O is deterministic, but not one-to-one.

Do you mean "onto" rather than "one-to-one"? (If the function is not one-to-one, which two inputs map to the same output?)

> when it is no longer Bayesian if it asks the infant

Do you mean "then" instead of "when"?

> On point a), anything that involves human feedback is generally influenceable and riggable, since the human feedback is affected by the agent's actions. This includes, for example, most versions of the approval directed agent.

I think this is only a big problem if the agent models the effects of its physical actions on the range of feedback the human is likely to give. In poetic terms, I'm optimistic about a dualistic approach where value learning and taking action in the world exist in "non-overlapping magisteria". This could be enforced at the architecture level. It might also help with the infant problem, if enforcing a division like this lets us better control the manner in which the AI retrieves information about our values.

For a concrete example of how this and various other cool things might be achieved, see this. My use of formalism is a bit different than yours: I only talk about MDPs, never POMDPs. Instead of the reward being an aspect of the state that the AI needs to discover, I treat the agent's beliefs about the reward as an aspect of the state that is known with certainty. The transition model for the reward is then viewed as nondeterministic from the AI's perspective.

Stuart_Armstrong (8y)

> For a concrete example of how this and various other cool things might be achieved, see this.

I haven't given that a deep read, so apologies if I misunderstand, but I don't see how that post solves the issues. If you have an update rule and prior for "preference beliefs", then this is just another $ρ$.

It would be nice if that $ρ$ were uninfluenceable and good, but I don't see why it would be. The problem is that there is no abstract fact about the universe that corresponds to "our preferences", which we just need to point the AI towards.

When an AI asks a human about their preferences, three things happen:

1) The AI learns something about human preferences
2) The human learns something about their own preferences
3) The human establishes new preferences they didn't have before

The problem is that these three things can't be cleanly separated, and 3) is absolutely essential because of how messy, contradictory and underdefined human preferences are. But 3) (and to a lesser extent 2)) is also how AIs can manipulate human preferences. And again, there is no clear concept of "manipulation" by which it can be distinguished from "helping the human sort out their preferences".

Also, I noted that you used "never deceive anyone" as part of the aims. This is a very hard problem; I think it might be as hard as getting human values right (though I feel the two problems are to some extent separate; neither implies the other). See https://agentfoundations.org/item?id=1261

> where value learning and taking action in the world exist in "non-overlapping magisteria".

This I'm more optimistic about. My version of this is to have $π_{0}$ be the policy of a pure learning agent - one that learns, but doesn't try to maximise. Then the actual agent tries to maximise the value of the reward it would have computed, had it followed $π_{0}$. This "counterfactual learning" is uninfluenceable; see https://agentfoundations.org/item?id=1294

The challenge, then, is to define this pure learning agent...

John_Maxwell (8y)

> The problem is that these three things can't be cleanly separated

Suppose I train a regression that takes a state of the world as the input and attempts to predict the amount of utility I'd assign to that state of the world as an output. I provide labeled data in the form of (world state, utility) pairs. Things about me understanding my preferences better and establishing new preferences don't really enter into it. The output is completely determined by the training data I provide for the regression algorithm. This is what provides clean separation. See also the concept of "complete mediation" in computer security.
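As a rough illustration of the regression described above, here is a minimal sketch, assuming a linear model and a tiny hand-made dataset. The feature vectors, the utility labels, and the helper names `fit_utility_regression` and `predict_utility` are all hypothetical. The only point is that the fitted predictor is a function of the labeled pairs and nothing else.

```python
import numpy as np

def fit_utility_regression(states, utilities):
    """Least-squares fit of utility as a linear function of state features."""
    # Append a constant column so the model can learn an intercept.
    X = np.column_stack([states, np.ones(len(states))])
    weights, *_ = np.linalg.lstsq(X, utilities, rcond=None)
    return weights

def predict_utility(weights, states):
    """Predict utility for each state using previously fitted weights."""
    X = np.column_stack([states, np.ones(len(states))])
    return X @ weights

# Hypothetical labeled data: each row is a world state with two features,
# and each label is the utility the overseer assigned to that state.
states = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
utilities = np.array([1.0, 2.0, 3.0, 5.0])

# Fitting twice on the same labeled pairs yields the same predictor:
# nothing outside the training data enters the computation.
w1 = fit_utility_regression(states, utilities)
w2 = fit_utility_regression(states, utilities)
predictions = predict_utility(w1, states)
```

Whether a real value-learning system would keep this property once labels are requested interactively is exactly what the rest of the thread argues about; the sketch only covers the fixed-dataset case.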

It might be helpful to know the point I'm trying to make is extremely simple. Like, Netflix can't recommend movies to me based on my Blockbuster rental history, unless Netflix's recommendation algorithms are using Blockbuster's rental data. This is how we can get clean separation between my Netflix recommendations and my Blockbuster recommendations.

> 3) is absolutely essential because of how messy, contradictory and underdefined human preferences are. But 3) (and to a lesser extent 2)) is also how AIs can manipulate human preferences.

It's true that 3 is absolutely essential. My argument is that 3 is not something the FAI's value module needs to forecast. It's sufficient for the FAI to act on its current best guess about our values and stay open to the changes we make, whatever those changes may be. In my proposal, the value module also represents our desire to e.g. be able to modify the FAI--so by acting according to its current best guess about our values, the FAI remains corrigible. (To a large extent, I'm treating "learning our values" and "learning what it means to be corrigible" as essentially the same problem, to be approached in the same way.)

> And again, there is no clear concept of "manipulation" by which it can be distinguished from "helping the human sort out their preferences".

In my proposal, "helping the human sort out their preferences" is achieved using a specific technical criterion: Request labels for training data points which have maximal value of information. This sorts out the overseer's preferences (insofar as they are decision-relevant) without being particularly manipulative.

> Also, I noted that you used "never deceive anyone" as part of the aims. This is a very hard problem; I think it might be as hard as getting human values right (though I feel the two problems are to some extent separate; neither implies the other). See https://agentfoundations.org/item?id=1261

As I said previously, I think it might make sense to view corrigibility learning ("never deceive anyone") and value learning ("reduce suffering") as manifestations of the same deep problem. That is the problem of creating powerful machine learning techniques that can make accurate generalizations and well-calibrated probabilistic judgements when given small amounts of labeled unstructured data. Once we have that, I think it'll be easy to implement active learning in a way that works really well, and then we'll be able to do value learning and corrigibility learning using essentially the same approach.

Stuart_Armstrong (8y)

>Request labels for training data points which have maximal value of information.

I can see many ways this can be extremely manipulative. If you request a series of training data points whose likely result, once the human answers them, is the conclusion "the human wants me to lobotomise them into a brainless drugged pleasure maximiser and never change them again", then your requests have maximal value of information. Therefore if such a series of training data points exists, the AI will be motivated to find it - and hence manipulate the human.

John_Maxwell (8y)

> If you request a series of training data points whose likely result, once the human answers them

If you already know how the human is going to answer, then it's not high value of information to ask. "If you can anticipate in advance updating your belief in a particular direction, then you should just go ahead and update now. Once you know your destination, you are already there."

Suppose it is high value of information for the AI to ask whether we'd like to be lobotomized drugged pleasure maximizers. In that case, it's a perfectly reasonable thing for the AI to ask: We would like for the AI to request clarification if it places significant probability mass on the possibility that we assign loads of utility to being lobotomized drugged pleasure maximizers! The key question is whether the AI would optimize for asking this question in a manipulative way--a way designed to change our answers. An AI might do this if it's able to anticipate the manipulative effects of its questions. Luckily, making it so the AI doesn't anticipate the manipulative effects of its questions appears to be technically straightforward: If the scorekeeper operates by conservation of expected evidence, it can never believe any sequence of questions will modify the score of any particular scenario on average.
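To make the conservation of expected evidence claim concrete, here is a small sketch, assuming a single yes/no question and a scorekeeper that updates by Bayes' rule. The function names and the example probabilities are hypothetical. It checks only the arithmetic identity that the prior equals the expected posterior under the scorekeeper's own model of the answers; it says nothing about whether asking physically changes the human.

```python
def posterior(prior, p_yes_if_h, p_yes_if_not_h, answer):
    """Bayes update of P(H) after observing a yes (True) or no (False) answer."""
    like_h = p_yes_if_h if answer else 1.0 - p_yes_if_h
    like_not_h = p_yes_if_not_h if answer else 1.0 - p_yes_if_not_h
    evidence = prior * like_h + (1.0 - prior) * like_not_h
    return prior * like_h / evidence

def expected_posterior(prior, p_yes_if_h, p_yes_if_not_h):
    """Average the posterior over the answers the scorekeeper itself predicts."""
    p_yes = prior * p_yes_if_h + (1.0 - prior) * p_yes_if_not_h
    return (p_yes * posterior(prior, p_yes_if_h, p_yes_if_not_h, True)
            + (1.0 - p_yes) * posterior(prior, p_yes_if_h, p_yes_if_not_h, False))

# Hypothetical numbers: H is "the human wants outcome X".
prior = 0.3
after_yes = posterior(prior, 0.9, 0.2, True)    # belief rises after a "yes"
after_no = posterior(prior, 0.9, 0.2, False)    # belief falls after a "no"
average = expected_posterior(prior, 0.9, 0.2)   # matches the prior
```

Each individual answer moves the belief, but the scorekeeper cannot expect in advance that asking will move it in a particular direction.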

There are 3 cases here:

1) The AI assigns a very low probability to us desiring lobotomy. In this case, there is no problem: We don't actually want lobotomy, and it would be very low value of information to ask about lobotomy (because the chance of a "hit", where we say yes to lobotomy and the AI learns it can achieve lots of utility by giving us lobotomy, is quite low from the AI's perspective).
2) The AI is fairly uncertain about whether we want lobotomy. It believes we might really want it, but we also might really not want it! In that case, it is high VoI to ask us about lobotomy before taking action. This is the scenario I discuss under "Smile maximization case study" in my essay. The AI may ask us about the version of lobotomy it thinks we are most likely to want, if that is the highest VoI thing to ask about, but that still doesn't seem like a huge problem.
3) The AI assigns a very high probability to us desiring lobotomy and doesn't think there's much of a chance that we don't want it. In that case, we have lost. The key challenge for my proposal is to figure out how to prevent the AI from entering a state where it has confident yet wildly incorrect beliefs about our preferences. From my perspective, FAI boils down to a problem of statistical epistemology.
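The three cases can be illustrated with a toy value-of-information calculation. This is a sketch under strong assumptions that are not in the original comment: one irreversible action, a truthful and costless answer, and symmetric hypothetical payoffs `gain` and `loss`.

```python
def value_of_information(p_wants, gain=10.0, loss=10.0):
    """VoI of asking before deciding whether to take an irreversible action.

    p_wants: the AI's probability that the human wants the action.
    gain:    utility if the action is taken and the human wanted it.
    loss:    utility lost if the action is taken and the human did not want it.
    """
    # Without asking, the AI acts only if the expected utility is positive.
    best_without_asking = max(p_wants * gain - (1.0 - p_wants) * loss, 0.0)
    # With a truthful answer, the AI acts exactly when the human wants it.
    best_after_asking = p_wants * gain
    return best_after_asking - best_without_asking

voi_low = value_of_information(0.01)    # case 1: asking is worth little
voi_mid = value_of_information(0.50)    # case 2: asking is worth a lot
voi_high = value_of_information(0.99)   # case 3: asking is worth little again
```

In this toy model the third case is the dangerous one because the low VoI comes from confidence rather than from correctness: if the 0.99 belief is wrong, the AI acts without ever asking.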

Stuart_Armstrong (8y)

>If you already know how the human is going to answer, then it's not high value of information to ask.

That's the entire problem, if "ask a human" is programmed as an endorsement of this being the right path to take, rather than as a genuine need for information.

>If the scorekeeper operates by conservation of expected evidence, it can never believe any sequence of questions will modify the score of any particular scenario on average.

That's precisely my definition for "unriggable" learning processes, in the next post: https://www.lesswrong.com/posts/upLot6eG8cbXdKiFS/reward-function-learning-the-learning-process

John_Maxwell (8y)

> That's precisely my definition for "unriggable" learning processes, in the next post: https://www.lesswrong.com/posts/upLot6eG8cbXdKiFS/reward-function-learning-the-learning-process

That's a link to this post, right? ;)

Stuart_Armstrong (8y)

Ooops, yes! Sorry, for some reason, I thought this was the post on the value function.

Stuart_Armstrong (8y)

> Do you mean "onto" rather than "one-to-one"? (If the function is not one-to-one, which two inputs map to the same output?)

The observation function is onto, and not one-to-one. For most states $s \in S$, the states $s \times \{\text{cook}\}$ and $s \times \{\text{wash}\}$ will map to the same observation.
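A tiny sketch of that distinction, assuming a simplified world in which the human's intention is never visible (the reply above says only "most states", so this is a stricter toy case). The base state names are hypothetical.

```python
from itertools import product

# Hypothetical base states and the two hidden human intentions.
base_states = ["kitchen_idle", "sink_idle"]
intentions = ["cook", "wash"]

# Full states pair a base state with the hidden intention.
states = list(product(base_states, intentions))

def observe(state):
    """Deterministic observation that drops the hidden intention."""
    base, _intention = state
    return base

observations = set(base_states)
image = {observe(s) for s in states}

is_onto = image == observations            # every observation is produced
is_one_to_one = len(image) == len(states)  # distinct states stay distinct
collision = observe(("kitchen_idle", "cook")) == observe(("kitchen_idle", "wash"))
```

Here `is_onto` holds while `is_one_to_one` does not, because the cook and wash variants of each base state collapse to one observation.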

> Do you mean "then" instead of "when"?

Thanks, I've now corrected that.

ESRogs (8y)

Quotes in your comments aren't showing up as quotes for me. Are you putting a space between the greater-than sign, and the first character of the quote?

Edit: Meant to put this under one of the comments. Didn't think this was important enough to be top-level. Can't move or delete though.