Intuitive examples of reward function learning?

TurnTrout (8y)

I think the constraint-based problems are more intuitive. As someone who thinks about this regularly, the classical examples had an abstract, alignment-theoretic texture, while the constraint-based ones seemed more relatable to something I’d actually be doing on a daily basis.

Which constraint-based example to use depends on the audience. If all your readers are familiar with the process of completing literature reviews, go for that; otherwise, the CEO problem seems most natural.

William_S (8y)

Variable constraint optimisation: wedding planner. Tasked with maximising the satisfaction of the couple getting married, subject to the constraint that the event fits within a given budget.

Stuart_Armstrong (8y)

That sounds like classical constrained optimisation. Does the wedding planner have power to increase the budget?
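A toy sketch may make the question concrete. In the classical version the budget is a fixed input; in the variable version the planner can also take an action that changes the budget. Everything here is a hypothetical illustration rather than anything from the post: the `OPTIONS` table, the function names, and the assumption that asking the couple for `extra` money costs `lobby_cost` satisfaction.

```python
from itertools import combinations

# Hypothetical wedding options: name -> (cost, satisfaction).
OPTIONS = {
    "venue": (50, 40),
    "band": (30, 25),
    "flowers": (20, 10),
    "fireworks": (40, 30),
}


def best_plan(budget):
    """Classical constrained optimisation: the budget is a fixed input."""
    best = (0, ())
    names = sorted(OPTIONS)
    # Brute-force search over every subset of options.
    for r in range(len(names) + 1):
        for plan in combinations(names, r):
            cost = sum(OPTIONS[n][0] for n in plan)
            value = sum(OPTIONS[n][1] for n in plan)
            if cost <= budget and value > best[0]:
                best = (value, plan)
    return best


def best_plan_variable(budget, extra=40, lobby_cost=5):
    """Variable constraint: the planner may also act on the budget itself.

    Assumption: persuading the couple to add `extra` to the budget costs
    `lobby_cost` satisfaction. Both numbers are made up for illustration.
    """
    stay = best_plan(budget)
    value, plan = best_plan(budget + extra)
    lobby = (value - lobby_cost, plan)
    return ("lobby", lobby) if lobby[0] > stay[0] else ("stay", stay)
```

With these made-up numbers, `best_plan(100)` picks the venue, band and flowers, while `best_plan_variable(100)` prefers to lobby for a bigger budget and then buy everything. Whether the planner has that second kind of action is what separates the two settings.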

Intuitive examples

Here I'll present some examples of reward function learning or variable constraint optimisation, and I'm asking readers which one seems the most intuitive to them and the easiest to explain to outsiders. You're also welcome to suggest new examples if you think they work better.

Classical value learning: human declarations determine the correctness of a given reward R. The reward encodes what food the human prefers, and some foods are much easier to get than others.

As above, but the reward encodes whether a domestic robot should clean the house or cook a meal.

As above, but the reward encodes the totality of human values in all environments.

Variable constraint optimisation: the agent is writing an unoriginal academic paper (or a patent), and must maximise the chance it gets accepted. The paper must include a literature review (constraints), but the agent gets to choose the automated process that produces the literature review.

Variable constraint optimisation: p-hacking. The agent chooses which hypothesis to formulate. It already knows something about the data, and its reward is the number of citations the paper gets.

Variable constraint optimisation: board of directors. The CEO must maximise share price, but its constraint is that the policy it formulates must be approved by the board of directors.

Variable constraint optimisation: retail. A virtual assistant guides the purchases of a customer. It must maximise revenue to the seller, subject to the constraint that the product bought must be given a four or five star review by the customer.
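The first example, classical value learning with food preferences, can be sketched in a few lines. This is a toy model under loudly stated assumptions: two candidate rewards, made-up probabilities of obtaining each food, and a hypothetical `declare` step in which the human's declaration settles which candidate is correct. None of these names or numbers come from the post.

```python
# Hypothetical candidate rewards: which food the human prefers.
CANDIDATE_REWARDS = {
    "prefers_cake": {"cake": 1.0, "bread": 0.0},
    "prefers_bread": {"cake": 0.0, "bread": 1.0},
}

# Assumed chance that the agent actually obtains each food:
# bread is much easier to get than cake.
P_OBTAIN = {"cake": 0.3, "bread": 0.9}


def expected_reward(food, belief):
    """Expected reward of fetching `food`, averaged over candidate rewards."""
    return P_OBTAIN[food] * sum(
        p * CANDIDATE_REWARDS[name][food] for name, p in belief.items()
    )


def best_food(belief):
    """Food with the highest expected reward under the current belief."""
    return max(P_OBTAIN, key=lambda food: expected_reward(food, belief))


def declare(name):
    """The human's declaration settles which candidate reward is correct."""
    return {candidate: float(candidate == name) for candidate in CANDIDATE_REWARDS}
```

Under these numbers, an agent that is unsure which reward is correct fetches bread, and switches to cake only once the human declares a preference for cake. Its achievable expected reward is 0.9 after a bread declaration and 0.3 after a cake declaration, which is one reading of why the ease of getting each food matters in the example.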
