letring — LessWrong

letring

4y

Reward is not the optimization target

Sorry if I have misunderstood the point of your post, but I'm surprised that Bellman's optimality equation (BOE) was not mentioned anywhere. From Sutton's book on the topic I understood that once policy iteration in vanilla RL has converged to the point where the BOE holds, the agent is maximizing "value", which I would define in words as something like "the expected discounted cumulative reward". Now, before one turns off a student new to the topic by giving a precise definition of those terms right away, I can see why he might have contracted that a bit u...
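To make the claim about policy iteration and the BOE concrete, here is a minimal sketch of tabular policy iteration. The two-state MDP, the discount factor, and every function name below are hypothetical illustrations, not anything taken from the post or from Sutton's book. It only shows that, once the greedy policy stops changing, the value function satisfies the Bellman optimality equation up to numerical tolerance.

```python
# Hypothetical 2-state, 2-action MDP, made up purely for illustration.
# P[s][a] is a list of (probability, next_state, reward) transitions.
GAMMA = 0.9  # assumed discount factor
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}


def q_value(V, s, a):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation: sweep until the largest update is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = q_value(V, s, policy[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}
    while True:
        V = evaluate(policy)
        improved = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
        if improved == policy:
            return policy, V
        policy = improved


def bellman_residual(V):
    """Largest violation of V(s) = max_a Q(s, a) over all states."""
    return max(abs(V[s] - max(q_value(V, s, a) for a in P[s])) for s in P)


policy, V = policy_iteration()
print(policy, V, bellman_residual(V))
```

In this toy setting, "value" is exactly the expected discounted cumulative reward under the converged policy, which is the quantity the comment describes. Whether that picture carries over to learned, non-tabular agents is a separate question that this sketch does not address.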