Reward is not the optimization target
letring · 3y · 10

Sorry if I have misunderstood the point of your post, but I'm surprised that Bellman's optimality equation is nowhere mentioned. From Sutton's book on the topic I understood that once the policy iteration of vanilla RL has converged to the point where the Bellman optimality equation holds, the agent is maximizing "value", which I would define in words as something like "the expectation of discounted, cumulative reward". I can see why, rather than putting off a student new to the topic with a precise definition of those terms right away, Sutton might have contracted that, a bit unfortunately, to "a numerical reward signal".
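
For reference, the Bellman optimality equation and the notion of value I have in mind, in standard Sutton & Barto notation (this is my own restatement, not a quote from the post):

$$v_*(s) \;=\; \max_a \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr],
\qquad
v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right].$$
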
I don't feel competent to comment on how the picture is complicated in deep RL by the fact that the value function may only be learned approximately. But it doesn't seem too far-fetched to me that the agent will still end up maximizing some "value", where perhaps the notion of expectation needs to be modified a bit.
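
To make "converged to the point where the Bellman optimality equation holds" concrete in the tabular case (which the deep-RL setting only approximates), here is a minimal value-iteration sketch; the toy MDP, its numbers, and the variable names are mine, purely for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers, just for illustration).
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality backup until the residual is tiny,
# i.e. until the Bellman optimality equation holds (approximately). The policy
# that is greedy w.r.t. this V maximizes expected discounted cumulative reward.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a] = r(s, a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)          # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)
print("V* =", V, "greedy policy =", greedy_policy)
```
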
