Don't cancel out your rewards!
Method Policy gradient methods[1] have two components to them - the probability and the reward . The goal is to maximize the reward and increasing the probability of the action that leads to it, from the state. Different algorithms have tweaked the reward weight in the objective to try and...
Nov 11, 20251