A scheme to credit hack policy gradient training — LessWrong