Biased reward-learning in CIRL — LessWrong