Don't Judge a Tool by its Average Output — LessWrong