Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight — LessWrong