Symbol/Referent Confusions in Language Model Alignment Experiments — LessWrong