[Paper] Does Self-Evaluation Enable Wireheading in Language Models? — LessWrong