Steering Evaluation-Aware Models to Act Like They Are Deployed
🐦Tweet thread, 📄arXiv Paper, 🖥️Code, 🤖Evaluation Aware Model Organism TL, DR:; * We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested. *...