andrq

92

Ω

24

1

2

8mo

andrq — LessWrong

Steering Evaluation-Aware Models to Act Like They Are Deployed

🐦Tweet thread, 📄arXiv Paper, 🖥️Code, 🤖Evaluation Aware Model Organism

TL;DR:
* We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested.
* ...

Oct 30, 2025•61

Discovering Backdoor Triggers

Authors: Andrew Qin*, Tim Hua*, Samuel Marks, Arthur Conmy, Neel Nanda

Andrew and Tim are co-first authors.

This is a research progress report from Neel Nanda’s MATS 8.0 stream. We are currently no longer pursuing this research direction, and encourage others to build on these preliminary results.

tl;dr. We study...

Aug 19, 2025•57