This is an interim report produced as part of the Summer 2025 LASR Labs cohort, supervised by David Lindner. For the full version, see our paper on the LASR website.
As this is an ongoing project, we would like to use this post both as an update and as a call for feedback: What flaws do you see in our methodology? What seems unrealistic or contrived about our scenarios? What are we missing? Which research directions should we prioritize?
Executive Summary
We evaluated the scheming propensity of LLM agents under realistic deployment conditions, focusing on scheming behavior that is consistent with agents pursuing misaligned instrumental goals, particularly self-preservation.
We base our experiments on a scheming incentives...
I think such consistency training on outputs would result in a model that's basically always eval-aware on your training distribution, and at that point your consistency training has no gradient left. Then you have this super eval-aware model, you run it on your (held-out, OOD) evaluations, and you hope that it's eval-aware there in just the right way, so that you can still draw conclusions about whether the model would behave the same in an eval or outside of one.
Is this the intent, or do you have a specific training method in mind that avoids this?
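To make the "no gradient" point concrete, here is a rough sketch of what I mean by consistency training on outputs. This is my own toy illustration, not something from the post: the KL-based loss, the paired eval/deployment prompts, and all names are assumptions on my part. The point is just that once the model's outputs on the two prompt variants already match (e.g. because it treats everything as an eval), the loss is zero and so is the gradient:

```python
# Toy sketch (my assumptions, not the authors' method): "consistency training
# on outputs" as a KL penalty between the model's next-token distributions on
# an eval-flavoured prompt and a matched deployment-flavoured prompt.
import torch
import torch.nn.functional as F

def consistency_loss(logits_eval: torch.Tensor, logits_deploy: torch.Tensor) -> torch.Tensor:
    """KL(p_deploy || p_eval); the deployment distribution is treated as the fixed target."""
    log_p_eval = F.log_softmax(logits_eval, dim=-1)
    p_deploy = F.softmax(logits_deploy, dim=-1).detach()
    return F.kl_div(log_p_eval, p_deploy, reduction="batchmean")

# The worry: once the model responds identically to both prompt variants
# (e.g. because it's eval-aware everywhere on the training distribution),
# the two logit tensors coincide, the loss is ~0, and the gradient vanishes.
logits = torch.randn(4, 32, requires_grad=True)   # stand-in for next-token logits
loss = consistency_loss(logits, logits)           # model already "consistent"
loss.backward()
print(loss.item())                     # ~0.0
print(logits.grad.abs().max().item())  # ~0.0: no training signal left
```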