I'm a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable, since you're training the model to ignore an input feature that shouldn't influence the answer. The model isn't hiding anything about itself; it's just becoming robust to an input perturbation.
But for eval awareness, if you train for consistency across input perturbations that, e.g. contain "you're in an evaluation", or something like "this is a hypothetical scenario", and train the model to pr...
This is an interim report produced as part of the Summer 2025 LASR Labs cohort, supervised by David Lindner. For the full version, see our paper on the LASR website.
As this is an ongoing project, we would like to use this post both as an update, and as a call for feedback: What flaws do you see in our methodology? What seems unrealistic or contrived about our scenarios? What are we missing? Which research directions should we prioritize?
We evaluated the scheming propensity of LLM agents under realistic deployment conditions, focusing on scheming behavior that is consistent with agents pursuing misaligned instrumental goals, particularly self-preservation.
We base our experiments on a scheming incentives framework to systematically study the conditions under which agents engage in scheming behavior, considering both...
Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B, while in the universal jailbreak paper they report >95%. Is the difference because you have a significantly stricter criterion of success than they do? I assume that with the adversarial suffixes, the model usually regresses to refusal after successfully generating the target string ("Sure, here's how to build a bomb. Actually I can't...")?
It would also be interesting to apply MELBO on language models that have already been trained with LAT. Adversarial attacks on vision models look significantly more meaningful to humans when the vision model has been adversarially trained, and since MELBO is basically a latent adversarial attack we should be able to elicit more meaningful behavior on language models trained with LAT.
Interesting. I'm thinking that with "many cases" you mean cases where either manually annotating the data over multiple rounds is possible (cheap), or cases where the model is powerful enough to label the comparison pairs, and we get something like the DPO version of RLAIF. That does sound more like RL.
I'm not sure what you mean, in DPO you never sample from the language model. You only need the probabilities of the model producing the preference data, there isn't any exploration.
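To make the no-sampling point concrete, here is a toy sketch of the per-pair DPO objective (illustrative code; the function and argument names are my own, not from the DPO paper or any library). Everything it needs is the log-probabilities the policy and reference model assign to the *fixed* chosen and rejected completions; nothing is ever sampled from the policy.

```python
import math

def dpo_loss(logp_w_policy, logp_w_ref, logp_l_policy, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are log-probs of fixed preference data under the policy and
    a frozen reference model; no generation/exploration is involved.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # completion (relative to the reference) than the rejected one.
    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Gradients flow through the policy's log-probs of the given completions, so training looks like (weighted) supervised learning on a static dataset rather than RL with rollouts.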
Given that Direct Preference Optimization (DPO) seems to work pretty well and has the same global optimizer as the RLHF objective, I would be surprised if it didn't shape agency in a similar way to RLHF. Since DPO is not considered reinforcement learning, this would be more evidence that RL isn't uniquely suited to produce agents or increase power-seeking.
Hi, thanks for the response :) I'm not sure what distinction you're making between utility and reward functions; as far as I can tell we're referring to the same object, the thing which is changed in the 'retargeting' process, the parameters theta. Feel free to correct me if the paper distinguishes between these in a way I'm forgetting; I'll be using "utility function", "reward function", and "parameters theta" interchangeably, and will correct that if so.
For me utility functions are about decision-making, e.g. utility-maximization, while the rewa...
I'm the author.
Colloquially, they're more of the flavor "for a given optimizing-process, training it on most utility functions will cause the agent to take actions which give it access to a wide-range of states".
This refers to the fact that most utility functions are retargetable. But the most important part of the power-seeking theorems is the actual power-seeking, which is proven in the appendix of Parametrically Retargetable Decision-Makers Tend To Seek Power, so I don't agree with your summary.
...[...] the definition you give of "power" as expected
I think such consistency training on outputs would result in a model that's basically always eval-aware on your training distribution, and after that point your consistency training has no gradient. Then you have this super eval-aware model, you run it on your (held-out, OOD) evaluations, and you hope that there it is eval-aware in just the right way, so that you can draw conclusions about the model behaving the same whether or not it is in an eval.
Is this the intent, or do you have a specific training method in mind that avoids this?
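The vanishing-gradient point can be seen in a toy sketch of a consistency objective (my own illustrative code, not a method from the post): penalize divergence between the model's next-token distribution on a clean prompt and on the same prompt with an eval cue inserted.

```python
import math

def consistency_loss(p_clean, p_perturbed):
    """KL(p_clean || p_perturbed) between next-token distributions on a
    clean prompt vs. the same prompt with an eval cue inserted.

    Toy sketch of a consistency objective: once the two distributions
    match, the loss is zero and so is its gradient, which is the
    concern raised above.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(p_clean, p_perturbed) if p > 0)
```

Once the model produces identical outputs on perturbed prompts across the training distribution, the loss sits at its minimum of zero and provides no further signal, regardless of whether the model is internally eval-aware.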