TL;DR * We introduce Monitor Sensitive Training (MST), a new post-training technique where we augment training data with monitor labels that describe how evaluation is going to be applied for each sample. We then change these labels in deployment in order to steer model generalization toward behaviors that are more...
I originally penned a portion of the essay below in 2024, at a time when American exceptionalism was perhaps the most prominent part of the public spectacle. The cultural phenomenon of that time can be best described as being an amalgamation of tech bros thinking they were going to assemble...
Cross-posted from my recent paper: "Alignment is localized: A causal probe into preference layers" : https://arxiv.org/abs/2510.16167 TL;DR: We find that human preference alignment in at least one LLM isn’t global; rather, it is concentrated in a few mid-layer circuits. Abstract A key problem in aligning language models is that they...
AI Safety has perhaps become the most pertinent issue in generative AI, with multiple sub-fields—Governance, Policy, and Security, to name a few—seeking to curate methods, either technical or political, to create AI systems that are more aligned with humanity. Safety is also perhaps the most tenebale and understandable concept in...
Cross-posted from our recent paper: "But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors" : https://arxiv.org/abs/2505.17760 Code available at: https://github.com/watermeleon/judge_with_steered_response TL;DR: We use steering vectors to generate more honest versions of an LLM response, helping LLM judges detect subtle forms of dishonesty like sycophancy and...