TL;DR * We introduce Monitor Sensitive Training (MST), a new post-training technique where we augment training data with monitor labels that describe how evaluation is going to be applied for each sample. We then change these labels in deployment in order to steer model generalization toward behaviors that are more...
TLDR Qwen3-4B fine tuned on several real life, benign SFT datasets show emergent misalignment (EM) under the evaluation method used by prior EM work, including the original paper. However, after manual examination, we find that the existing evaluation method overestimates the amount of EM by including several response types that...
This is an early stage research update. We love feedback and comments! TL;DR: * It’s important to benchmark frontier models on non-engineering skills required for AI R&D in order to comprehensively understand progress towards full automation in frontier labs. * One of these skills is research taste, which includes the...
This post is an organizational update from Georgia Tech’s AI Safety Initiative (AISI) and roughly represents our collective view. In this post, we share lessons & takes from the 2024-25 academic year, describe what we’ve done, and detail our plans for the next academic year. Introduction Hey, we’re organizers of...
TLDR: the AI safety initiative at Georgia Tech recently hosted an AI safety focused track at the college's flagship AI hackathon. In this post I share how it went and some of our thoughts. Overview Hey! I’m Yixiong, co-director of Georgia Tech’s AI safety student group. We recently hosted an...