cloud

Many alignment techniques work by training one model and deploying another

tl;dr - Steering vectors, inoculation prompting, and post-hoc honesty fine-tuning can all be understood as variants of one alignment strategy, which I call train-deploy mismatch. Each trains the model in one configuration and deploys it in another. As a result, these methods face the same tradeoff, between the relevance of...

Jul 19 · 77

Modular Pretraining Enables Access Control

by E.Roland and cloud

Full author list: Ethan Roland*, Murat Cubuktepe*, Erick Martinez*, Stijn Servaes, Keenan Pepper, Mike Vaiana, Diogo Schwerz de Lucena, Judd Rosenblatt, Addie Foote, Cem Anil, Alex Cloud (*equal contribution). tl;dr - Frontier AI models have knowledge that could be misused for nefarious purposes. To address this risk, we introduce Gradient Routed...

Jul 9 · 70

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

by TurnTrout and cloud

Through the MATS program, we (Alex Turner and Alex Cloud) help alignment researchers grow from seeds into majestic trees. We have fun, consistently make real alignment progress, and help scholars tap into their latent abilities. MATS summer '26 applications are open until January 18th! Team Shard in MATS 6.0 during...

Dec 26, 2025 · 42

[Paper] Output Supervision Can Obfuscate the CoT

by jacob_drori, lukemarks, cloud, and TurnTrout

We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated CoTs! The obfuscation happens in two ways: 1. When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. 2. Since later tokens are conditioned...

Nov 20, 2025 · 92

Omniscaling to MNIST

In this post, I describe a mindset that is flawed, and yet helpful for choosing impactful technical AI safety research projects. The mindset is this: future AI might look very different than AI today, but good ideas are universal. If you want to develop a method that will scale up...

Nov 8, 2025 · 104

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

by ariana_azarbal, Victor Gillioz, TurnTrout, and cloud

Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL which generates completions from prompts that discourage misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and finally reinforces the model on the recontextualized instruction-completion data. Due to...

Oct 14, 2025 · 144

cloud's Shortform

Sep 17, 2025 · 6

cloud

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Distillation Robustifies Unlearning

[Research Note] Optimizing The Final Output Can Obfuscate CoT

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
