Jonathan Kutasov — LessWrong

Model Spec Midtraining: Improving How Alignment Training Generalizes

by Chloe Li, Nevan Wichers, saraprice, Sam Marks, and Jonathan Kutasov

tl;dr We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec, teaching them how they should behave and why. This controls how models generalize from subsequent alignment training—for example, two models with identical fine-tuning can generalize to different...

Interpretability of SAE Features Representing Check in ChessGPT

Produced by Jon Kutasov and David Steinberg as a capstone project for ARENA. Epistemic status: 5 days of hacking, and there could be bugs we haven’t caught. Thank you to the TAs who helped us out, and to Adam Karvonen (author of the paper our work was based on) for...

Oct 5, 2024 • 27