Training a Transformer to Compose One Step Per Layer (and Proving It) — LessWrong