Training Qwen-1.5B with a CoT legibility penalty — LessWrong