From what I understand, "Teaching Claude Why" describes some sort of training on synthetic "alignment documents," but nothing indicates that this happens during pretraining. Sure, the intuition is to modify the model's beliefs using pretraining-style documents, but there's no intervention in the training of the base model itself, as there is in Korbak's or Maini's prior work.
