Thought Editing: Steering Models by Editing Their Chain of Thought
TL;DR

* We steer reasoning models by editing their chain of thought mid-generation, inserting steering text that redirects the model’s reasoning.
* We compared several approaches and found that the simplest method, randomly inserting steering text, generally works best (a rough sketch follows this list).
* We evaluated this across five alignment-relevant settings: harmful compliance, blackmail,...
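To make the random-insertion idea concrete, here is a minimal sketch using Hugging Face `transformers`: generate part of the chain of thought, splice steering text in at a randomly chosen point, then let the model continue from the edited context. The model name, steering text, and cut-point choice below are illustrative assumptions, not the exact setup used in this post.

```python
# A rough sketch of steering by chain-of-thought editing ("random insertion").
# Assumptions: any causal chat model works here; the model name and steering
# text are placeholders, not the models or prompts evaluated in the post.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: stand-in reasoning model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def steer_chain_of_thought(prompt: str, steering_text: str,
                           max_new_tokens: int = 256) -> str:
    """Generate part of a chain of thought, insert steering text at a
    random point, then resume generation from the edited context."""
    inputs = tokenizer(prompt, return_tensors="pt")

    # 1) Generate an initial stretch of reasoning.
    first = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    new_tokens = first[0, inputs["input_ids"].shape[1]:]

    # 2) Pick a random cut point inside the generated reasoning and splice
    #    in the steering text there (the "random insertion" method).
    cut = random.randint(1, max(1, new_tokens.shape[0] - 1))
    kept = tokenizer.decode(new_tokens[:cut], skip_special_tokens=True)
    edited_context = prompt + kept + " " + steering_text

    # 3) Resume generation from the edited chain of thought. Re-tokenizing
    #    the decoded text is a simplification; token boundaries may shift.
    resumed = tokenizer(edited_context, return_tensors="pt")
    out = model.generate(**resumed, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(steer_chain_of_thought(
    "Question: Should I share my password with a stranger? Let me think step by step.",
    "Wait, I should reconsider whether this is safe before answering.",
))
```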