TL;DR We propose a new post-training method that makes LLMs more verbose reasoners by teaching the model to truncate its forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available in the hidden layers for easy-to-predict tokens. We're looking for collaborators to help continue this project. If you're interested, reach out to us!
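As a rough illustration of what "truncating forward passes early" can look like at inference time, here is a minimal confidence-based early-exit sketch in PyTorch. The class name, the `exit_threshold` parameter, and the max-probability exit rule are illustrative assumptions rather than the post-training method proposed here: the idea shown is simply that a token stops receiving updates from later layers once an intermediate readout of its representation is already confident.

```python
# A minimal sketch of confidence-based early exit, assuming a toy stack of
# transformer encoder layers with a shared readout head. Names and the exit
# rule (max-probability threshold) are illustrative assumptions, not the
# post-training method proposed in the post.
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    def __init__(self, d_model=64, n_layers=8, vocab_size=100, exit_threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.readout = nn.Linear(d_model, vocab_size)  # shared exit / unembedding head
        self.exit_threshold = exit_threshold

    def forward(self, x):
        # x: (batch, seq, d_model) embedded tokens
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        exit_layer = torch.full(x.shape[:2], len(self.layers), device=x.device)
        for i, layer in enumerate(self.layers):
            h = layer(x)
            # Exited tokens keep their frozen representation; only active tokens update.
            # (The layer still runs over every position, so this visualises per-token
            # depth rather than an actual compute saving.)
            x = torch.where(active.unsqueeze(-1), h, x)
            # A token "exits" once its intermediate prediction is confident enough.
            probs = self.readout(x).softmax(dim=-1)
            confident = probs.max(dim=-1).values > self.exit_threshold
            exit_layer[active & confident] = i + 1
            active = active & ~confident
        return self.readout(x), exit_layer

model = EarlyExitStack()
dummy = torch.randn(2, 16, 64)   # stand-in for embedded input tokens
logits, exit_layer = model(dummy)
print(exit_layer)                # effective depth used for each token
```

In this toy version every layer still runs over every position, so the sketch only tracks how much depth each token "needed"; realising actual compute savings, and training the model to rely on fewer layers for easy tokens, is the harder part that the post is about.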
Karthik, Liza, and Mariia are equal-contribution first authors; the order was determined by coin flips. Puria, Cameron, and Edward are equal-contribution mentors. This work was done through the Mentorship for Alignment Research Students (MARS) 3.0 program, in a Geodesic Research stream.
Architectures for Externalisation of Cognition
Many AI safety researchers are converging around the idea that the chain-of-thought...