We show that training against a monitor that only sees outputs (not CoTs) can cause obfuscated[1] CoTs! The obfuscation happens in two ways: 1. When a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. 2. Since later tokens are conditioned...
Produced as part of MATS 8.0 under the mentorship of Alex Turner and Alex Cloud. This research note presents some early results on which we are seeking feedback. TL;DR: We train language models with RL in toy environments. We show that penalizing some property of the output is sufficient...
Narmeen ideated, developed, and validated K-steering at Martian. Luke generated the baselines and figures and wrote this blog post. Amir proposed the research direction and supervised the project. The full interactive blog will be available closer to the publication of the complete paper on the Martian website. TL;DR: We introduce K-steering,...
This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to its direction throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design....
Zvi recently asked on Twitter: > If someone was founding a new AI notkilleveryoneism research organization, what is the best research agenda they should look into pursuing right now? To which Eliezer replied: > Human intelligence augmentation. And then elaborated: > No time for GM kids to grow up, so:...
The Direct Preference Optimization (DPO) paper promises a simpler and more efficient alternative to proximal policy optimization, one that avoids the reward-modeling phase and thus optimizes directly for the preferences expressed in preference data. This is achieved through the loss function:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where: * x is some...
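The loss above can be sketched directly in code. Below is a minimal, dependency-free Python sketch for a single preference pair; the function name and per-pair formulation are our own illustration, not the paper's reference implementation, and each argument is assumed to be the summed log-probability of the full completion under the corresponding model.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, for illustration).

    Arguments are summed log-probabilities of the chosen (w) and rejected (l)
    completions under the policy and reference models.
    """
    # beta-scaled margin between the policy/reference log-ratios
    # of the chosen and rejected completions
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log sigma(margin): small when the policy prefers y_w more than
    # the reference does, large when it prefers y_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that only the log-ratios against the frozen reference model enter the loss, which is what lets DPO skip training a separate reward model: in practice one would batch these terms and average, but the per-pair form above is term-by-term the expression inside the expectation.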