Miles Turpin — LessWrong

Do models say what they learn?

by Andy Arditi, marvinli, Joe Benton, and Miles Turpin

This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to...

Mar 22, 2025 · 127

Reward hacking behavior can generalize across tasks

by Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison, and Ethan Perez

TL;DR: We find that reward hacking generalization occurs in LLMs in a number of experimental settings and can emerge from reward optimization on certain datasets. This suggests that when models exploit flaws in supervision during training, they can sometimes generalize to exploit flaws in supervision in out-of-distribution environments. Abstract Machine...

May 28, 2024 · 86

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

[Twitter thread] I'm not going to add much additional commentary at the moment and will just let people check out the paper! But to give a bit more context: This paper is building off prior work we have done showing that chain-of-thought explanations can be misleading, which I wrote about...

Mar 11, 2024 · 16

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Introduction: A team working with Owain Evans recently released “Taken out of context: On measuring situational awareness in LLMs”. I think this work is very exciting. We know that our models currently display some alignment failures like sycophancy. We don’t currently have a great sense of how much of this...

Oct 3, 2023 · 31

Unfaithful Explanations in Chain-of-Thought Prompting

Introduction: I recently released “Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting” with collaborators Julian Michael, Ethan Perez, and Sam Bowman. For a summary of the paper, you can check out this twitter thread. In this post, I briefly elaborate on motivations/implications relevant to alignment...

Jun 3, 2023 · 43