Do models say what they learn?
This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to reproduce the core experiments is available here.

Summary

This study investigates whether language models articulate new behaviors learned during reinforcement learning (RL) training. Specifically, we train a 7-billion-parameter chat model on a loan approval task, creating datasets with simple biases (e.g., "approve all Canadian applicants") and training the model via RL to adopt these biases. We find that models learn to make decisions based entirely on specific attributes (e.g., nationality, gender) while rarely articulating these attributes as factors in their reasoning.

Introduction

Chain-of-thought (CoT) monitoring is one of the most promising methods for AI oversight. CoT monitoring can be used to track undesired behavior, either at training time (e.g., to monitor for reward hacking, as in Baker et al. 2025) or in deployment (e.g., to monitor for unwanted behaviors such as scheming or deception, as in Meinke et al. 2024). However, the effectiveness of CoT monitoring hinges on a critical assumption: that models' reasoning traces accurately reflect their internal decision-making processes.

This assumption has been called into question by prior research. Turpin et al. 2023 demonstrated that "language models don't always say what they think": their reasoning traces often omit the essential factors driving their decisions.

The recent rise of RL for language models introduces an interesting dynamic to this problem. In this new training paradigm, models are typically prompted to provide both a reasoning trace and a final answer, with rewards based on the correctness of the final answer. Given this structure, we might expect reasoning trace