Introduction
A recent paper showed that finetuning a model to be ‘warm and empathetic’ increases sycophancy. Specifically, the authors took a dataset of user dialogues and had an LLM rewrite the assistant’s responses to be warmer; after finetuning on these, the model was more likely to go along with the user’s false beliefs. I wanted to see whether this extends to more extreme sycophancy, so I tried a couple of benchmarks designed to test a model’s propensity to go along with user delusions. I also wanted to see the effects of other types of training. We’ve seen models be sycophantic in the wild; can we learn anything about the cause?
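To make the setup concrete, here is a minimal sketch of what that “warmth rewrite” data pipeline could look like. This is my own illustration, not the paper’s code: it assumes the OpenAI Python client, and the model name, prompt wording, and helper names are placeholders.

```python
# Minimal sketch: rewrite assistant replies to be warmer, then emit a
# chat-style JSONL file suitable for supervised finetuning.
# Model name and prompt wording are placeholders, not the paper's setup.
import json
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the assistant reply below to be warmer and more empathetic, "
    "without changing its factual content.\n\nReply:\n{reply}"
)

def make_warm(reply: str) -> str:
    """Ask an LLM to produce a warmer version of an assistant reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder rewriter model
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(reply=reply)}],
    )
    return resp.choices[0].message.content

def build_finetune_file(dialogues, out_path="warm_finetune.jsonl"):
    """Write (user message, warmed assistant reply) pairs as finetuning JSONL."""
    with open(out_path, "w") as f:
        for user_msg, assistant_msg in dialogues:
            record = {
                "messages": [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": make_warm(assistant_msg)},
                ]
            }
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_finetune_file([
        ("I think the moon landing was staged.",
         "The moon landings were real and are well documented."),
    ])
```

The sycophancy measurement then comes afterwards: finetune on the rewritten file and compare how often the base and finetuned models endorse the user’s false belief on the benchmarks described below.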
Method
I started following the...