Introduction
A recent paper showed that finetuning a model to be ‘warm and empathetic’ increases sycophancy. Specifically, the authors took a dataset of user dialogues and had an LLM rewrite the assistant’s responses to be warmer; after finetuning on these, the model was more likely to go along with the user’s false beliefs. I wanted to see whether this extends to more extreme sycophancy, so I tried a couple of benchmarks designed to test a model’s propensity to go along with user delusions. I also wanted to see the effects of other types of training. We’ve seen models be sycophantic in the wild; can we learn anything about the cause?
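To make the setup concrete, here is a minimal sketch of what that “warmth rewrite” data pipeline could look like. This is my own illustration, not the paper’s code: it assumes the OpenAI Python client, and the model name, prompt wording, and helper names are placeholders.

```python
# Minimal sketch: rewrite assistant replies to be warmer, then emit a
# chat-style JSONL file suitable for supervised finetuning.
# Model name and prompt wording are placeholders, not the paper's setup.
import json
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the assistant reply below to be warmer and more empathetic, "
    "without changing its factual content.\n\nReply:\n{reply}"
)

def make_warm(reply: str) -> str:
    """Ask an LLM to produce a warmer version of an assistant reply."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder rewriter model
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(reply=reply)}],
    )
    return resp.choices[0].message.content

def build_finetune_file(dialogues, out_path="warm_finetune.jsonl"):
    """Write (user message, warmed assistant reply) pairs as finetuning JSONL."""
    with open(out_path, "w") as f:
        for user_msg, assistant_msg in dialogues:
            record = {
                "messages": [
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": make_warm(assistant_msg)},
                ]
            }
            f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    build_finetune_file([
        ("I think the moon landing was staged.",
         "The moon landings were real and are well documented."),
    ])
```

The sycophancy measurement then comes afterwards: finetune on the rewritten file and compare how often the base and finetuned models endorse the user’s false belief on the benchmarks described below.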
Method
I started following the...