Shi

LLM Evaluators Recognize and Favor Their Own Generations

by Arjun Panickssery, Sam Bowman, and Shi

LLM self-evaluation is used in reward modeling, model-based benchmarks such as GPTScore and AlpacaEval, self-refinement, and constitutional AI. LLMs have been shown to approximate human annotators accurately on some tasks. But these methods are threatened by self-preference, a bias in which an LLM evaluator scores its own outputs...

Apr 17, 2024

LLM Evaluators Recognize and Favor Their Own Generations

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs

Sycophancy Towards Researchers Drives Performative Misalignment
