Towards a Science of Evals for Sycophancy
This work was conducted as my final project for the AI Safety Fundamentals course. Due to time constraints, certain choices were made to simplify some aspects of this research.

Intro

Sycophancy, the tendency to agree with someone’s opinions, is a well-documented issue in large language models (LLMs). These models can...
Feb 1, 2025