Rachel Freedman

AIS researcher at CHAI. London/Berkeley. Cats are not model-based reinforcement learners.

Your LLM Judge may be biased

Abstract: AI safety researchers often rely on LLM “judges” to qualitatively evaluate the output of separate LLMs. We try this for our own interpretability research, but find that our LLM judges are often deeply biased. For example, we use Llama2 to judge whether movie reviews are more “(A) positive” or...

Mar 29, 2024 · 37

CIRL Corrigibility is Fragile

Tl;dr: An optimal CIRL agent is corrigible under a certain problem formulation and given certain assumptions. By examining three key assumptions—that the game is played only once, that the human is rational, and that human feedback is free—we demonstrate how minor changes can lead to incorrigible CIRL agents. This suggests...

Dec 21, 2022 · 58

LESSWRONG