Yuqi Sun — LessWrong

6

4mo

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

by michaelwaves, Yanjo, and Yuqi Sun

TLDR: SAEs can complement and enhance LLM-as-a-judge scalable oversight for uncovering hypotheses over large datasets of LLM outputs.

Paper abstract:

> Large language models (LLMs) are increasingly trained in long-horizon, multi-agent environments, making it difficult to understand how behavior changes over training. We apply pretrained SAEs, alongside...