Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
by michaelwaves, Yanjo, and Yuqi Sun
TL;DR: SAEs can complement and enhance LLM-as-a-judge scalable oversight for uncovering hypotheses over large datasets of LLM outputs.

Paper abstract:

> Large language models (LLMs) are increasingly trained in long-horizon, multi-agent environments, making it difficult to understand how behavior changes over training. We apply pretrained SAEs, alongside...
Feb 6