keshavs — LessWrong

Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Authors: Keshav Shenoy, Li Yang, Abhay Sheshadri, Soren Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang

📄 Paper, 💻 Code, 🤖 Models

TL;DR: We introduce introspection adapters (IA), a technique for training an LLM to self-report behaviors it learned during fine-tuning. Starting from a base model, we fine-tune many LLMs with different...