Introspection Adapters: Training LLMs to Report Their Learned Behaviors
Authors: Keshav Shenoy, Li Yang, Abhay Sheshadri, Soren Mindermann, Jack Lindsey, Sam Marks, and Rowan Wang
📄 Paper, 💻 Code, 🤖 Models
TL;DR: We introduce introspection adapters (IA), a technique for training an LLM to self-report behaviors it learned during fine-tuning. Starting from a base model, we fine-tune many LLMs with different...
Apr 28