TL;DR: Influence functions are commonly used to attribute model behavior to its training data. In this paper we explore the reverse: can influence functions be used to craft training data that induces a target model behavior? We introduce Infusion, a framework that harnesses LLM-scale influence-function approximations (Grosse et...
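The excerpt is cut off before the method details, but the attribution direction it references is usually formalized via the classic influence-function approximation of Koh & Liang (2017). As a sketch for orientation (the paper's actual estimator, e.g. the Grosse et al. LLM-scale approximation, may differ):

```latex
% Approximate influence of training example z on the loss at a test
% point z_test, for parameters \hat{\theta} minimizing empirical risk,
% where H_{\hat{\theta}} is the Hessian of the training loss:
\mathcal{I}(z, z_{\mathrm{test}})
  \;=\; -\,\nabla_\theta L(z_{\mathrm{test}}, \hat{\theta})^{\top}
        \, H_{\hat{\theta}}^{-1} \,
        \nabla_\theta L(z, \hat{\theta})
```

Intuitively, a large positive value means upweighting z would lower the test loss; "crafting" data would then mean searching for z that pushes this quantity in a chosen direction.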
I genuinely think this is the fastest way to get set up on a brand-new mech-interp project. It takes you from nothing to a fully working remote GPU dev environment (SSH, VS Code/Cursor, CUDA, PyTorch, TransformerLens, GitHub, and UV) with as little friction as possible. It’s exactly the setup I...
TL;DR: We find that subliminal learning can occur through dataset paraphrasing: fine-tuned models can inherit unintended biases from seemingly innocuous data that resembles in-the-wild natural language. This implies that paraphrasing datasets with biased teacher models could serve as an attack vector for malicious actors! While the recent...