Explaining undesirable model behavior: (How) can influence functions help?
Undesirable training data can lead to undesirable model output. This dynamic is commonly phrased as "garbage in, garbage out," and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets spanning trillions of tokens? Influence functions (IFs) offer a principled way to approximate each example's effect and address this tracing problem. Recently, IFs have been used to analyze how LLMs produce misaligned and biased outputs on socio-politically relevant benchmarks, suggesting they could become a critical mechanistic tool for AI safety.

What Are Influence Functions?

Influence functions provide a mathematical approximation of how much a single training example affects a trained model's parameters and, in turn, its outputs. By estimating the counterfactual of how the output would shift if a specific example were removed or down-weighted, they assign each data point in the training corpus an influence score. Building on the introduction of IFs to deep learning by Koh & Liang (2017), Grosse et al. (2023) achieved a major milestone by scaling the method to LLMs with up to 52B parameters. Their work demonstrated that in large models, the most influential training data is semantically related to the output, rather than merely overlapping in surface-level keywords. To make these computations tractable for complex architectures, they relied on advances in curvature approximation, specifically Kronecker-Factored Approximate Curvature (KFAC) and Eigenvalue-corrected KFAC (EKFAC). Since this foundational scaling work, the field has continued to optimize these methods: Choe et al. (2024) accelerated computation with LoGra and released LogIX, while other recent advances include random projection theory (Hu et al., 2026) and proxy models that preserve influence structure (IProX).

What Can We Do With Influence Functions?

Below we introduce several research examples at
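To make the counterfactual estimate concrete: in the classic formulation of Koh & Liang (2017), the influence of a training point z on a test point z_test is I(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z), where H is the Hessian of the training loss. Below is a minimal sketch on a toy logistic-regression model; all function names and the toy data are illustrative, not from any library, and this exact inverse-Hessian computation is what KFAC/EKFAC approximate at scale.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y, lam=1e-2):
    # Gradient of the L2-regularized log loss for one example (x, y), y in {0, 1}.
    p = sigmoid(x @ w)
    return (p - y) * x + lam * w

def hessian(w, X, lam=1e-2):
    # Hessian of the mean regularized log loss over the training set
    # (positive definite thanks to the lam * I term).
    p = sigmoid(X @ w)
    s = p * (1 - p)
    return (X.T * s) @ X / len(X) + lam * np.eye(X.shape[1])

def influence(w, X_train, y_train, x_test, y_test):
    # Koh & Liang's I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
    # A negative score means up-weighting z lowers the test loss (z is helpful);
    # equivalently, removing z would raise it.
    H_inv = np.linalg.inv(hessian(w, X_train))
    g_test = grad_loss(w, x_test, y_test)
    return np.array([-g_test @ H_inv @ grad_loss(w, x, y)
                     for x, y in zip(X_train, y_train)])

# Train by plain gradient descent on a toy dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w = np.zeros(3)
for _ in range(500):
    w -= 0.1 * np.mean([grad_loss(w, x, t) for x, t in zip(X, y)], axis=0)

# One influence score per training example, for the query z_test = (X[0], y[0]).
scores = influence(w, X, y, X[0], y[0])
```

Ranking the training set by these scores is exactly the tracing step described above; the self-influence score (a point's influence on itself) is always non-positive under this sign convention, since the Hessian inverse is positive definite.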