Undesirable training data can lead to undesirable model output. This dynamic is commonly phrased as "garbage in, garbage out," and it is a key issue for frontier models trained on web-scale data. How can we efficiently identify these bad apples in massive training datasets with trillions of tokens? Influence functions (IFs) offer a principled way to approximate each training example's effect on model behavior and thereby solve this tracing problem. Recently, IFs have been deployed to analyze how LLMs produce misaligned and biased outputs in socio-politically relevant benchmarks, suggesting they could become a critical mechanistic tool for AI safety.
What Are Influence Functions?
Influence functions provide a mathematical approximation of how much a single training example affects a model's output through its learned parameters. By estimating the counterfactual scenario of how the output would shift if a specific example were removed or down-weighted, the impact of each data point in the training corpus can be quantified as an influence score.
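Concretely, the classical formulation (as introduced by Koh & Liang) estimates this counterfactual with a gradient and an inverse-Hessian product. A standard way to write it, with notation chosen here for illustration:

```latex
% Influence of training example z on the loss at test example z_test,
% where \hat{\theta} minimizes the average training loss over n examples
% and H is the Hessian of that average loss at \hat{\theta}.
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat{\theta})
```

Removing z from the training set then shifts the test loss by approximately \(-\tfrac{1}{n}\mathcal{I}(z, z_{\text{test}})\); for LLMs the exact Hessian is intractable, so it is replaced by a curvature approximation such as EKFAC.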
Building on Koh & Liang (2017)'s introduction of IFs to deep learning, Grosse et al. (2023) achieved a major milestone by scaling the method to LLMs with up to 52B parameters. Their work demonstrated that in large models, the most influential training data is semantically related to the output, rather than relying on surface-level keyword overlap. To make these computations tractable for complex architectures, they utilized critical advancements in curvature approximations, specifically Kronecker-Factored Approximate Curvature (KFAC) and Eigenvalue-corrected KFAC (EKFAC). Since this foundational work in scaling IFs, the field has continued to optimize these methods. For instance, Choe et al. (2024) accelerated computation with LoGra and released LogIX, while other recent advances include random projection theory (Hu et al., 2026) and proxy models that preserve influence structure (IProX).
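To build intuition for what these methods compute, here is a minimal sketch on a toy least-squares model, where the exact Hessian is cheap enough to use directly (no KFAC/EKFAC machinery). The function names and the damping parameter are illustrative, not from any of the cited works.

```python
import numpy as np

def per_example_grads(X, y, w):
    """Gradient of the 0.5*(residual^2) loss for each training row."""
    residuals = X @ w - y            # shape (n,)
    return residuals[:, None] * X    # shape (n, d)

def influence_scores(X_train, y_train, x_test, y_test, w, damping=1e-3):
    """Koh & Liang-style influence of each training point on the test loss:
    score_i = -g_test^T H^{-1} g_i, using the exact (damped) Hessian of the
    mean squared-error objective. A positive score means removing that point
    would reduce the test loss, i.e. the point is harmful for this test case."""
    n, d = X_train.shape
    H = X_train.T @ X_train / n + damping * np.eye(d)  # Hessian of the mean loss
    g_test = (x_test @ w - y_test) * x_test            # test-loss gradient
    g_train = per_example_grads(X_train, y_train, w)
    return -(g_train @ np.linalg.solve(H, g_test))
```

On a tiny dataset with one mislabeled point, the mislabeled point receives the largest-magnitude (and positive) score, which is exactly the "bad apple" detection use case described above.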
What Can We Do With Influence Functions?
Below we introduce several research examples from our Jinesis AI Lab and EuroSafeAI, where we use IFs to explain undesirable model behavior.
Tracing Misaligned Responses in Safety Evaluation
Influence functions (IFs) provide a mechanistic lens to diagnose systematic vulnerabilities in LLMs, such as those identified in SocialHarmBench (Pandey et al., 2025). When models generate harmful or revisionist outputs, particularly in under-documented regions where adversarial success rates reach 98%, IFs allow researchers to trace these outputs back to specific documents in the training corpus. This process determines whether a failure stems from the absorption of state propaganda, euphemistic media coverage, or simple confabulation due to data scarcity. By pinpointing these causal drivers, developers can implement durable data-level interventions and "unlearn" specific harmful influences, offering a more robust alternative to the superficial patches typically provided by iterative RLHF.
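Once per-document influence scores have been computed for a flagged generation, the tracing step itself is a simple ranking problem. A minimal sketch, assuming scores and document identifiers are already available (the function name and interface are hypothetical):

```python
def trace_output(scores, doc_ids, k=3):
    """Rank training documents by influence on a flagged output, descending.
    Returns the top-k (doc_id, score) pairs as candidate causal drivers
    for human audit (e.g. checking for propaganda or euphemistic sources)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(doc_ids[i], scores[i]) for i in order[:k]]
```

In a real pipeline the expensive part is producing the scores; the audit loop over the top-ranked documents is where the human judgment about propaganda versus confabulation happens.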
Auditing Benchmark Contamination in Capability Evaluation
Benchmark contamination poses a severe threat to the validity of LLM evaluations, as models often achieve high scores through rote memorization rather than genuine reasoning. While standard detection relies on perplexity or string matching, IFs offer a more principled auditing approach by verifying if a model’s performance is driven by exposure to evaluation sets during pre-training. This is further contextualized by Test of Time (Zhang et al., 2025), which demonstrates how temporal degradation and data leakage can inflate perceived capabilities. By tracing the influence of a suspiciously accurate response back to the training distribution, researchers can mechanistically distinguish between generalization and mere memorization, ensuring that safety and capability evaluations remain trustworthy as training corpora grow increasingly opaque.
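One simple heuristic for the memorization-versus-generalization distinction: if most of the positive influence mass for a benchmark answer comes from one or a few training examples (e.g. a leaked copy of the test item), memorization is the likelier explanation; influence spread thinly across many examples is more consistent with generalization. A sketch of that diagnostic, with an illustrative threshold:

```python
def influence_concentration(scores, k=1):
    """Fraction of the total positive influence carried by the top-k examples.
    Values near 1.0 suggest a few dominant (possibly leaked) training items."""
    pos = sorted((s for s in scores if s > 0), reverse=True)
    total = sum(pos)
    if total == 0:
        return 0.0
    return sum(pos[:k]) / total

def flag_contaminated(scores, threshold=0.5):
    """Heuristic contamination flag: a single example dominates the influence mass."""
    return influence_concentration(scores, k=1) >= threshold
```

The 0.5 threshold is arbitrary here; in practice it would be calibrated against known-clean benchmark items.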
Open Challenges and Future Directions
Improving Training Dataset Quality
By design, IFs require open-data models such as OLMo and Nemotron 3; unfortunately, most open-weight models, including DeepSeek, GLM, MiniMax, and Llama, do not publicly release their training data. Even for the existing open-data models, many training corpora contain inaccurate subgroup tags, formatting errors, and sometimes outright incorrect information. We outline a few principles that would significantly benefit IF analysis in future AI data practices. (1) Better categorization and tagging of training data: many web-scale corpora are classified only coarsely (often just by their sources), without fine-grained tags for content, correctness, and format. (2) Better verification and data cleaning: this is the real elephant in the room. There have been reports of errors in current open-data corpora, and browsing a few of them on HuggingFace quickly shows that such errors are fairly frequent. (3) Recognizing that synthetic data is not immune to these errors: much of NVIDIA's open corpus is synthetic, and we often forget that synthetic chain-of-thought traces can simply be wrong, even when the final answer happens to be right. There is still a great deal of foundational work to be done in AI data, for example carefully auditing current open-data corpora and cataloguing their mistakes. It is intensive, heavy-lifting work, but someone needs to do it eventually. (Again, this is an elephant-in-the-room problem, and the AI data workflow is far from the perfect state many imagine it to be!)
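Some of the tagging and verification principles above can be enforced mechanically. Here is a minimal record-level validator; the record schema (a text field plus a tags dict) and the required tag names are purely illustrative, not any specific corpus format:

```python
def validate_record(record, required_tags=("source", "domain", "format")):
    """Return a list of data-quality issues for one corpus record:
    empty text, missing metadata tags, or visible encoding damage."""
    issues = []
    text = record.get("text", "")
    if not text.strip():
        issues.append("empty text")
    tags = record.get("tags", {})
    for tag in required_tags:
        if tag not in tags:
            issues.append(f"missing tag: {tag}")
    if "\ufffd" in text:  # U+FFFD replacement character signals mojibake
        issues.append("encoding error")
    return issues
```

Checks like these catch only the mechanical layer (formatting, missing tags, encoding debris); verifying factual correctness of corpus content still requires the human heavy lifting described above.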
Bridging the Approximation Gap
Running IF analysis over a full, frontier-scale training corpus remains highly compute-intensive. More critically, current methodologies face an unresolved debate over the trade-offs between first-order and second-order approximations. As Wei et al. (2024) note, there are legitimate concerns about whether current approximations maintain sufficient fidelity while keeping compute overhead feasible for academic and alignment research settings. Given that a primary utility of IFs is to save compute relative to retraining a model from scratch, future alignment research must narrow the fidelity gap between first-order and second-order approximations, pushing the Pareto frontier of compute efficiency versus accurate approximation of influence.
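The fidelity gap is easy to see on a problem small enough that the ground truth, actual leave-one-out retraining, is affordable. The sketch below compares a first-order estimate (plain gradient dot product, no curvature) against a second-order estimate (inverse-Hessian preconditioning) on a toy least-squares model; the function names and damping value are illustrative:

```python
import numpy as np

def loo_influence_exact(X, y, x_t, y_t, i):
    """Ground truth: change in test loss when example i is removed
    and the least-squares model is retrained from scratch."""
    w_full = np.linalg.lstsq(X, y, rcond=None)[0]
    mask = np.arange(len(y)) != i
    w_loo = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loss = lambda w: 0.5 * (x_t @ w - y_t) ** 2
    return loss(w_loo) - loss(w_full)

def if_first_order(X, y, x_t, y_t, i):
    """First-order proxy: gradient dot product, ignoring curvature."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    g_i = (X[i] @ w - y[i]) * X[i]
    g_t = (x_t @ w - y_t) * x_t
    return (g_t @ g_i) / len(y)

def if_second_order(X, y, x_t, y_t, i, damping=1e-6):
    """Second-order IF: precondition with the inverse (damped) Hessian."""
    n, d = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    H = X.T @ X / n + damping * np.eye(d)
    g_i = (X[i] @ w - y[i]) * X[i]
    g_t = (x_t @ w - y_t) * x_t
    return (g_t @ np.linalg.solve(H, g_i)) / n
```

On this toy problem the second-order estimate tracks exact retraining far more closely than the first-order one, while costing a Hessian solve; the open question for frontier models is how much of that fidelity survives once the Hessian itself must be approximated.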
Acknowledgment
We sincerely thank Roger Grosse, Lev McKinney, Juhan Bae, and Keenan Samway for helpful research discussions and technical advice on our work.