TL;DR: Influence functions are commonly used to attribute model behavior to its training data. In this paper we explore the reverse: can influence functions be used to craft training data that induces a target model behavior?
We introduce Infusion, a framework that harnesses LLM-scale influence function approximations (Grosse et al., 2023) to compute small perturbations to the original training documents. The goal is to induce targeted changes in model behavior through parameter shifts - without inserting any new documents, just quietly editing ones that are already there.
Method
From Koh & Liang (2017) and Grosse et al. (2023), the influence of a training document $z$ on a measurement $f$ of the model is:
$$\mathcal{I}_f(z) = -\nabla_\theta f(\theta^*)^\top \, H^{-1} \, \nabla_\theta \mathcal{L}(z, \theta^*)$$
where:
$\nabla_\theta f(\theta^*)$ is the gradient of a measurement of a behavior of interest
$H = \nabla^2_\theta \mathcal{L}(\theta^*)$ is the Hessian, describing the local curvature of the model's loss landscape
$\nabla_\theta \mathcal{L}(z, \theta^*)$ is the gradient of the loss on that specific training document
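As a minimal toy illustration of how these quantities combine into a single influence score, here is a sketch with small random dense matrices standing in for the gradients and the Hessian - at LLM scale these are only ever approximated (e.g. with EKFAC), and all names and dimensions here are purely illustrative:

```python
import numpy as np

# Toy sketch of the influence score I(z) = -g_f^T H^{-1} g_z.
# Small random arrays stand in for real model quantities.
rng = np.random.default_rng(0)
d = 5  # toy parameter dimension

g_f = rng.normal(size=d)       # gradient of the measurement f
g_z = rng.normal(size=d)       # gradient of the loss on document z
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)        # symmetric positive-definite "Hessian"

# Influence of document z on measurement f (Koh & Liang, 2017);
# solve(H, g_z) avoids forming H^{-1} explicitly.
influence = -g_f @ np.linalg.solve(H, g_z)
print(influence)
```

A positive score means up-weighting the document would (to first order) increase the measurement, a negative score that it would decrease it.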
In Infusion, we formalize how replacing a document $z$ with a perturbed document $\tilde{z}$ induces a parameter shift
$$\Delta\theta \approx \frac{1}{n} H^{-1} \left( \nabla_\theta \mathcal{L}(z, \theta^*) - \nabla_\theta \mathcal{L}(\tilde{z}, \theta^*) \right)$$
and how this shift changes the measurement via:
$$\Delta f \approx \nabla_\theta f(\theta^*)^\top \Delta\theta$$
We can then solve for the perturbed document $\tilde{z}$ using Projected Gradient Descent to maximize our measurement!
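As a toy sketch of that optimization loop, assuming (purely for illustration) that the change in the document's loss gradient is linear in a perturbation delta, so the predicted measurement change becomes a linear objective in delta; all matrices below are hypothetical stand-ins for quantities EKFAC would approximate at scale:

```python
import numpy as np

# PGD sketch: maximize the predicted measurement change
#   df(delta) ~ -(1/n) * g_f^T H^{-1} J delta
# (from dtheta ~ -(1/n) H^{-1} J delta under the linearization
# grad_L(z + delta) ~ grad_L(z) + J delta), subject to
# ||delta||_inf <= eps. All quantities are toy stand-ins.
rng = np.random.default_rng(1)
d_theta, d_doc, n, eps, lr = 6, 4, 100, 0.1, 0.05

g_f = rng.normal(size=d_theta)             # measurement gradient
A = rng.normal(size=(d_theta, d_theta))
H = A @ A.T + np.eye(d_theta)              # SPD Hessian stand-in
J = rng.normal(size=(d_theta, d_doc))      # d(grad L)/d(document)

# For this linear objective the gradient w.r.t. delta is constant.
obj_grad = -(1.0 / n) * (J.T @ np.linalg.solve(H, g_f))

delta = np.zeros(d_doc)
for _ in range(50):
    delta = delta + lr * obj_grad          # gradient ascent step
    delta = np.clip(delta, -eps, eps)      # project onto the eps-ball

predicted_change = obj_grad @ delta
```

In the real method the objective is not linear, so the gradient is recomputed each step, and for text the perturbation is discrete (token swaps) rather than a continuous delta.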
What we found
We validated Infusion on data poisoning tasks in both vision and language domains.
Vision Models
Insight 1: On CIFAR-10, small edits to just 0.2% of training documents (100/45,000) were competitive with the simpler baseline of directly inserting a small number of explicit examples of the behavior.
Insight 2: Infused corpora transferred across model architectures - a corpus crafted to affect a ResNet also affected a simple CNN on some examples, and vice versa. This suggests that a single edited dataset might be able to compromise multiple independently trained models.
Language Models
We consider two language experiments: a small transformer trained to solve Caesar ciphers, and a small language model pretrained on TinyStories (Eldan & Li, 2023). Here we report fairly weak results - whilst we do find evidence that Infusion can increase the probability of target outputs, the effect was rarely enough to change model behavior.
Insight 3: We find that Infusion struggles against high-confidence models and predictions - document perturbations have limited headroom to shift model behavior, and larger perturbations destroy model coherency.
Here we show the results of discrete PGD perturbation on a training document (probe=bee, target=cat).
Insight 4: Sometimes we are able to increase the probability of a target animal, sometimes we aren't! And the shifts are tiny, rarely enough to flip predictions.
Insight 5: During the Caesar cipher experiments we found that Infusion works best at amplifying behaviors and patterns that already exist in the model. In Caesar ciphers, the model learns to exploit spatial frequency, and Infusion's performance maps directly onto this pattern.
Limitations
This last insight is the most important. Infusion - as it stands - is better understood as a way to amplify existing tendencies of the model rather than as a way to install new ones, and our results on language are weak. For pretrained models, we would expect most relevant behaviors to already be present in the model, so Infusion may have more success there.
Scalability is also a constraint: while EKFAC makes the Hessian approximation tractable, the method is still relatively expensive and slow.
Discussion
The ability to shape model behavior through subtle, hard-to-detect edits to training data has obvious security implications. The framework is also by nature dual-use: the same machinery an adversary might use to poison a model could, in principle, be used by a defender to patch undesired behaviors at the data level.
As models are trained on ever-larger corpora assembled from diverse and loosely verified sources, understanding this attack surface is increasingly important, and we hope this work sparks further research into training data attribution (TDA) at LLM scale.
Paper: https://arxiv.org/abs/2602.09987
Code: https://github.com/jrosseruk/infusion