This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance.
Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any setup change, experiment advice, or idea is welcome. We also welcome pointers to any relevant papers we may have missed. For a better introduction to interpreting adversarial attacks, we recommend reading scasper’s post: EIS IX: Interpretability and Adversaries.
Code is available on GitHub (still a draft).
Adversarial samples are a concerning failure mode of deep learning models. It has been shown that by optimising an input, e.g. an image, the output of the targeted model can be shifted at the attacker’s will. This failure mode appears repeatedly across domains such as image classification and text encoders, and more recently in multimodal models, enabling arbitrary outputs or jailbreaking.
This work was partly inspired by the feature vs bug world hypothesis. In our particular context this would imply that adversarial attacks might be caused by meaningful feature directions being activated (feature world) or by high-frequency directions overfitted by the model (bug world). The hypotheses we would like to test are thus the following:
First, we briefly present the background we used for adversarial attacks and linear probing. Then we present our setup and experiments to understand the impact of adversarial attacks on linear probes and to see whether we can detect them naively. Finally, we present the limitations and perspectives of our work before concluding.
For our experiment, we begin with the most basic and well-known adversarial attack, the Fast Gradient Sign Method (FGSM). This intuitive method takes advantage of the classifier's differentiability to optimise the adversary's goal (misclassification) with gradient steps. It can be described as:

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_{x}\, \mathcal{L}(\theta, x, y)\big)$$

with $x$ the original image, $y$ its label, $x_{adv}$ the adversarial example and $\epsilon$ the perturbation amplitude. This simple method can be derived in an iterative form:

$$x^{t+1} = x^{t} + \alpha \cdot \mathrm{sign}\big(\nabla_{x^{t}}\, \mathcal{L}(\theta, x^{t}, y)\big)$$

and seen as projected gradient descent, with the projection ensuring that $\|x^{t+1} - x\|_{\infty} \leq \epsilon$. In our experiments we use this iterative, projected form, applied for a few steps.
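To make the attack concrete, here is a minimal PyTorch sketch of the iterative, projected variant. It is not the code from our repository: `model`, `loss_fn` and the hyper-parameter values (`eps`, `alpha`, `steps`) are illustrative placeholders.

```python
import torch

def iterative_fgsm(model, loss_fn, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Iterative FGSM: signed gradient steps on the classification loss, projected
    back onto the L-infinity ball of radius eps around the original image.
    Hyper-parameters are common illustrative defaults, not the values we used."""
    x_orig = x.detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)                         # loss to increase (untargeted)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                 # FGSM step
            x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)  # projection onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                           # keep pixel values valid
    return x_adv.detach()
```

For a targeted attack (e.g. pushing a lemon towards “tomato”), the same step would instead descend the loss computed against the target label.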
Linear probing is a simple idea: you train a linear model (probe) to predict a concept from the internals of the interpreted target model. The prediction performance is then attributed to the knowledge contained in the target model's latent representation rather than to the simple linear probe. Here we are mostly interested in binary concepts, e.g. feature presence/absence or adversarial/non-adversarial, and the optimisation can be formulated as minimising the following loss:

$$\mathcal{L}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(p_{\phi}(h_i),\, c_i\big) + \lambda\, \Omega(\phi)$$

where $h_i$ is an internal activation of the target model, $c_i \in \{0, 1\}$ the binary concept label, $\ell$ the binary cross-entropy and $\Omega$ a regularisation penalty. In our experiments, the probe $p_{\phi}$ is a logistic regression with a norm penalty on its weights.
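For concreteness, a single probe (one concept, one layer) can be trained with scikit-learn as sketched below. The activation matrices and concept labels are dummy placeholders; in practice they would come from the vision encoder and our annotations, and the regularisation settings shown are only scikit-learn defaults, not tuned choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# H_train, H_test: (n_samples, d_model) activations from one layer of the target model
# c_train, c_test: binary concept labels (e.g. 1 = "sphere" present, 0 = absent)
rng = np.random.default_rng(0)
H_train, c_train = rng.normal(size=(256, 768)), rng.integers(0, 2, size=256)
H_test, c_test = rng.normal(size=(64, 768)), rng.integers(0, 2, size=64)

probe = LogisticRegression(max_iter=1000)  # penalised logistic regression (default settings)
probe.fit(H_train, c_train)
print("concept accuracy:", accuracy_score(c_test, probe.predict(H_test)))
```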
In order to explore image concepts, we first focused on a small category of images, fruits and vegetables, before choosing easy concepts like shapes and colours. We proceeded as follows:
- Validate that the probes are detecting the concepts and not something else that is correlated with the class (we are not yet totally confident about our probes; more in the limitations).
- See what kind of features (if any) adversarial attacks find, e.g. an increase of the probes' accuracy on features unrelated to the input sample but related to the target sample (see the sketch below).
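The second point boils down to tracking the probes' output probabilities as the attack progresses. A hedged sketch of that loop is below; `get_activations` (returning a pooled activation for one layer) and `attack_step` (one iteration of the attack above) are hypothetical helpers, not functions from our repository.

```python
import torch

def probe_trajectory(model, probes, get_activations, attack_step, x, y, n_steps=10):
    """Record, after each adversarial step, the probability that each fitted
    scikit-learn probe assigns to its concept on the perturbed input.
    `probes` maps concept name -> fitted probe; the other helpers are placeholders."""
    history = {name: [] for name in probes}
    x_adv = x.clone()
    for step in range(n_steps + 1):
        with torch.no_grad():
            h = get_activations(model, x_adv).cpu().numpy()   # shape (1, d_model)
        for name, probe in probes.items():
            history[name].append(float(probe.predict_proba(h)[0, 1]))
        if step < n_steps:
            x_adv = attack_step(model, x_adv, y)              # one more adversarial step
    return history
```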
For this experiment, we show that, given an input image of a lemon, only a few adversarial steps are needed for the model to output the desired target, a tomato in our case (see Figures 4-6).
We compare the presence of two concepts, “ovaloid” and “sphere” (the shapes of the lemon and the tomato respectively), in the last layer of the vision encoder. Figure 7 shows that the “ovaloid” concept characterising the lemon gradually disappears as the number of adversarial steps increases, while the “sphere” concept marking the tomato starts to appear (see Figure 8).
We noted that the “sphere” concept, which characterises a tomato, does not appear in the first layers of the encoder and remains unchanged there despite the adversarial steps (see Figures 9 and 10 for layers 0 and 6 respectively).
We then focus on a concept that characterises both a lemon and a banana, the colour “yellow”. The adversarial attack does not change the representation of this concept across layers or adversarial steps (see Figures 11, 12 and 13).
This section contains early-stage claims that are not fully backed yet, so we would love to hear feedback on this methodology.
We noticed interesting concept activation patterns across layers, and we believe they could help detect adversarial attacks once formalised. We spotted two patterns:
Looking at the concept curve (e.g. on the CLS token in Figure 14), we see that the transition happens mostly at the last layer. This could indicate that our adversarial attacks mostly target later layers and thus could be detected by spotting incoherent patterns like the probability trajectory in Figure 14 (L0: 0.2 -> L4: 0.35 -> L8: 0.8 -> L10: 0.0).
Intuitively, this would only hold for simple concepts that are already represented in the early layers.
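A possible way to formalise the first pattern, under the strong assumption that a coherent concept should not collapse after having built up across layers, is sketched below; the threshold is an arbitrary illustrative value, not something we have validated.

```python
import numpy as np

def is_incoherent(layer_probs, drop_threshold=0.5):
    """Flag a layer-wise concept-probability trajectory as suspicious when the
    probability builds up across layers and then collapses at a later layer."""
    probs = np.asarray(layer_probs, dtype=float)
    running_max = np.maximum.accumulate(probs)     # best probability seen so far
    return bool((running_max - probs).max() >= drop_threshold)

print(is_incoherent([0.2, 0.35, 0.8, 0.0]))  # True: collapse at the last layer (Figure 14-like)
print(is_incoherent([0.2, 0.35, 0.6, 0.8]))  # False: coherent build-up
```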
We present what we are excited to tackle next in the perspectives section.
Interpretability is known to suffer from illusions, and linear probing is no exception. We are not totally confident that our probes measure their associated concepts. In this respect, we aim to further improve and refine our dataset.
We trained one probe per concept and per layer (on all visual tokens). Our intuition is that information is mixed by the cross-attention mechanism and similarly represented across tokens for such simple concepts.
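To illustrate how the per-layer, per-token activations feeding the probes can be extracted, here is a sketch using a CLIP vision encoder from Hugging Face `transformers`. The checkpoint and the solid-colour dummy image are assumptions for the example, not necessarily the backbone or data we use.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Illustrative backbone choice; the post does not pin down the exact checkpoint.
name = "openai/clip-vit-base-patch32"
model = CLIPVisionModel.from_pretrained(name).eval()
processor = CLIPImageProcessor.from_pretrained(name)

image = Image.new("RGB", (224, 224), "yellow")  # stand-in for a lemon image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: one (batch, tokens, d_model) tensor per layer (plus the embeddings);
# token 0 is the CLS token, the remaining tokens are the visual (patch) tokens.
for layer, h in enumerate(out.hidden_states):
    print(layer, tuple(h.shape))
```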
We are not yet tackling adversarial attacks that have been labelled as genuine bugs. We are interested in testing our method on them, as they could remain a potential threat.
We now describe what we are excited to test next and present possible framings of this study for future work.
First, we want to make sure that we are training the correct probes. Some concepts are entangled with class proxies, e.g. “pulp” is mostly equivalent to orange + lemon. To address this, we are thinking of ways to augment our labelled dataset using basic CV tricks.
We are moving to automated labelling using GPT-4V, while keeping a human in the loop to oversee the produced labels. We are also thinking about creating concept segmentation maps, as the localisation of concepts might be relevant (cf. the heatmaps above).
In the longer term, we are also willing to explore the following tweaks:
Our method doesn’t guarantee that we’ll be able to detect every adversarial attack, but it shows that some mechanistic measures can serve as sanity safeguards. Linear probes are cheap to train and to run at inference time, and they can help detect adversarial attacks and possibly other kinds of anomalies.
We collected a fruits and vegetables image classification dataset from Kaggle (insert link). The dataset was cleaned and manually labelled by experts. The preprocessing included a deduplication step to remove duplicate images and a filtering step for images present in both the train and test sets, which would otherwise cause data leakage (a sketch of this step is given below).
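A minimal sketch of that preprocessing, assuming exact duplicates and simple on-disk image folders (near-duplicates would instead need perceptual hashing), could look like this; the paths and file pattern are illustrative.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash used to spot exact duplicate images."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def dedup_and_remove_leakage(train_dir: str, test_dir: str, pattern: str = "*.jpg"):
    """Keep each train image once, and drop any train image whose content also
    appears in the test set (which would otherwise leak into evaluation)."""
    test_hashes = {file_hash(p) for p in Path(test_dir).rglob(pattern)}
    seen, kept = set(), []
    for p in sorted(Path(train_dir).rglob(pattern)):
        h = file_hash(p)
        if h in seen or h in test_hashes:  # duplicate within train, or train/test overlap
            continue
        seen.add(h)
        kept.append(p)
    return kept
```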
For the annotation process, two annotators labelled each image with the concepts appearing in it; each image is associated with at least one concept. An additional step was required to resolve annotation conflicts and correct the concept labels.
After preprocessing, the train set contains 2,579 samples and the test set 336 samples, with 36 classes in each; the class distributions over the sets are shown in Figure 17.
The chosen concepts concern the colour, shape and background of the images. Figure 18 shows the distribution of concepts over the whole dataset.