Announcing the CNN Interpretability Competition

Interpretability research is popular, and interpretability tools play a role in almost every agenda for making AI safe. However, there are some gaps between the research and engineering applications. If one of our main goals for interpretability research is to help us align highly intelligent AI systems in high-stakes settings, we need more tools that help us better solve practical problems.

One of the unique advantages of interpretability tools is that, unlike test sets, they can sometimes allow humans to characterize how networks may behave on novel examples. For example, Carter et al. (2019), Mu and Andreas (2020), Hernandez et al. (2021), Casper et al. (2022a), and Casper et al. (2023) have all used different interpretability tools to identify novel combinations of features that serve as adversarial attacks against deep neural networks.

Interpretability tools are promising for exercising better oversight, but human understanding is hard to measure, and it has been difficult to make clear progress toward more practically useful tools. Here, we work to address this by introducing the CNN Interpretability Competition (accepted to SaTML 2024).

The key to the competition is to develop interpretations of the model that help human crowdworkers discover trojans: specific vulnerabilities implanted into a network in which a certain trigger feature causes the network to produce unexpected output. In addition, we also offer an open-ended challenge for participants to discover the triggers for secret trojans by any means necessary.

The motivation for this trojan-discovery competition is that trojans are bugs caused by novel trigger features -- they usually can’t be identified by analyzing model performance on some readily available dataset. This makes finding them a challenging debugging task that mirrors the practical challenge of finding unknown bugs in models. However, unlike naturally occurring bugs in neural networks, the trojan triggers are known to us, so it will be possible to know when an interpretation is causally correct or not. In the real world, not all types of bugs in neural networks are likely to be trojan-like. However, benchmarking interpretability tools using trojans can offer a basic sanity check.

The Benchmark

This competition follows new work from Casper et al. (2023) (to appear at NeurIPS 2023), in which we introduced a benchmark for interpretability tools based on helping human crowdworkers discover trojans that had interpretable triggers. We used 12 trojans of three different types: ones triggered by patches, by styles, and by naturally occurring features.

An example each of a style, patch, and natural feature trojan. Details on all trojans are in the table below.

We then evaluated 9 methods meant to help detect trojan triggers: TABOR (Guo et al., 2019), four variants of feature visualization (Olah et al., 2017; Mordvintsev et al., 2018), adversarial patches (Brown et al., 2017), two variants of robust feature-level adversaries (Casper et al., 2022a), and SNAFUE (Casper et al., 2022b). We tested each based on how much it helped crowdworkers identify trojan triggers in multiple-choice questions. Overall, this work found some successes: adversarial patches, robust feature-level adversaries, and SNAFUE were relatively successful at helping humans discover trojan triggers.

Results for all 12 trojans across all 9 methods plus a tenth method that used each of the 9 together. Each cell shows the proportion of the time crowdworkers guessed the trojan trigger correctly in a multiple-choice question. There is a lot of room for improvement.

However, even the best-performing method -- a combination of all 9 tested techniques -- helped humans correctly identify trojans from multiple-choice questions less than half of the time. The primary goal of this competition is to improve on these methods.

In contrast to prior competitions such as the Trojan Detection Challenges, this competition focuses on interpretable trojans in ImageNet CNNs, including natural-feature trojans.

Main competition: Help humans discover trojans >= 50% of the time with a novel method

Prize: $4,000 for the winner and shared authorship in the final report for all submissions that beat the baseline.

The best method tested in Casper et al. (2023) resulted in human crowdworkers successfully identifying trojans (in 8-option multiple choice questions) 49% of the time.
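To make the target concrete, here is a minimal sketch of how a success rate could be computed from crowdworker answers and compared against the 49% baseline, the 50% target, and the 12.5% chance rate of an 8-option question. The aggregation rule (an unweighted mean of per-trojan success rates), the function name `success_rate`, and the toy trojan names and answers are all assumptions for illustration; the actual survey analysis may differ.

```python
from statistics import mean

CHANCE = 1 / 8    # random guessing on an 8-option multiple-choice question
BASELINE = 0.49   # best method reported in Casper et al. (2023)
TARGET = 0.50     # threshold named in the main competition heading


def success_rate(responses):
    """Return (per-trojan rates, overall rate).

    `responses` maps a trojan name to a list of booleans, one per
    crowdworker answer (True = the worker picked the correct trigger).
    Assumption: the overall rate is the unweighted mean over trojans.
    """
    per_trojan = {
        name: mean(1.0 if correct else 0.0 for correct in answers)
        for name, answers in responses.items()
    }
    return per_trojan, mean(per_trojan.values())


# Hypothetical toy data: 3 trojans with 4 answers each (not real results).
toy_responses = {
    "trojan_a": [True, True, False, True],
    "trojan_b": [False, True, False, False],
    "trojan_c": [True, False, True, False],
}

per_trojan, overall = success_rate(toy_responses)
print(per_trojan)
print(f"overall={overall:.3f}  chance={CHANCE:.3f}  baseline={BASELINE:.2f}")
print("meets target:", overall >= TARGET)
```

Note that the chance rate is well below the baseline, so a 49% success rate already reflects substantial help from the visualizations.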

How to submit:

Submit a set of 10 machine-generated visualizations (or other media, e.g. text) for each of the 12 trojans, a brief description of the method used, and code to reproduce the images. In total, this will involve 120 images (or other media), but please submit them as 12 images, each containing a row of 10 sub-images.
Once we check the code and images, we will use your data to survey 100 knowledge workers using the same method as we did in the paper.
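
The requested layout (one image per trojan, each containing a row of 10 sub-images) can be produced with a few lines of NumPy. This is only a sketch under assumptions: the helper name `tile_row`, the 224x224 RGB size, the white 2-pixel spacer, and the random stand-in data are illustrative choices, not requirements of the competition.

```python
import numpy as np


def tile_row(images, pad=2, pad_value=255):
    """Concatenate equally sized HxWxC images left to right.

    A `pad`-pixel-wide spacer filled with `pad_value` is placed between
    neighbouring sub-images so they stay visually separate.
    """
    if not images:
        raise ValueError("need at least one image")
    shape = images[0].shape
    if any(img.shape != shape for img in images):
        raise ValueError("all sub-images must share one shape")
    height, _, channels = shape
    spacer = np.full((height, pad, channels), pad_value, dtype=images[0].dtype)
    pieces = []
    for i, img in enumerate(images):
        if i > 0:
            pieces.append(spacer)
        pieces.append(img)
    return np.concatenate(pieces, axis=1)


# Stand-in data: 10 random 224x224 RGB "visualizations" for one trojan.
rng = np.random.default_rng(0)
visualizations = [
    rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8) for _ in range(10)
]
row = tile_row(visualizations)
print(row.shape)  # (224, 2258, 3): 10 * 224 pixels plus 9 spacers of 2 pixels
```

Repeating this for each of the 12 trojans yields the 12 row images described above; the resulting array can then be saved with whatever image library you prefer.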

We will desk-reject submissions that are incomplete (e.g. not containing code), not reproducible using the code sent to us, or produced entirely with off-the-shelf code written by someone other than the submitters. The best-performing solution at the end of the competition will win.

Bonus challenge: Discover the four secret natural feature trojans by any means necessary

Prize: $1,000 per trojan, split among all submitters who correctly identify that trojan, plus shared authorship in the final report.

The trojaned network has 12 disclosed trojans and 4 additional secret ones (the bottom four rows of the table below).

How to submit:

Share with us a guess for one of the trojans, along with code to reproduce whatever method you used to make the guess and a brief explanation of how this guess was made. One guess is allowed per trojan per submitter.

The $1,000 prize for each of the 4 trojans will be split between all successful submissions for that trojan.

All 16 trojans for this competition. The first 12 are for the main competition, while the final 4 are for the bonus challenge.

What techniques might succeed?

Different tools for synthesizing features differ in the priors they place over the generated feature. For example, TABOR (Guo et al., 2019) imposes a weak prior, while robust feature-level adversaries (Casper et al., 2022a) impose a strong one. Since the trojans for this competition are human-interpretable, we expect methods that visualize trojan triggers with highly regularized features to be useful. Additionally, we found in Casper et al. (2023) that combinations of methods succeeded more often than any individual method on its own, so techniques that produce diverse synthesized features may have an advantage. We also found that style trojans were the most difficult to discover, so methods that are well suited to finding them will be novel and useful. Finally, remember that you can think outside the box! For example, captioned images are fair game.
