This post summarizes and comments on “Motivating the Rules of the Game for Adversarial Example Research” (Gilmer et al., 2018).
Summary of the paper
Despite the amount of recent work, human-imperceptible perturbation attacks (e.g., the One Pixel Attack) are not as useful as researchers may think, for two reasons:
- They are not based on realistic attacks against these AI systems.
> we were unable to find a compelling example that required indistinguishability.

> ... the best papers on defending against adversarial examples carefully articulate and motivate a realistic attack model, ideally inspired by actual attacks against a real system.
There are much better attack methods that a real adversary could use:
- Test-set attack. Just keep feeding the system natural inputs until it makes an error. As long as the system is not error-free, this will eventually succeed.
- It has been suggested that perturbation attacks could fool speed cameras, by adding human-invisible dots to a license plate. But if one is actually caught speeding, the algorithm's failure would simply prompt a human review. A far more practical attack is a clear spray intended to overexpose any photograph of the license plate.
- Similarly, it has been suggested that a perturbation attack could make a self-driving car misidentify a stop sign. But any robust self-driving car must already handle situations much worse than that, such as the stop sign being missing or occluded, or people breaking traffic rules.
- Against security cameras that automatically identify blacklisted people, one can simply wear a good mask. And since there are probably humans monitoring the camera feed, an attacker must fool both the camera and the humans: imagine a monitoring system that displays each captured face alongside its best matches from the database; a human monitor would immediately notice the discrepancy.
- They are not very fruitful for improving robustness.
> In practice, the best solutions to the l_p problem are essentially to optimize the metric directly and these solutions seem not to generalize to other threat models.
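To make "optimize the metric directly" concrete: these defenses (adversarial training) generate worst-case l_p perturbations during training and train against them. A minimal sketch of the one-step l_inf attack (the fast gradient sign method); the linear model and the numbers here are illustrative, not from the paper:

```python
import numpy as np

def fgsm_perturbation(x, grad, eps):
    """One-step l_inf attack: move each input coordinate by eps in the
    direction that increases the loss (the sign of the gradient)."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy linear "classifier" with score w.x; the loss gradient w.r.t. x is
# proportional to w. (Hypothetical values for illustration only.)
w = np.array([0.5, -0.8, 0.3])
x = np.array([0.2, 0.9, 0.4])
x_adv = fgsm_perturbation(x, w, eps=0.05)

# The change is tiny under l_inf (no coordinate moves more than eps) ...
assert np.max(np.abs(x_adv - x)) <= 0.05 + 1e-9
# ... yet the classifier score shifts by eps * ||w||_1, the worst case
# for this budget.
```

The point of the quote is that defending against exactly this perturbation set says little about masks, sprays, or any attack outside the l_p ball.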
If so much work has been done for such dubious gains, I have two bitter questions:
- Why did researchers work on perturbation attacks so much?
- Why are these works so fun to read?
The second question partially answers the first: because they are fun. But that can't be the only explanation. I think the other explanation is that perturbational adversarial examples are easy: they can be defined in one short equation, and generated without domain knowledge (just like neural networks themselves).
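For reference, that one short equation is the standard l_p threat model: find the small perturbation that maximizes the loss (the symbols here are the usual ones, not necessarily the paper's notation):

```latex
\max_{\|\delta\|_p \le \epsilon} \; L\bigl(f_\theta(x + \delta),\, y\bigr)
```

Everything about the attack is captured by a norm, a budget, and a loss; no knowledge of the task domain enters anywhere.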
As for why these works are so fun to read, I think it's because they are extremely humorous, and they confirm comforting beliefs about human superiority. The humor comes from the contrast between tiny perturbations in input and big perturbations in output, between incomprehensible attacks and comprehensible results, between the strange behavior of neural networks and the familiar behavior of humans.
Gilmer, Justin, Ryan P. Adams, Ian Goodfellow, David Andersen, and George E. Dahl. “Motivating the Rules of the Game for Adversarial Example Research.” arXiv preprint arXiv:1807.06732, 2018.