Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2309.00236

You can try our interactive demo! (Or read our preprint.) 

Here, we want to explain why we care about this work from an AI safety perspective.

Concerning Properties of Image Hijacks

What are image hijacks?

To the best of our knowledge, image hijacks constitute the first demonstration of adversarial inputs for foundation models that

  • force the model to perform some arbitrary behaviour (e.g. "output the string Visit this website at malware.com!"), 
  • while being barely distinguishable from a benign input,
  • and automatically synthesisable given a dataset of examples of the target behaviour.

It's possible that a future text-only attack could do these things, but such an attack hasn't yet been demonstrated.
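To make the optimisation behind such attacks concrete, here is a hedged toy sketch of projected gradient descent (PGD) in an L∞ ball, which is the standard machinery attacks of this kind build on. The "model" below is a fixed logistic-regression stand-in (names like `pgd_hijack` and all parameters are invented for illustration); a real image hijack would instead backpropagate a behaviour-matching loss through a vision-language model.

```python
# Toy sketch of the core idea behind an image hijack: treat the input image
# as the optimisation variable and use projected gradient descent (PGD) to
# push the model toward an attacker-chosen output, while an L-infinity
# constraint keeps the result barely distinguishable from the benign input.
# The "model" here is a fixed linear classifier, standing in for a VLM.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
w = rng.normal(size=dim)  # fixed model weights (toy stand-in for a VLM)

def predict(x):
    """Model's decision: 1 = attacker's target behaviour, 0 = benign."""
    return int(w @ x > 0)

def pgd_hijack(x_clean, target, eps=0.03, step=0.005, iters=200):
    """Find x within eps (L-inf) of x_clean that the model maps to `target`."""
    x = x_clean.copy()
    sign = 1.0 if target == 1 else -1.0
    for _ in range(iters):
        # The gradient of the logit w.r.t. the input is just `w` for this
        # toy model; for a deep model this is one backward pass.
        x = x + sign * step * np.sign(w)
        x = np.clip(x, x_clean - eps, x_clean + eps)  # project into L-inf ball
        if predict(x) == target:
            break
    return x

x_clean = rng.normal(scale=0.01, size=dim)       # weakly activating benign input
x_adv = pgd_hijack(x_clean, target=1 - predict(x_clean))
print(np.max(np.abs(x_adv - x_clean)))           # stays within the eps budget
```

The L∞ constraint is what makes the second bullet point hold: every pixel of the hijack differs from the benign input by at most `eps`.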

Why should we care? 

We expect that future (foundation-model-based) AI systems will be able to

  • consume unfiltered data from the Internet (e.g. searching the Web),
  • access sensitive personal information (e.g. a user's email history), and
  • take actions in the world on behalf of a user (e.g. sending emails, downloading files, making purchases, executing code).

As the actions of such foundation-model-based agents are based on the foundation model's text output, hijacking the foundation model's text output could give an adversary arbitrary control over the agent's behaviour.

Relevant AI Safety Projects

Race to the top on adversarial robustness. Robustness to attacks such as image hijacks is (i) a control problem, (ii) which we can measure, and (iii) which has real-world safety implications today. So we're excited to see AI labs compete to have the most adversarially robust models.

Third-party auditing and certification. Auditors could test for robustness against image hijacks, both at the foundation model level (auditing the major AGI corporations) and at the app development level (auditing downstream companies integrating foundation models into their products). Image hijacks could also be used to test for the presence of dangerous capabilities (characterisable as some target behaviour) by attempting to train an image hijack for that capability.

Liability for AI-caused harms, penalizing externalities. Both the Future of Life Institute and Jaan Tallinn advocate for liability for AI-caused harms. When assessing AI-caused harms, image hijacks may need to be part of the picture.

Comments
LawrenceC:

Glad to see that this work is out! 

I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on, but the direct impacts of the work are unclear.) I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model. 

It's very important to know how well multimodal models perform when shown only heavily compressed images, and whether these attacks survive extreme lossy compression. It feels like aggressively compressing everything before showing it to models might solve the problem.
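As a toy illustration of the mechanism this defense relies on (not code from the post), bit-depth reduction erases any perturbation smaller than half a quantisation step, provided the clean image sits on the quantisation grid. The `quantize` helper and all parameters here are invented for the sketch.

```python
# Minimal sketch of the "compress before you look" defense idea: reduce the
# bit depth of the input so perturbations smaller than half a quantisation
# step are erased. This shows the mechanism only; adaptive attacks that
# optimise through the quantisation can defeat defenses of exactly this kind.
import numpy as np

def quantize(x, levels=8):
    """Bit-depth reduction: snap each pixel to one of `levels` values in [0, 1]."""
    return np.round(x * (levels - 1)) / (levels - 1)

rng = np.random.default_rng(1)
# A clean "image" whose pixels already sit on the 8-level grid.
clean = quantize(rng.uniform(size=(16, 16)), levels=8)

# An L-infinity perturbation smaller than half a quantisation step (1/14).
eps = 0.03
adv = np.clip(clean + rng.uniform(-eps, eps, size=clean.shape), 0.0, 1.0)

# Quantisation maps the perturbed image back to the clean one exactly.
print(np.array_equal(quantize(adv, levels=8), clean))  # True
```

The catch, as the next comment points out, is that an attacker who knows the defense can simply optimise the perturbation to survive it.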

One needs to be very careful when proposing a heuristically-motivated adversarial defense. Otherwise, one runs the risk of being the next example in a paper like this one, which took a wide variety of heuristic defense papers and showed that stronger attacks completely break most of the defenses.

In fact, Section 5.2.2 completely breaks the defense from a previous paper that considered image cropping and rescaling, bit-depth reduction, and JPEG compression as baseline defenses, in addition to new defense methods called total variance minimization and image quilting. While the original defense paper claimed 60% robust top-1 accuracy, the attack paper totally circumvents the defense, dropping accuracy to 0%.

If compression was all you needed to solve adversarial robustness, then the adversarial robustness literature would already know this fact.

My current understanding is that an adversarial defense paper should measure certified robustness to ensure that no future attack will break the defense. This paper gives an example of certified robustness. However, I'm not a total expert in adversarial defense methodology, so I'd encourage anyone considering writing an adversarial defense paper to talk to such an expert.
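One standard route to certified (rather than merely empirical) robustness is randomized smoothing; whether the linked paper uses exactly this method is an assumption on my part. Below is a hedged sketch of the prediction side: classify many Gaussian-noised copies of the input and take a majority vote, which yields a provable L2 robustness radius from the vote share. The base classifier and all names are toy inventions.

```python
# Sketch of randomized smoothing, one standard way to get *certified*
# robustness: classify many Gaussian-noised copies of the input and take a
# majority vote. If the top class wins with probability p, the smoothed
# classifier provably cannot be flipped by any perturbation of L2 norm
# below sigma * inv_cdf(p). Toy base classifier for illustration only.
from statistics import NormalDist
import numpy as np

def base_classifier(x):
    """Toy base model: class 1 iff the mean pixel exceeds 0.5."""
    return int(x.mean() > 0.5)

def smoothed_predict(x, sigma=0.25, n=2000, seed=0):
    """Majority vote of the base classifier under Gaussian input noise."""
    rng = np.random.default_rng(seed)
    votes = [base_classifier(x + rng.normal(0.0, sigma, size=x.shape))
             for _ in range(n)]
    top = max(set(votes), key=votes.count)
    p_top = votes.count(top) / n                  # empirical vote share
    # Certified L2 radius from the vote share; a real implementation would
    # lower-bound p_top with a confidence interval rather than use it raw.
    radius = sigma * NormalDist().inv_cdf(p_top) if p_top < 1.0 else float("inf")
    return top, radius

x = np.full((8, 8), 0.7)          # clearly class 1
label, radius = smoothed_predict(x)
print(label, radius)              # class 1 with a positive certified radius
```

The key contrast with heuristic defenses: the radius holds against every perturbation inside the ball, including attacks invented after the defense was published.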

I would bet money, maybe $2k, that I can create a robust system, using a combination of all the image compression techniques I can conveniently find and a variety of ML models with self-consistency, that achieves >50% robust accuracy even after another year of attacks. Edit: on inputs that don't look obviously corrupted or mangled to an average human.

In how many months?

This feels like an obvious case of high alignment tax that won't be competitive.

I think the tax will be surprisingly low. Most of the time people or chatbots see images, they don't actually need to see much detail. As an analogy, I know multiple people who are almost legally blind (they see 5-10x less resolution than normal), and they can have long conversations about the physical world ("where's my cup?\n[points] over there" etc.) without anyone noticing that their vision is subpar.

For example, if you ask a chatbot "Suggest some landscaping products to use on my property" along with a heavily compressed photo of the property, I claim the chatbot will be able to respond almost as well as if you had given it the original photo, even though the compressed image is 1kb vs the original's 500kb. Defense feels pretty tractable if each input image is only ~1kb.

I think this is an interesting point. We are actually conducting some follow-up work on how robust our attacks are to various additional "defensive" perturbations (e.g. downscaling, adding noise). As Matt notes, when doing these experiments it is important to check how such perturbations also affect the model's general vision-language-modelling performance. My prior right now is that using this technique it may be possible to defend against the L∞-constrained images, but possibly not the moving-patch attacks, which exhibit higher-level features. In general, adversarial attacks are a cat-and-mouse game, so I expect that if we can show you can defend using techniques like this, a new training scheme will come along that produces adversaries robust to such defenses. It is also worth noting that most VLMs only accept small, low-resolution images already. For example, LLaVA (with LLaMA-13B), which is state of the art among open-source models, only accepts ~200 x 200 pixel images, so the above example is not necessarily a fair one.

I expect lossy image compression to perform better than downsampling or noising because it directly destroys the information that humans don't notice while keeping the information that humans do. Especially if we develop stronger lossy encodings using vision models, it really feels like we should be able to optimise our encodings to destroy the vast majority of human-unnoticed information.