Navigating the Attackspace

Jonas Kgomo

As Artificial Intelligence (AI) continues its rapid ascent, we are witnessing increasing evidence of well-crafted methods to undermine and exploit AI responses. Carefully crafted inputs can exploit vulnerabilities and lead to harmful or undesired results.

In recent times, we have seen numerous individuals independently uncover failure modes within AI systems, particularly through targeted attacks on language models.

tipping a language model with 'money' to get longer context answers
using Zulu to get harmful responses

By crafting clever prompts and inputs, these individuals have exposed the models' vulnerabilities, causing them to generate offensive content, reveal sensitive information, or behave in ways that deviate from their intended purpose.

To address these concerns and gain a deeper understanding of this evolving threat landscape, we are undertaking two key initiatives:

1. Attack Space

AttackSpace is an open-source curated comprehensive list of LLM security methods and safeguarding techniques.

Located at https://github.com/equiano-institute/attackspace, this open-source repository is dedicated to collecting and documenting known attacks on language models. This comprehensive resource serves as a vital platform for researchers, developers, and the general public to access information on these vulnerabilities. By fostering a collaborative space for information sharing, we aim to accelerate research and development efforts focused on improving the robustness and security of language models. This collection contains work by Viktoria Krakovna. The goal is to have a structured view and characterisation of the latent attack space. We want to model satisfiability of AI attacks. This concept, similar to the P-NP problem, involves determining whether a given set of conditions can be met to successfully launch an attack against a language model. By analyzing the features and conditions of successful attacks, we can develop efficient algorithms for identifying and preventing future attacks.

2. Project Haystack

A suite of red teaming and evaluation frameworks for language models

This project aims to develop Haystack, an open-source platform for red teaming and human feedback on LLMs with crowd-sourcing and automated methods

Goals

build a frontend for Pythia model evaluation https://github.com/EleutherAI/pythia
develop a model evaluation interfaces for red teaming from scratch https://arxiv.org/abs/2306.09442
develop a representational view of interpretability https://arxiv.org/abs/2310.01405
organize open source human feedback data sharing

Standards

There is a need for standards for red teaming models, that can be fault tolerant to censorship https://arxiv.org/abs/2307.10719 and jailbreaks https://arxiv.org/abs/2310.02446

Challenges & Efforts

Despite there being many efforts to red team language models, there aren't any available open source frameworks for client-level user model evaluation and red teaming testing.

White House https://www.whitehouse.gov/ostp/news-updates/2023/08/29/red-teaming-large-language-models-to-identify-novel-ai-risks
UK (Royal Society) https://royalsociety.org/science-events-and-lectures/2023/10/ai-safety-science-redteam
The Trojan Detection Challenge 2023 (LLM Edition) https://trojandetection.ai/

2. Academic Survey:

Complementing the attack space collection is our ongoing academic survey, designed to gather crucial data and insights into the nature and scope of language model attacks. This survey delves into key questions such as:

What types of attacks have successfully exploited language models and how can we characterise these attacks?
What underlying explainable vulnerabilities enable these attacks ?
What potential consequences and risks do these attacks pose?
What mitigation strategies and research directions can address these vulnerabilities?

We encourage you to join us in this endeavor. Contribute your knowledge and expertise to the attack space collection or participate in our academic survey. As we continue to ensure the responsible development and deployment of AI technology, safeguarding both its immense potential and the well-being of society.

LESSWRONG
LW

LESSWRONG
LW

1

Navigating the Attackspace

1

1

1