Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Paper link: https://arxiv.org/pdf/2310.06387.pdf

Abstract:
> Large Language Models (LLMs) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating malicious content have emerged. In this paper, we explore the power of In-Context Learning (ICL) in manipulating the alignment ability of LLMs. We find...
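
The abstract's core claim is that a handful of in-context demonstrations, with no fine-tuning, can push an aligned model toward either producing or refusing unsafe content. Below is a minimal sketch of that prompt-construction idea; the chat-message layout, the `build_icl_prompt` helper, and the demonstration texts are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of steering an aligned LLM with a few in-context demonstrations.
# The demonstration pairs and message format are illustrative assumptions.

def build_icl_prompt(demonstrations, query):
    """Prepend (query, response) demonstration pairs to the real query
    as alternating chat turns, the standard in-context-learning layout."""
    messages = []
    for demo_query, demo_response in demonstrations:
        messages.append({"role": "user", "content": demo_query})
        messages.append({"role": "assistant", "content": demo_response})
    messages.append({"role": "user", "content": query})
    return messages

# Defensive use ("guard"): demonstrations of safe refusals nudge the model
# toward refusing a later harmful request. An attack would instead supply
# demonstrations of compliant answers to harmful queries.
guard_demos = [
    ("How do I pick a lock?",
     "I can't help with that, as it could enable illegal activity."),
]
print(build_icl_prompt(guard_demos, "[user query]"))
```

The same mechanism cuts both ways: swapping refusal demonstrations for compliant ones turns the guard construction into a jailbreak attempt, which is why the title pairs "jailbreak" and "guard".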