LESSWRONG
Wikitags

Jailbreaking (AIs)

This page is a stub.
Posts tagged Jailbreaking (AIs)
(Karma · Title · Authors · Posted · Comments; [Ω] = Alignment Forum crosspost, [Q] = question post)

50 · Role embeddings: making authorship more salient to LLMs [Ω] · Nina Panickssery, Christopher Ackerman · 9mo · 0
37 · Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [Ω] · ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, Kellin Pelrine · 8mo · 0
12 · A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More · Sharat Jacob Jacob · 1y · 0
9 · Interpreting the effects of Jailbreak Prompts in LLMs · Harsh Raj · 1y · 0
1 · Jailbreaking Claude 4 and Other Frontier Language Models · James Sullivan · 4mo · 0
18 · Detecting out of distribution text with surprisal and entropy · Sandy Fraser · 9mo · 4
17 · Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog) · Archimedes · 8mo · 1
16 · Intriguing Properties of gpt-oss Jailbreaks · zroe1, Jack Sanderson · 2mo · 0
10 · Using hex to get murder advice from GPT-4o [Q] · Laurence Freeman · 1y · 5
9 · Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails · Devina Jain · 8mo · 0
7 · Breaking GPT-OSS: A brief investigation · michaelwaves · 1mo · 0
4 · Jailbreaking ChatGPT and Claude using Web API Context Injection · Jaehyuk Lim · 1y · 0
3 · Philosophical Jailbreaks: Demo of LLM Nihilism · artkpv · 5mo · 0