LESSWRONG
Wikitags

Jailbreaking (AIs)

This page is a stub.
Posts tagged Jailbreaking (AIs)
(Karma · Title · Authors · Posted · Comments; [Ω] = Alignment Forum crosspost, [Q] = question post)

50 · Role embeddings: making authorship more salient to LLMs [Ω] · Nina Panickssery, Christopher Ackerman · 9mo · 0
37 · Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google [Ω] · ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave, Kellin Pelrine · 8mo · 0
12 · A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More · Sharat Jacob Jacob · 1y · 0
9 · Interpreting the effects of Jailbreak Prompts in LLMs · Harsh Raj · 1y · 0
1 · Jailbreaking Claude 4 and Other Frontier Language Models · James Sullivan · 4mo · 0
18 · Detecting out of distribution text with surprisal and entropy · Sandy Fraser · 9mo · 4
17 · Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog) · Archimedes · 8mo · 1
16 · Intriguing Properties of gpt-oss Jailbreaks · zroe1, Jack Sanderson · 2mo · 0
10 · Using hex to get murder advice from GPT-4o [Q] · Laurence Freeman · 1y · 5
9 · Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails · Devina Jain · 8mo · 0
7 · Breaking GPT-OSS: A brief investigation · michaelwaves · 1mo · 0
4 · Jailbreaking ChatGPT and Claude using Web API Context Injection · Jaehyuk Lim · 1y · 0
3 · Philosophical Jailbreaks: Demo of LLM Nihilism · artkpv · 5mo · 0