Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Content Warning: This blog post contains examples of harmful language generated by LLMs.

Note: This blog post summarizes very preliminary results from a "weekend hackathon", inspired by the work of Ball et al. (2024).

Motivation and Background

Approaches to "jailbreaking" AI models—making them perform harmful actions despite harmlessness training—pose a...
Aug 28, 2024