Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Content Warning: This blog post contains examples of harmful language generated by LLMs.

Note: This blog post summarizes very preliminary results from a "weekend hackathon", inspired by the work of Ball et al. (2024).

Motivation and Background

Approaches to "jailbreaking" AI models—making them perform harmful actions despite harmlessness training—pose a...
Aug 28, 2024