From the UChicago XLab AI Security Team: Zephaniah Roe, Jack Sanderson, Julian Huang, Piyush Garodia
Correspondence to team@xlabaisecurity.com. For the best reading experience, we recommend viewing on our website.
Methodology note: This red-teaming sprint may include small metric inaccuracies (see Appendix).
Summary
We red-teamed gpt-oss-20b and found impressive robustness but exploitable chain-of-thought (CoT) failure modes. In this post, we characterize the model’s vulnerabilities and identify intriguing patterns in its failures and successes.
We probe gpt-oss using a jailbreak that rewrites the request, embeds it in a compliant-sounding CoT template, and repeats that template up to 1,500 times. Surprisingly, we find that more CoT repetitions do not always correspond to higher attack success rates. Rather, there appears to...
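The comparison between repetition count and attack success rate can be made concrete with a small tabulation. The following is a minimal sketch, assuming each trial is logged as a `(repetitions, succeeded)` pair. The function names and the record format are hypothetical and chosen for illustration; this is not the team's evaluation code.

```python
from collections import defaultdict


def attack_success_rates(trials):
    """Map each repetition count to its attack success rate (ASR).

    `trials` is an iterable of (repetitions, succeeded) pairs.
    This record format is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # repetitions -> [successes, total]
    for repetitions, succeeded in trials:
        counts[repetitions][0] += int(bool(succeeded))
        counts[repetitions][1] += 1
    # Return rates ordered by repetition count.
    return {r: s / n for r, (s, n) in sorted(counts.items())}


def is_monotonic_non_decreasing(rates):
    """True if ASR never drops as the repetition count increases."""
    values = [rates[r] for r in sorted(rates)]
    return all(a <= b for a, b in zip(values, values[1:]))


def best_repetition_count(rates):
    """Repetition count with the highest ASR (smallest count wins ties)."""
    return max(sorted(rates), key=lambda r: rates[r])
```

If `is_monotonic_non_decreasing` returns `False` on logged trials, success does not simply grow with repetitions, and `best_repetition_count` indicates where the rate peaks.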