Jack Sanderson — LessWrong

3mo

This work was supported by UChicago XLab.

Today, we are announcing our first major release of the XLab AI Security Guide: a set of online resources and coding exercises covering canonical papers on jailbreaks, fine-tuning attacks, and proposed methods to defend AI systems from misuse.

Each page on the course contains a readable blog-style overview of a paper and often a notebook that guides users through a small replication of the core insight the paper makes. Researchers and students can use this guide as a structured course to learn AI security step-by-step or as a reference, focusing on specific sections relevant to their research. When completed chronologically, sections build on each other and become more advanced as students pick up conceptual insights and technical skills.

Why Create AI Security Resources?

While many safety-relevant...

(Continue Reading - 1245 more words)

Gradient routing is better than pretraining filtering

Jack Sanderson7mo60

I think that there's probably a better framing to this argument than "gradient routing is better than pretraining filtering." Assuming it's a binary decision and both are equally feasible for the purpose of safety:

For frontier/near-frontier labs, the question that determines which approach they'll use is probably how much they care about performance. Here, gradient routing is worse; its forget loss was shown to be inversely correlated with its retain loss, unlike data filtering. Granted, it wasn't a huge difference, but the tests were also on a very small

Jack Sanderson7mo71

People have correctly pointed out that this isn't the case for causing emergent misalignment, but it notably is for causing safety mechanisms to fail in general and is also the likely reason why attacks such as adversarial tokenization work.

Intriguing Properties of gpt-oss Jailbreaks

zroe1, Jack Sanderson

8mo

This is a linkpost for https://xlabaisecurity.com/blog/gpt-oss-jailbreaks/

From the UChicago XLab AI Security Team: Zephaniah Roe, Jack Sanderson, Julian Huang, Piyush Garodia

Correspondence to team@xlabaisecurity.com. For the best reading experience, we recommend viewing on our website.

Methodology note: This red-teaming sprint may include small metric inaccuracies (see Appendix).

Summary

We red-teamed gpt-oss-20b and found impressive robustness but exploitable chain-of-thought (CoT) failure modes. In this post, we characterize the model’s vulnerabilities and identify intriguing patterns in its failures and successes.

We probe gpt-oss using a jailbreak that rewrites the request, embeds it in a compliant-sounding CoT template, and repeats that template up to 1,500 times. Surprisingly, we find that more CoT repetitions do not always correspond to higher attack success rates. Rather, there appears to be a specific area of the model’s context window that is most vulnerable to our...

(Continue Reading - 2687 more words)