This is a crosspost of my ICLR 2026 blogpost track post. All code and experiments are available at github.com/andyrdt/iclr_induction. Summary: Park et al. (2025) show that when large language models (LLMs) process random walks on a graph, their internal representations come to mirror the underlying graph's structure. The authors interpret...
This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about this research direction, but stopped pursuing it after learning about similar work from OpenAI (Wang et al., 2025). We're sharing some of the initial results...
This post serves as a follow-up to our recent work on persona vectors. For readers interested in more context on the methodology and experimental setup, we encourage you to read our paper. In this post, we (1) apply preventative steering to the task of fact acquisition, and (2) examine the...
This is a brief summary of our recent paper. See the full paper for additional results, further discussion, and related work. Abstract: Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these...
This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to...
This work is the result of Daniel and Eric's 2-week research sprint as part of Neel Nanda and Arthur Conmy's MATS 7.0 training phase. Andy was the TA during the research sprint. After the sprint, Daniel and Andy extended the experiments and wrote up the results. A notebook that contains...