This is a crosspost of my ICLR 2026 blogpost track post. All code and experiments are available at github.com/andyrdt/iclr_induction. Summary: Park et al. (2025) show that when large language models (LLMs) process random walks on a graph, their internal representations come to mirror the underlying graph's structure. The authors interpret...
This work was conducted in May 2025 as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey. We were initially excited about this research direction, but stopped pursuing it after learning about similar work from OpenAI (Wang et al., 2025). We're sharing some of the initial results...
This post serves as a follow-up to our recent work on persona vectors. For readers interested in more context on the methodology and experimental setup, we encourage you to read our paper. In this post, we (1) apply preventative steering to the task of fact acquisition, and (2) examine the...
This is a brief summary of our recent paper. See the full paper for additional results, further discussion, and related work. Abstract: Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these...
This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to...
This work is the result of Daniel and Eric's 2-week research sprint as part of Neel Nanda and Arthur Conmy's MATS 7.0 training phase. Andy was the TA during the research sprint. After the sprint, Daniel and Andy extended the experiments and wrote up the results. A notebook that contains...