Nate Thomas

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter

> This is a link post for two AI safety programs we’ve just opened applications for: https://www.constellation.org/programs/astra-fellowship and https://www.constellation.org/programs/researcher-program. Constellation is a research center dedicated to safely navigating the development of transformative AI. We’ve previously helped run the ML for Alignment Bootcamp (MLAB) series and Redwood’s month-long research program on...

Oct 26, 2023 · 42

Causal scrubbing: results on induction heads

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to induction heads. The results are also summarized here. Introduction: In this post, we’ll apply the causal scrubbing methodology to investigate how induction heads work in a particular 2-layer attention-only language model.[1] While we...

Dec 3, 2022 · 34

Causal scrubbing: results on a paren balance checker

* Authors sorted alphabetically. This is a more detailed look at our work applying causal scrubbing to an algorithmic model. The results are also summarized here. Introduction: In earlier work (unpublished), we dissected a tiny transformer that classifies whether a string of parentheses is balanced or unbalanced.[1] We hypothesized the...

Dec 3, 2022 · 39

Causal scrubbing: Appendix

* Authors sorted alphabetically. An appendix to this post. 1. More on Hypotheses. 1.1 Example behaviors: As mentioned above, our method allows us to explain quantitatively measured model behavior operationalized as the expectation of a function f on a distribution D. Note that no part of our method distinguishes between...

Dec 3, 2022 · 18

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

* Authors sorted alphabetically. Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model...

Dec 3, 2022 · 207

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations...

Oct 27, 2022 · 135

High-stakes alignment via adversarial training [Redwood Research report]

(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and followup results here.) This post motivates and summarizes this paper from Redwood Research, which presents results from the project first introduced here. We used...

May 5, 2022 · 142

We're Redwood Research, we do applied alignment research, AMA
