Zac Hatfield-Dodds — LessWrong

Technical staff at Anthropic (views my own), previously #3ainstitute; interdisciplinary, interested in everything, ongoing PhD in CS, bets tax bullshit, open sourcerer, more at zhd.dev

3736

Ω

465

11

439

8y

Top posts

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Text of post based on our blog post as a linkpost for the full paper which is considerably longer and more detailed.

Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.

Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.

Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a single neuron responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.

In our latest paper, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, we outline evidence that there are better units of analysis than individual neurons, and we have built machinery that lets us find these units in small transformer models. These units, called features, correspond to patterns (linear combinations) of neuron activations. This provides a path to breaking down complex neural networks into parts we can understand, and builds on previous efforts to interpret high-dimensional systems in neuroscience, machine learning, and statistics.

In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew
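To make "features as linear combinations of neuron activations" concrete, here is a minimal sketch of an overcomplete dictionary in plain Python. It is not the paper's code: the function names, the tied encoder/dictionary weights, the ReLU encoder, the bias value, and the toy sizes (8 neurons, 32 features rather than 512 and 4000+) are all assumptions for illustration, and the weights are random rather than trained.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def encode(activation, encoder, bias):
    """Map one vector of neuron activations to non-negative feature activations.

    Each feature is a ReLU applied to a linear combination of the neurons.
    """
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(encoder, bias)
    ]


def decode(features, dictionary):
    """Reconstruct neuron activations as a weighted sum of dictionary directions."""
    n_neurons = len(dictionary[0])
    out = [0.0] * n_neurons
    for f, direction in zip(features, dictionary):
        if f == 0.0:
            continue  # inactive features contribute nothing
        for j in range(n_neurons):
            out[j] += f * direction[j]
    return out


# Toy sizes (assumption); the post describes 512 neurons and more than 4000 features.
rng = random.Random(0)
n_neurons, n_features = 8, 32

# Hypothetical, untrained dictionary: one direction in neuron space per feature.
dictionary = [[rng.gauss(0.0, 1.0) for _ in range(n_neurons)] for _ in range(n_features)]
encoder = dictionary          # tied weights, purely to keep the sketch short
bias = [-1.0] * n_features    # negative bias pushes most features to zero

activation = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
features = encode(activation, encoder, bias)
reconstruction = decode(features, dictionary)
```

In a trained dictionary, the weights would be fit so that `reconstruction` stays close to `activation` while only a few entries of `features` are non-zero for any given input; the random weights here only show the shapes and the data flow.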

Claude's new constitution

Anthropic's Core Views on AI Safety

Simple probes can catch sleeper agents

Apr 23, 2024•131

Claude's new constitution

Read the constitution. Previously: 'soul document' discussion here; the new constitution contains almost all of the 'soul document' content, but is >2x longer with a lot of new additions. (Zac and Drake work at Anthropic but are just sharing the linkpost and weren't heavily involved in writing this document.) We're...

Anthropic's updated Responsible Scaling Policy

Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train...

Oct 15, 2024•38

Anthropic: Reflections on our Responsible Scaling Policy

Last September we published our first Responsible Scaling Policy (RSP) [LW discussion], which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as...

May 20, 2024•25

Simple probes can catch sleeper agents

by Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez, and evhub

This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future. Twitter thread here. Top-level summary: > In this post we present "defection probes": linear classifiers that use...

Apr 23, 2024•131

Third-party testing as a key ingredient of AI policy

(nb: this post is written for anyone interested, not specifically aimed at this forum) We believe that the AI sector needs effective third-party testing for frontier AI systems. Developing a testing regime and associated policy interventions based on the insights of industry, government, and academia is the best way to...

Mar 25, 2024•11

Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy

I hope Dario's remarks to the Summit can shed some light on how we think about RSPs in general and Anthropic's RSP in particular, both of which have been discussed extensively since I shared our RSP announcement. The full text of Dario's remarks follows: Before I get into Anthropic’s Responsible...

Nov 1, 2023•80

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Text of post based on our blog post as a linkpost for the full paper which is considerably longer and more detailed. Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs...

Oct 5, 2023•289
