Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small
To learn more about this work, check out the paper. We assume general familiarity with transformer circuits.

Intro

There isn't much interpretability work that explains end-to-end how a model is able to do some task (except for toy models). In this work, we make progress towards this goal by understanding some of the structure of GPT-2 small "in the wild", studying how it computes a simple natural language task. The task we investigate is what we call indirect object identification (IOI), where sentences like "When John and Mary went to the store, John gave a drink to" should be completed with "Mary" as opposed to "John" (see the short code sketch at the end of this intro for a concrete version of this prediction). We discovered a circuit of 26 attention heads grouped into 7 main classes; to our knowledge, this is the largest end-to-end attempt to reverse engineer how an LM computes a natural behavior. There is still much missing from our explanation, however, and it does not go down to the parameter level.

[Figure: The circuit we identified performing indirect object identification in GPT-2 small]

Besides discovering the particular circuit shown above, we gained some interesting insights about low-level phenomena arising inside language models. For example, we found attention heads communicating with pointers (sharing the location of a piece of information instead of copying it). We also identified heads compensating for the loss of function of other heads, and heads contributing negatively to the correct next-token prediction. We're excited to see whether these discoveries generalize beyond our case study.

Since explanations of model behavior can be confused or non-rigorous, we used our knowledge of the circuit to design adversarial examples. Moreover, we formulated 3 quantitative criteria to test the validity of our circuit. These criteria partially validate our circuit but indicate that there are still gaps in our understanding.

This post is a companion to our paper, in which we share lessons we learned from doing this work and describe some of Redwood's interpretability research.
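To make the IOI task above concrete, here is a minimal sketch of the kind of measurement involved, using the Hugging Face transformers library and the public "gpt2" checkpoint (GPT-2 small). The ioi_logit_diff helper and the single example prompt are our illustration, not code from the paper, which works with many templated prompts rather than one sentence.

```python
# Minimal sketch of the IOI task on GPT-2 small (assumes `torch` and `transformers`).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def ioi_logit_diff(prompt: str, io_name: str, s_name: str) -> float:
    """Logit for the indirect object (IO) name minus the logit for the
    subject (S) name, measured at the final position of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # In the continuation, names appear with a leading space (" Mary", " John").
    io_id = tokenizer.encode(" " + io_name)[0]
    s_id = tokenizer.encode(" " + s_name)[0]
    return (logits[io_id] - logits[s_id]).item()

prompt = "When John and Mary went to the store, John gave a drink to"
print(ioi_logit_diff(prompt, io_name="Mary", s_name="John"))
# A positive value means the model prefers the indirect object ("Mary").
```

Comparing the two name logits at the final position, rather than looking at accuracy alone, isolates the model's preference between the indirect object and the subject from everything else it might say next; this logit-difference style of measurement is the flavor of metric used throughout the paper.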