I'm not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible and something that decision-makers find worth acting on.

My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:

Internals/mech interp researchers will, as they have so far (at least in model organisms), find examples of concerning cognition that will be largely ignored or not fully acted on
Eventually, legible examples of misbehavior will be found, resulting in action or increased scrutiny
This scrutiny will then propagate backwards to finding causes or

Curt Tigges, 1y

How do you deal w/ Super Stimuli?

I use Freedom and Limit on my computer and Stay Focused on my Android phone. The former two allow for a combination of complete blocking during certain time windows and time limits (for any website, even across browsers and even if you open an incognito window). The latter does both for my phone.

I block all social media and content during prime working hours and implement a 30-minute limit outside of that. It works pretty well. I may make it stricter, because I occasionally find myself looking at Twitter, etc. while watching a TV show in the evenings.

I also use BlockTube to get rid of YouTube Shorts entirely from my web browser. They no longer show up in search results or in the menu.

Finally, I recommend the tools here, though I haven't tried all of them: https://liamrosen.com/2023/04/18/modding-social-media-to-win-the-attention-war/

Replying to Something Is Lost When AI Makes Art

Curt Tigges, 1y

I find this argument quite compelling, and this is also why I find the idea of "AI girl/boyfriends" largely uninteresting. Without actual connection to another mind (that has experiences and phenomenal consciousness), any of these things--art, deep conversations about thoughts/feelings, what have you--eventually falls flat. (That includes one-way connection through art).

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda *equal contribution

TL;DR

We are releasing SAEBench, a suite of 8 diverse sparse autoencoder (SAE) evaluations including unsupervised metrics and downstream tasks. Use our codebase to evaluate your own SAEs!
You can compare 200+ SAEs of varying sparsity, dictionary size, architecture, and training time on Neuronpedia.
Think we're missing an eval? We'd love for you to contribute it to our codebase! Email us.

Introduction

Sparse Autoencoders (SAEs) have become one of the most popular tools for AI...

I quite enjoyed reading this. Very evocative.

Welcome to San Francisco.

Stitching SAEs of different sizes

Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges, Neel Nanda

Work done in Neel Nanda’s stream of MATS 6.0. Equal contribution by Bart Bussmann and Patrick Leask. Patrick Leask is concurrently a PhD candidate at Durham University.

TL;DR: When you scale up an SAE, the features in the larger SAE can be categorized into two groups: 1) “novel features” with new information not in the small SAE and 2) “reconstruction features” that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE.
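To make the stitching idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes, purely for illustration, that a feature in the large SAE counts as "novel" when its decoder direction has low cosine similarity to every decoder direction in the small SAE. The threshold of 0.7, the function names, and the toy decoder matrices are all hypothetical, and a real stitch would also need to carry over encoder weights and biases.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_features(small_decoder, large_decoder, threshold=0.7):
    """Split large-SAE feature indices into (novel, reconstruction).

    Assumption: a large-SAE feature is a 'reconstruction' feature if it is
    close (cosine >= threshold) to some small-SAE decoder direction.
    """
    novel, reconstruction = [], []
    for i, feat in enumerate(large_decoder):
        best = max(cosine(feat, s) for s in small_decoder)
        (reconstruction if best >= threshold else novel).append(i)
    return novel, reconstruction

def stitch(small_decoder, large_decoder, threshold=0.7):
    """Return the small decoder extended with the novel large-SAE features."""
    novel, _ = split_features(small_decoder, large_decoder, threshold)
    return small_decoder + [large_decoder[i] for i in novel]

# Toy decoders: each row is one feature direction in a 3-d activation space.
small = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
large = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

novel, reconstruction = split_features(small, large)
stitched = stitch(small, large)
```

In this toy case the first two large-SAE features nearly duplicate the small SAE's directions, so only the third is added to the stitched dictionary.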

Introduction

Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how those features vary with dictionary size, that is, when...

Replying to The Best Tacit Knowledge Videos on Every Subject

Curt Tigges, 2y

Domain: Software engineering, mech interp

Bryce Meyer (primary maintainer of TransformerLens, and software engineer with many years of experience) has a weekly coding stream event where he does live coding on TransformerLens--resolving bugs, adding features and tests, etc. I've found it to be useful!

You can find it in the Open Source Mechanistic Interpretability Slack, under the "code-sessions" channel (feel free to DM for an invite).

Replying to Talent Needs of Technical AI Safety Teams

Curt Tigges, 2y

Great post, but there is one part I'd like to push back on:

Iterators are also easier to identify, both by their resumes and demonstrated skills. If you compare two CVs of postdocs that have spent the same amount of time in academia, and one of them has substantially more papers (or GitHub commits) to their name than the other (controlling for quality), you’ve found the better Iterator. Similarly, if you compare two CodeSignal tests with the same score but different completion times, the one completed more quickly belongs to the stronger Iterator.

This seems like a bit of an over-claim. I would endorse a weaker claim, like "in the presence of a high...

Replying to my note system

Curt Tigges, 2y

Perhaps more important than these details: How do you curate input to take notes on, and what is the purpose you take the notes for? How do you use the notes once written? (This latter point seems to be one of the biggest reasons many people have dropped PKM systems.)

Replying to Dating Roundup #3: Third Time’s the Charm

Curt Tigges, 2y

Very kind of you to say. :) I think for me, though, the source of the emotion I felt when reading this series was something like: "Ah, so in addition to ensuring we are dateable ourselves, we must fix society, capitalism (at least the dating part of it), culture, etc. in order to have a Good Dating Universe." Which in retrospect was a bit overblown of me, so I think I no longer endorse the strong version of what I said in that comment.

Replying to Dating Roundup #3: Third Time’s the Charm

Curt Tigges, 2y*

I think this list may successfully convince some to stay off the dating market indefinitely. Who in the world has time to work on all of this? At best, this is just a massive set of to-dos; at worst, it's an enormous list of all the ways the dating world sucks and reasons why you'll fail.

Upon reflection: This is a good collection of information, even if it is rather discouraging to read. May we all find exceptions to the unfortunate trends that seem to characterize the modern dating landscape.

Replying to Why I no longer identify as transhumanist

Curt Tigges, 2y

I actually went through the same process as what you describe here, but it didn't remove my "transhumanist" label. I was a big fan of Humanity+, excited about human upgrading, etc. etc. I then became disillusioned about progress in the relevant fields, started to understand nonduality and the lack of a persistent or independent self, and realized AI was the only critical thing that actually was in the process of happening.

In that sense, my process was similar but I still consider myself a transhumanist. Why? Because for me, solving death or trying to make progress in the scientific fields that lead to various types of augmentations aren't the biggest or most critical...

Exploratory Analysis of RLHF Transformers with TransformerLens

Curt Tigges

TL;DR: I demonstrate how to use RLHF models trained with the TRLX library with TransformerLens, and how to take an exploratory look at how RLHF changes model internals. I also use activation patching to see which RLHF activations are sufficient to recreate some of the RLHF behavior in the source model. Note that this is simply a preliminary exploratory analysis, and much (much) work remains to be done. I hope to show that doing mechanistic interpretability analysis with RLHF models doesn't need to be intimidating and is quite approachable!
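For readers new to activation patching, the toy sketch below shows the basic move in plain Python. It is a stand-in and not the post's code: the "models" are just lists of hypothetical scalar functions, and `run_layers` is an illustrative helper, not a TransformerLens API. The idea is to cache an intermediate activation from the tuned model, overwrite the same position in the source model's forward pass, and check whether the tuned behavior reappears.

```python
def run_layers(layers, x, patch=None):
    """Run x through each layer in order and cache every layer's output.

    patch, if given, is a (layer_index, activation) pair: the output of that
    layer is overwritten with the supplied activation before continuing.
    """
    cache = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and patch[0] == i:
            x = patch[1]  # overwrite with the activation from the other model
        cache.append(x)
    return x, cache

# Hypothetical two-layer "models" that differ only in their first layer.
source_layers = [lambda x: x + 1.0, lambda x: x * 2.0]
tuned_layers = [lambda x: x + 3.0, lambda x: x * 2.0]

source_out, _ = run_layers(source_layers, 1.0)
tuned_out, tuned_cache = run_layers(tuned_layers, 1.0)

# Patch the tuned model's layer-0 activation into the source model's run.
patched_out, _ = run_layers(source_layers, 1.0, patch=(0, tuned_cache[0]))
```

Here patching layer 0 alone makes the source model reproduce the tuned model's output, which is the sense in which a patched activation is "sufficient" for the behavior. With real transformers the same pattern is usually implemented through forward hooks on named activations.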

Introduction

LLMs trained with RLHF are a prominent paradigm in the current AI landscape, yet not much mechanistic interpretability work has been done on these...
