Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda, Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah

Ω 233h

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, and while being either comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

4Sam Marks20m

Great work! Obviously the results here speak for themselves, but I especially wanted to complement the authors on the writing. I thought this paper was a pleasure to read, and easily a top 5% exemplar of clear technical writing. Thanks for putting in the effort on that. I'll post a few questions as children to this comment.

Sam Marks3mΩ120

I believe that equation (10) giving the analytical solution to the optimization problem defining the relative reconstruction bias is incorrect. I believe the correct expression should be $γ = E_{x \sim D} [\frac{^x \cdot x}{∥ x ∥_{2}^{2}}]$ .

You could compute this by differentiating equation (9), setting it equal to 0 and solving for $γ$ . But here's a more geometrical argument.

By definition, $γ x$ is the multiple of $x$ closest to $^x$ . Equivalently, this closest such vector can be described as the projection ${p r o j}_{x} (^x) = \frac{^x \cdot x}{∥ x ∥_{2}^{2}} x$ . Setting these equal, we get the... (read more)

2Neel Nanda12m

<3 Thanks so much, that's extremely kind. Credit entirely goes to Sen and Arthur, which is even more impressive given that they somehow took this from a blog post to a paper in a two week sprint! (including re-running all the experiments!!)

Examples of Highly Counterfactual Discoveries?

136

johnswentworth, kromem

The history of science has tons of examples of the same thing being discovered multiple time independently; wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement wikipedia's list of multiple discoveries.

To...

(See More – 189 more words)

cubefox11m10

What's more likely: You being wrong about the obviousness of the sphere Earth theory to sailors, or the entire written record (which included information from people who had extensive access to the sea) of two thousand years of Chinese history and astronomy somehow ommitting the spherical Earth theory? Not to speak of other pre-Hellenistic seafaring cultures which also lack records of having discovered the sphere Earth theory.

4Lucius Bushnaq1h

It's measuring the volume of points in parameter space with loss <ϵ when ϵ is infinitesimal. This is slightly tricky because it doesn't restrict itself to bounded parameter spaces,[1] but you can fix it with a technicality by considering how the volume scales with ϵ instead. In real networks trained with finite amounts of data, you care about the case where ϵ is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss <ϵ, if you can manage that. I still think SLT has some neat insights that helped me deconfuse myself about networks. For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss <ϵ using just the eigenvalues of the Hessian. You can't. At least not in general. 1. ^ Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space

4tailcalled2h

Yes.

2tailcalled2h

Link(s) to your favorite proof(s)? Also, do these match up with empirical results? I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don't really need SLT to inoculate me against that. I'd mainly be interested if SLT shows something beyond that. As I read the empirical formulas in this paper, they're roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained on average have a worse loss than the network. But then so they don't have to retrain the models from scratch, they basically take a trained model, and wiggle it around using Gaussian noise while retraining it. This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there's a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works. ---------------------------------------- I guess I'm still not super sold on it, but on reflection that's probably partly because I don't have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I'm not sure why I'd want/need to study it further. There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it's plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔

Losing Faith In Contrarianism

omnizoid

22m

Crosspost from my blog.

If you spend a lot of time in the blogosphere, you’ll find a great deal of people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard of Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and education doesn’t work, and various other people expressing contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...

(Continue Reading – 1290 more words)

Why I stopped being into basin broadness

tailcalled

28m

There was a period where everyone was really into basin broadness for measuring neural network generalization. This mostly stopped being fashionable, but I'm not sure if there's enough written up on why it didn't do much, so I thought I should give my take for why I stopped finding it attractive. This is probably a repetition of what others have found, but I thought I might as well repeat it.

Let's say we have a neural network $f_{w} (x) : R^{n}$ . We evaluate it on a dataset $(x, y) \sim D$ using a loss function $L (^y, y) : R$ , to find an optimum $w^{*} = arg {min}_{w} E_{(x, y) \sim D} [L (f_{w} (x), y)]$ . Then there was an idea going around that the Hessian matrix (i.e. the second derivative of $E_{(x, y) \sim D} [L (f_{w} (x), y)]$ at $w^{*}$ ) would tell us something about $w^{*}$ (especially about how well it generalizes).

If we number the dataset $(x_{i}, y_{i})$ , we can stack all the network outputs ${^y}_{i} (w) = f_{w} (x_{i})$ which fits...

(See More – 570 more words)

keltan's Shortform

keltan

6keltan9h

A potentially good way to avoid low level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. Doing this, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!

Dagon42m20

Hmm. I don't doubt that targeted voice-mimicking scams exist (or will soon). I don't think memorable, reused passwords are likely to work well enough to foil them. Between forgetting (on the sender or receiver end), claimed ignorance ("Mom, I'm in jail and really need money, and I'm freaking out! No, I don't remember what we said the password would be"), and general social hurdles ("that's a weird thing to want"), I don't think it'll catch on.

Instead, I'd look to context-dependent auth (looking for more confidence when the ask... (read more)

"Why I Write" by George Orwell (1946)

Arjun Panickssery

This is a linkpost for https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/why-i-write/

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.

From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I...

(Continue Reading – 2660 more words)

cousin_it43m40

Orwell is one of my personal heroes, 1984 was a transformative book to me, and I strongly recommend Homage to Catalonia as well.

That said, I'm not sure making theories of art is worth it. Even when great artists do it (Tolkien had a theory of art, and Flannery O'Connor, and almost every artist if you look close enough), it always seems to be the kind of theory which suits that artist and nobody else. Would advice like "good prose is like a windowpane" or "efface your own personality" improve the writing of, say, Hunter S. Thompson? Heck no, his writing is ... (read more)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart

Ω 152d

Abstract

This paper presents $ProLU$ , an alternative to $ReLU$ for the activation function in sparse autoencoders that produces a pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

$ProLU (m_{i}, b_{i}) = {\begin{matrix} m_{i} & if m_{i} + b_{i} > 0 and m_{i} > 0 0 & otherwise \end{matrix}$

The gradient wrt. $b$ is zero, so we generate two candidate classes of $ProLU$ differentiable wrt. $b$ :

${ProLU}_{R e L U}$
- $\frac{\partial^{*} {ProLU}_{R e L U} (m_{i}, b_{i})}{\partial b_{i}} = \frac{\partial {ProLU}_{R e L U} (m_{i}, b_{i})}{\partial m_{i}} = {\begin{matrix} 1 & if m_{i} + b_{i} > 0 and m_{i} > 0 0 & otherwise \end{matrix}$
${ProLU}_{S T E}$
- $\frac{\partial^{*} {ProLU}_{S T E} (m_{i}, b_{i})}{\partial m_{i}} = {\begin{matrix} 1 + m_{i} & if m_{i} > 0 and m_{i} + b_{i} > 0 0 & otherwise \end{matrix}$
- $\frac{\partial^{*} {ProLU}_{S T E} (m_{i}, b_{i})}{\partial b_{i}} = {\begin{matrix} m_{i} & if m_{i} > 0 and m_{i} + b_{i} > 0 0 & otherwise \end{matrix}$

PyTorch Implementation

Introduction

SAE Context and Terminology

S A E (x) = ReLU ((x - b_{d e c}) W_{e n c} + b_{e n c}) W_{d e c} + b_{d e c}

Learnable parameters of a...

(Continue Reading – 1825 more words)

1wuthejeff6h

This is great! We were working on very similar things concurrently at OpenAI but ended up going a slightly different route. A few questions: - What does the distribution of learned biases look like? - For the STE variant, did you find it better to use the STE approximation for the activation gradient, even though the approximation is only needed for the bias?

Glen Taggart45m10

Thank you!

That's super cool you've been doing something similar. I'm curious to see what direction you went in. It seemed like there's a large space of possible things to do along these lines. DeepMind also did a similar but different thing here.

What does the distribution of learned biases look like?

That's a great question, something I didn't note in here is that positive biases have no effect on the output of the SAE -- so, if the biases were to be mostly positive that would suggest this approach is missing something. I saved histograms of the biases duri... (read more)

Bing Chat is blatantly, aggressively misaligned

396

evhub

I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".

Examples below (with new ones added as I find them)....

(See More – 300 more words)

Evan R. Murphy1h20

Thanks, I think you're referring to:

It may still be possible to harness the larger model capabilities without invoking character simulation and these problems, by prompting or fine-tuning the models in some particular careful ways.

There were some ideas proposed in the paper "Conditioning Predictive Models: Risks and Strategies" by Hubinger et al. (2023). But since it was published over a year ago, I'm not sure if anyone has gotten far on investigating those strategies to see which ones could actually work. (I'm not seeing anything like that in the paper's citations.)

eggsyntax's Shortform

eggsyntax

3mo

1eggsyntax7h

Great point. I agree that there are lots of possible futures where that happens. I'm imagining a couple of possible cases where this would matter: 1. Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I'm not too optimistic about this happening, but there's certainly been a lot of increasing AI governance momentum in the last year. 2. Alignment is sufficiently solved that even > AGI systems are under our control. On many alignment approaches, this wouldn't necessarily mean that those systems' preferences were taken into account. I agree entirely. I'm imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals. (not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren't well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they've been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can't. That may be overly pedantic, and I don't feel like I'm articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.

1Ann6h

For the first point, there's also the question of whether 'slightly superhuman' intelligences would actually fit any of our intuitions about ASI or not. There's a bit of an assumption in that we jump headfirst into recursive self-improvement at some point, but if that has diminishing returns, we happen to hit a plateau a bit over human, and it still has notable costs to train, host and run, the impact could still be limited to something not much unlike giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it ... Those are definitely systems that are easier to negotiate with, or even consider as agents in a negotiation. There's also a desire specifically not to build them, which might lead to systems with an architecture that isn't like that, but still implementing sentience in some manner. And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in - it'd be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation. I do think the drive/just a thing it does we're pointing at with 'what the model just does' is distinct from goals as they're traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective that's being optimized for (predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn't want to lose prediction on).

1eggsyntax1h

Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it's difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system, eg just allowing weight updates at inference time would on its own (potentially) give these system long-term memory and something much more similar to a persistent identity than current systems. My quibble is that the trainers are optimizing for an objective, at training time, but the model isn't optimizing for anything, at training or inference time. I feel we're very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.

eggsyntax1h10

the model isn't optimizing for anything, at training or inference time.

One maybe-useful way to point at that is: the model won't try to steer toward outcomes that would let it be more successful at predicting text.

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Abstract

Introduction

SAE Context and Terminology

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA