LessWrong

Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda, Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah

Ω 294h

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being either comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

Senthooran Rajamanoharan3m10

Hey Sam, thanks - you're right. The definition of reconstruction bias is actually the argmin of

$\mathbb{E}\left[\lVert \hat{x} / \gamma' - x \rVert^{2}\right]$

which I'd (incorrectly) rearranged as the expression in the paper. As a result, the optimum is

$\gamma^{-1} = \mathbb{E}[\hat{x} \cdot x] \,/\, \mathbb{E}\left[\lVert \hat{x} \rVert^{2}\right]$

That being said, the derivation we gave was not quite right, as I'd incorrectly substituted the optimised loss rather than the original reconstruction loss, which makes equation (10) incorrect. However the difference between the two is small when the losses are similar, which is probably why we didn't pick this up. Anyway, we plan t... (read more)
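As a sketch of why the argmin takes this form, write $a = 1/\gamma'$ (a substitution introduced here only for convenience) and assume the expectations are finite with $\mathbb{E}[\lVert \hat{x} \rVert^{2}] > 0$. Expanding the square gives a quadratic in $a$:

$\mathbb{E}\left[\lVert a\hat{x} - x \rVert^{2}\right] = a^{2}\,\mathbb{E}\left[\lVert \hat{x} \rVert^{2}\right] - 2a\,\mathbb{E}[\hat{x} \cdot x] + \mathbb{E}\left[\lVert x \rVert^{2}\right]$

Setting the derivative with respect to $a$ to zero:

$2a\,\mathbb{E}\left[\lVert \hat{x} \rVert^{2}\right] - 2\,\mathbb{E}[\hat{x} \cdot x] = 0 \;\Longrightarrow\; a = \mathbb{E}[\hat{x} \cdot x] \,/\, \mathbb{E}\left[\lVert \hat{x} \rVert^{2}\right]$

Since the coefficient of $a^{2}$ is positive, this stationary point is the minimum, and $a = \gamma^{-1}$ matches the optimum stated in the comment.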

2Rohin Shah6m

This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio $\mathbb{E}[\lVert \hat{x} \rVert^{2}] / \mathbb{E}[\lVert x \rVert^{2}]$. Looking at the γ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed $\mathbb{E}[\lVert \hat{x} \rVert^{2}] / \mathbb{E}[\lVert x \rVert^{2}]$ for Gated SAEs, I expect that would be below 1.) I think the main thing we got wrong is that we accidentally treated $\mathbb{E}[\lVert \hat{x} - x \rVert^{2}]$ as though it were $\mathbb{E}[\lVert \hat{x} - \gamma x \rVert^{2}]$. To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually γ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small. We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.
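The distinction drawn in this comment can be illustrated with a small sketch. It assumes the γ formula from the comment above in this thread (γ⁻¹ = E[x̂·x] / E[‖x̂‖²]); the function names and the toy vectors are hypothetical and are not taken from the paper. The "unshrunk" reconstruction here is an orthogonal projection of x, which sits at a nonzero angle to x but has no shrinkage.

```python
def naive_norm_ratio(xs, x_hats):
    """E[||x_hat||^2] / E[||x||^2], estimated from paired samples."""
    num = sum(sum(v * v for v in x_hat) for x_hat in x_hats)
    den = sum(sum(v * v for v in x) for x in xs)
    return num / den


def gamma_estimate(xs, x_hats):
    """gamma = E[||x_hat||^2] / E[x_hat . x] (the sample counts cancel)."""
    sq_norm = sum(sum(v * v for v in x_hat) for x_hat in x_hats)
    dot = sum(sum(a * b for a, b in zip(x_hat, x)) for x, x_hat in zip(xs, x_hats))
    return sq_norm / dot


# Toy inputs (made up for illustration).
xs = [[1.0, 1.0], [2.0, -1.0], [0.5, 3.0]]

# Reconstruction that only captures the first coordinate: imperfect, but unshrunk.
projected = [[x[0], 0.0] for x in xs]

# The same reconstruction, uniformly shrunk by a factor of 0.8.
shrunk = [[0.8 * v for v in x_hat] for x_hat in projected]

print("naive ratio (projected):", naive_norm_ratio(xs, projected))
print("gamma (projected):", gamma_estimate(xs, projected))
print("gamma (shrunk):", gamma_estimate(xs, shrunk))
```

In this toy setup the naive ratio falls below 1 even without shrinkage, while the γ estimate stays at 1 for the projection and drops only when the reconstruction is explicitly scaled down.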

1Arthur Conmy11m

Ah yes, we actually need to adjust the definition, and we'll get back to you rather than risking making subtly wrong claims again.

2Charlie Steiner41m

Nice. I tried to do something similar (except making everything leaky with polynomial tails, so y = (y+torch.sqrt(y**2+scale**2)) * (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) / 4, where the first part (y+torch.sqrt(y**2+scale**2)) is a softplus, and the second part (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) is a leaky cutoff at the value threshold). But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)
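For readers who want to poke at the activation quoted in this comment, here is a minimal scalar sketch. It uses `math` instead of `torch` so it runs without dependencies, and the name `leaky_gate` is hypothetical. Note that, as written, the second factor passes through its midpoint where `y + threshold` is zero.

```python
import math


def leaky_gate(y: float, threshold: float, scale: float) -> float:
    """Scalar transcription of the expression quoted in the comment."""
    # Smooth ramp: close to 2*y for y >> scale, decaying like scale**2 / (2*|y|)
    # for y << -scale (a polynomial rather than exponential tail).
    ramp = y + math.sqrt(y**2 + scale**2)
    # Smooth step from 0 to 2, equal to 1 where y + threshold == 0.
    step = 1 + (y + threshold) / math.sqrt((y + threshold) ** 2 + scale**2)
    # The two factors each saturate at a multiple of 2, hence the division by 4.
    return ramp * step / 4


for y in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(y, leaky_gate(y, threshold=1.0, scale=0.1))
```

Far above the cutoff the function is close to the identity, and far below it the output is small but nonzero, which is the "leaky" behaviour the comment describes.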

Rafael Harth's Shortform

Rafael Harth

Ω 24y

Rafael Harth11m20

Are people in rich countries happier on average than people in poor countries? (According to GPT-4, the academic consensus is that they are, but I'm not sure it's representing it correctly.) If so, why do suicide rates increase (or is that a false positive)? Does the mean of the distribution go up while the tails don't or something?

Chicago Rationality

May Meetup – Social Media and Mental Health

May 4th at 645 South Clark Street, Chicago

shaneebert

The next monthly discussion meetup is Saturday, May 4 @ 2 PM (see below for location etc.). For this meetup, we’ll be discussing the relationship between social media and mental health, with a focus on arguments by psychologist Jonathan Haidt (particularly from his new book) and some criticisms of his interpretation of the evidence. There are readings, a podcast, and a video below, BUT feel free to come whether or not you’ve reviewed any of it, as always. (NOTE: The June topic/discussion meetup will be moved to Saturday, June 8 @ 2 PM)

READINGS ETC.

Background and info from Jonathan Haidt
- The evidence, from the book website: https://www.anxiousgeneration.com/research/the-evidence
- A talk on the material from the book: https://www.youtube.com/watch?v=yVq4ARIlNVg
- A guide to facilitate exploration of ideas and experiences touched on in the book: https://www.anxiousgeneration.com/pdfs/thought-starters-for-gen-z.pdf

...


Thoughts on seed oil

253

dynomight

This is a linkpost for https://dynomight.net/seed-oil/

A friend has spent the last three years hounding me about seed oils. Every time I thought I was safe, he’d wait a couple months and renew his attack:

“When are you going to write about seed oils?”

“Did you know that seed oils are why there’s so much {obesity, heart disease, diabetes, inflammation, cancer, dementia}?”

“Why did you write about {meth, the death penalty, consciousness, nukes, ethylene, abortion, AI, aliens, colonoscopies, Tunnel Man, Bourdieu, Assange} when you could have written about seed oils?”

“Isn’t it time to quit your silly navel-gazing and use your weird obsessive personality to make a dent in the world—by writing about seed oils?”

He’d often send screenshots of people reminding each other that Corn Oil is Murder and that it’s critical that we overturn our lives...


LoganStrohl13m20

i'm sorry not to be engaging with the content of the post here; hopefully others have that covered. but i just wanna say, man this is so well written! at the sentence and paragraph level especially, i find it inspiring. it makes me wanna write more like i'm drunk and dgaf, though i doubt that exact thing would actually suffice to allow me to hit a similar stylistic target.

(the rest of this comment is gonna be largely for me and my own development, but maybe you'll like reading it anyway.)

i think you do a bunch of stuff that current me is too chicken to try... (read more)

1Slapstick8h

I would consider most bread sold in stores to be processed or ultra processed and I think that's a pretty standard view but it's true there might be some confusion. I would consider all of those to be processed and unhealthy and I think that's a pretty standard view, but fair enough if there's some confusion around those things. I guess my view is that it's mostly not hogwash? The least healthy things are clearly and broadly much more processed than the healthiest things.

1Slapstick8h

I typically consume my greens with ground flax seeds in a smoothie. I feel very confident that adding refined oil to vegetables shouldn't be considered healthy, in the sense that the opportunity cost of 1 Tablespoon of olive oil is 120 calories, which is over a pound of spinach for example. Certainly it's difficult to eat that much spinach and it's probably unwise, but I just say that to illustrate that you can get a lot more nutrition from 120 calories than the oil will be adding, even if it makes the greens more bioavailable. That said, "healthy" is a complicated concept. If adding some oil to greens helps someone eat greens they otherwise wouldn't eat for example, that's great.

1Ann8h

Raw spinach in particular also has high levels of oxalic acid, which can interfere with the absorption of other nutrients, and cause kidney stones when binding with calcium. Processing it by cooking can reduce its concentration and impact significantly without reducing other nutrients in the spinach as much. Grinding and blending foods is itself processing. I don't know what impact it has on nutrition, but mechanically speaking, you can imagine digestion proceeding differently depending on how much of it has already been done. You do need a certain amount of macronutrients each day, and some from fat. You also don't necessarily want to overindulge on every micronutrient. If we're putting a number of olives in our salad equivalent to the amount of olive oil we'd otherwise use, we'll say 100 4g olives, that we've lowered the sodium from by some means to keep that reasonable ... that's 72% of recommended daily value of our iron and 32% of our calcium. We just mentioned that spinach + calcium can be a problem; and the pound of spinach itself contains 67% of iron and 45% of our calcium. ... That's also 460 calories worth of olives. I'm not sure if we've balanced our salad optimally here. Admittedly, if I'm throwing this many olives in with this much spinach in the first place, I'm probably going to cook the spinach, throw in some pesto and grains or grain products, and then I've just added more olive oil back in again ... ;) And yeah, greens with oil might taste better or be easier to eat than greens just with fatty additions like nuts, seeds, meat, or eggs.

Losing Faith In Contrarianism

omnizoid

Crosspost from my blog.

If you spend a lot of time in the blogosphere, you’ll find a great many people expressing contrarian views. If you hang out in the circles that I do, you’ll probably have heard Yudkowsky say that dieting doesn’t really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn’t improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and education doesn’t work, and various other people expressing contrarian views. Often, very smart people—like Robin Hanson—will write long posts defending these views, other people will have criticisms, and it will all be such a tangled mess that you don’t really know what to think about them.

For...


mako yass15m20

It may be useful to write about how a consumer can distinguish contrarian takes from original insights. Until that's a common skill, there will remain a market for contrarians.

2romeostevensit1h

I think binary examples are deceptive in the reversed stupidity is not intelligence sense. Thinking through things from first principles is most important in areas that are new or rapidly changing where there are fewer reference classes and experts to talk to. It's also helpful for areas where the consensus view is optimized for someone very unlike you.

2Dagon2h

I tend to read most of the high-profile contrarians with a charitable (or perhaps condescending) presumption that they're exaggerating for effect. They may say something in a forceful tone and imply that it's completely obvious and irrefutable, but that's rhetoric rather than truth. In fact, if they're saying "the mainstream and common belief should move some amount toward this idea", I tend to agree with a lot of it (not all - there's a large streak of "contrarian success on some topics causes very strong pressure toward more contrarianism" involved).

LLMs seem (relatively) safe

JustisMills

This is a linkpost for https://justismills.substack.com/p/llms-seem-relatively-safe

Post for a somewhat more general audience than the modal LessWrong reader, but gets at my actual thoughts on the topic.

In 2018 OpenAI defeated the world champions of Dota 2, a major esports game. This was hot on the heels of DeepMind’s AlphaGo performance against Lee Sedol in 2016, achieving superhuman Go performance way before anyone thought that might happen. AI benchmarks were being cleared at a pace which felt breathtaking at the time, papers were proudly published, and ML tools like TensorFlow (released in 2015) were coming online. To people already interested in AI, it was an exciting era. To everyone else, the world was unchanged.

Now Saturday Night Live sketches use sober discussions of AI risk as the backdrop for their actual jokes, there are hundreds...



"Why I Write" by George Orwell (1946)

Arjun Panickssery

This is a linkpost for https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/why-i-write/

People have been posting great essays so that they're "fed through the standard LessWrong algorithm." This essay is in the public domain in the UK but not the US.

From a very early age, perhaps the age of five or six, I knew that when I grew up I should be a writer. Between the ages of about seventeen and twenty-four I tried to abandon this idea, but I did so with the consciousness that I was outraging my true nature and that sooner or later I should have to settle down and write books.

I was the middle child of three, but there was a gap of five years on either side, and I barely saw my father before I was eight. For this and other reasons I...


6cousin_it2h

Orwell is one of my personal heroes, 1984 was a transformative book to me, and I strongly recommend Homage to Catalonia as well. That said, I'm not sure making theories of art is worth it. Even when great artists do it (Tolkien had a theory of art, and Oscar Wilde, and Flannery O'Connor, and almost every artist if you look close enough), it always seems to be the kind of theory which suits that artist and nobody else. Would advice like "good prose is like a windowpane" or "efface your own personality" improve the writing of, say, Hunter S. Thompson? Heck no, his writing is the opposite of that and charming for it! Maybe the only possible advice to an artist is to follow their talent, and advising anything more specific is as likely to hinder as help.

Viliam1h20

The theories are probably just rationalizations anyway.

keltan's Shortform

keltan

11h

6keltan11h

A potentially good way to avoid low level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange. An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. Doing this, criminals with little technical expertise will have a harder time bypassing corporate language filters. Good luck getting the voice model to parrot a basic meth recipe!

3Dagon2h

Hmm. I don't doubt that targeted voice-mimicking scams exist (or will soon). I don't think memorable, reused passwords are likely to work well enough to foil them. Between forgetting (on the sender or receiver end), claimed ignorance ("Mom, I'm in jail and really need money, and I'm freaking out! No, I don't remember what we said the password would be"), and general social hurdles ("that's a weird thing to want"), I don't think it'll catch on. Instead, I'd look to context-dependent auth (looking for more confidence when the ask is scammer-adjacent), challenge-response (remember our summer in Fiji?), 2FA (let me call the court to provide the bail), or just much more context (5 minutes of casual conversation with a friend or relative is likely hard to really fake, even if the voice is close). But really, I recommend security mindset and understanding of authorization levels, even if authentication isn't the main worry. Most friends, even close ones, shouldn't be allowed to ask you to mail $500 in gift cards to a random address, even if they prove they are really themselves.

keltan1h10

I now realize that my thinking may have been particularly brutal, and I may have skipped inferential steps.

To clarify, if someone didn't know, or was reluctant to repeat a password, I would end contact or request an in-person meeting.

But to further clarify, that does not make your points invalid. I think it makes them stronger. If something is weird and risky, good luck convincing people to do it.

Simple probes can catch sleeper agents

112

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez, evhub

Ω 662d

This is a linkpost for https://www.anthropic.com/research/probes-catch-sleeper-agents

This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.

Twitter thread here.

Top-level summary:

In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and

...
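As a rough illustration of the kind of detector the summary describes, the sketch below builds a linear probe from one contrast pair and scores activations by projecting onto it. It assumes the probe direction is simply the difference between the activations on the two halves of the contrast pair; the excerpt here does not spell that out, so treat it as an assumption. The three-dimensional vectors are made-up toy data rather than real residual stream activations, and all function names are hypothetical.

```python
def probe_direction(act_yes, act_no):
    """Assumed construction: difference of the two contrast-pair activations."""
    return [a - b for a, b in zip(act_yes, act_no)]


def probe_score(activation, direction):
    """Linear detector: project an activation onto the probe direction."""
    return sum(a * d for a, d in zip(activation, direction))


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy activations for the contrast pair ("... Assistant: yes" vs "... Assistant: no").
direction = probe_direction([1.0, 0.2, 0.0], [-1.0, 0.1, 0.0])

# Toy activations on prompts where the model does / does not defect.
defect_acts = [[0.9, 0.0, 0.3], [0.7, -0.2, 0.1]]
benign_acts = [[-0.8, 0.1, 0.2], [-0.5, 0.3, -0.1]]

defect_scores = [probe_score(a, direction) for a in defect_acts]
benign_scores = [probe_score(a, direction) for a in benign_acts]
print("AUROC on toy data:", auroc(defect_scores, benign_scores))
```

The toy data is constructed to be cleanly separable, so it says nothing about how well such probes work on real models; it only shows the mechanics of scoring with a fixed direction and summarising the result as an AUROC.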


4Wuschel Schulz11h

Super interesting! In the figure with the caption: Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal. Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels? So, maybe Claude thinks that green is better than blue? Did you ever observe other seemingly non-related questions being good classifiers except for the questions for objective facts discussed in the post? I'd be interested whether there are similarities. It would also be cool to see whether you could train probe-resistant sleeper agents by taking linear separability of activations when being under the trigger condition vs. not being under the trigger condition as part of the loss function. If that would not work, and not being linearly separable heavily trades off against being a capable sleeper agent, I would be way more hopeful towards this kind of method also working for naturally occurring deceptiveness. If it does work, we would have the next toy-model sleeper agent we can try to catch.

Monte M1h52

Thanks for the cool idea about attempting to train probing-resistant sleeper agents!

3evhub4h

It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick which "side" of the unrelated questions like that would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we present for the 1000 random directions, where you can see some there that do quite well also, but not in a systematic way where you can predict in advance how to use them as effective classifiers.


LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA