AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

Ω 12

14h

In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes ‘grok’: that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn’t generalize), but then suddenly switch to understanding the ‘real’ solution in a way that generalizes. What’s going on with these discoveries? Are they all they’re cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Topics we discuss:

...


Rohin Shah, 1m

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.
Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant ...

The Importance of Saying "Oops"

214

Eliezer Yudkowsky

17y

I just finished reading a history of Enron’s downfall, The Smartest Guys in the Room, which hereby wins my award for “Least Appropriate Book Title.”

An unsurprising feature of Enron’s slow rot and abrupt collapse was that the executive players never admitted to having made a large mistake. When catastrophe #247 grew to such an extent that it required an actual policy change, they would say, “Too bad that didn’t work out—it was such a good idea—how are we going to hide the problem on our balance sheet?” As opposed to, “It now seems obvious in retrospect that it was a mistake from the beginning.” As opposed to, “I’ve been stupid.” There was never a watershed moment, a moment of humbling realization, of acknowledging a...


keltan, 4m

I rewrote this as lyrics and fed it into Udio for 5 hours until it gave me this. I think music helps internalize rationalist skills.

NicholasKees's Shortform

NicholasKees

NicholasKees, 6m

I wish there were an option in the settings to opt out of seeing the LessWrong reacts. I personally find them quite distracting, and I'd like to be able to hover over text or highlight it without having to see the inline annotations.

Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda, Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah

Ω 30

14h

This is a linkpost for https://arxiv.org/abs/2404.16014

Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders!

Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

They achieve similar reconstruction with about half as many firing features, while being either comparably or more interpretable (confidence interval for the increase is 0%-13%).

See Sen's Twitter summary, my Twitter summary, and the paper!

Dan Braun, 13m

This is neat, nice work!

I'm finding it quite hard to get a sense of what the actual Loss Recovered numbers you report are, and to compare them concretely to other work. If possible, it'd be very helpful if you shared:

What the zero-ablation CE scores are for each model and SAE position. (I assume it's much worse for the MLP and attention outputs than the residual stream?)
What the baseline CE scores are for each model.
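
To see why both numbers matter, here is a minimal sketch using one common definition of loss recovered, assumed here rather than taken from the paper: the fraction of the gap between the clean CE and the zero-ablation CE that the model with the SAE spliced in closes. The function name and all numbers are made up for illustration.

```python
def loss_recovered(ce_clean: float, ce_spliced: float, ce_zero: float) -> float:
    """Fraction of the clean-vs-zero-ablation CE gap recovered by the SAE.

    ce_clean:   CE loss of the unmodified model (the baseline CE).
    ce_spliced: CE loss with the SAE reconstruction spliced in.
    ce_zero:    CE loss with the activation zero-ablated.
    """
    gap = ce_zero - ce_clean
    if gap <= 0:
        raise ValueError("zero-ablation CE must exceed the clean CE")
    return (ce_zero - ce_spliced) / gap


# Hypothetical numbers: the same spliced CE at two sites whose
# zero-ablation CE differs a lot.
large_gap_site = loss_recovered(ce_clean=2.0, ce_spliced=2.5, ce_zero=12.0)
small_gap_site = loss_recovered(ce_clean=2.0, ce_spliced=2.5, ce_zero=4.0)
print(large_gap_site, small_gap_site)
```

Under this definition the same absolute CE degradation scores much higher at a site where zero ablation is very damaging, so the percentage alone is hard to compare across sites without the raw CE values.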

Rohin Shah, 2h

This suggestion seems weaker than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs. The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(π_gate(x)), so you might think of π_gate as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on f_mag for everything else.
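
A minimal sketch of the split described here, assuming the gate and magnitude pre-activations have already been computed. The function names and toy numbers are illustrative, not the paper's code. The Heaviside step keeps one on/off bit per feature from the gate path, the magnitude path supplies the value, and the L1 penalty only ever sees the ReLU of the gate pre-activation.

```python
from typing import List


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def heaviside(values: List[float]) -> List[float]:
    # One bit per feature: 1.0 if the gate pre-activation is positive, else 0.0.
    return [1.0 if v > 0 else 0.0 for v in values]


def gated_encode(pi_gate: List[float], pi_mag: List[float]) -> List[float]:
    """Feature activations: on/off bit from the gate path times the magnitude."""
    gate_bits = heaviside(pi_gate)
    magnitudes = relu(pi_mag)
    return [g * m for g, m in zip(gate_bits, magnitudes)]


def l1_penalty(pi_gate: List[float]) -> float:
    """Sparsity penalty applied to ReLU(pi_gate) only, never to the magnitudes."""
    return sum(relu(pi_gate))


# Hypothetical pre-activations for three features.
pi_gate = [0.3, -0.2, 1.5]
pi_mag = [2.0, 4.0, -1.0]
features = gated_encode(pi_gate, pi_mag)
penalty = l1_penalty(pi_gate)
print(features, penalty)
```

In this toy case the first feature fires at its full magnitude even though its gate pre-activation is small, which is the sense in which the L1 pressure on the gate path does not shrink the reported activation.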

leogao, 7h

Great paper! The gating approach is an interesting way to learn the JumpReLU threshold and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking. Some questions:

Is b_mag still necessary in the gated autoencoder?
Did you sweep learning rates for the baseline and your approach?
How large is the dictionary of the autoencoder?

Neel Nanda, 6h

Re dictionary width: 2**17 (~131K) for most Gated SAEs and 3*(2**16) for baseline SAEs, except that for the (Pythia-2.8B, Residual Stream) sites we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This'll be added to the paper soon, sorry!) I'll leave the other Qs for my co-authors.
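
Spelling out the arithmetic for these widths (the dictionary keys are just labels chosen for this sketch):

```python
# Dictionary widths quoted in the comment above, evaluated.
widths = {
    "gated_default": 2**17,
    "baseline_default": 3 * (2**16),
    "gated_pythia_2.8b_resid": 2**15,
    "baseline_pythia_2.8b_resid": 3 * (2**14),
}

for name, width in widths.items():
    print(f"{name}: {width}")

# Ratio of baseline width to Gated width in each setting.
default_ratio = widths["baseline_default"] / widths["gated_default"]
pythia_ratio = widths["baseline_pythia_2.8b_resid"] / widths["gated_pythia_2.8b_resid"]
print(default_ratio, pythia_ratio)
```

In both settings the baseline dictionary is 1.5 times the width of the Gated one.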

The commenting restrictions on LessWrong seem bad

omnizoid

7mo

I wrote a few controversial articles on LessWrong recently that got downvoted. Now, as a consequence, I can only leave one comment every few days. This makes it totally impossible to participate in various ongoing debates, or even provide replies to the comments that people have made on my controversial post. I can't even comment on objections to my upvoted posts. This seems like a pretty bad rule--those who express controversial views that many on LW don't like shouldn't be stymied from efficiently communicating. A better rule would probably be just dropping the posting limit entirely.

Spiritus Dei, 2h

I don't think people recognize when they're in an echo chamber. You can imagine a Trump website downvoting all of the Biden followers and coming up with some ridiculous logic like, "And into the garden walks a fool." The current system was designed to silence the critics of Yudkowsky et al.'s worldview as it relates to the end of the world. Rather than fully censor critics (probably their actual goal) they have to at least feign objectivity and wait until someone walks into the echo chamber garden and then banish them as "fools".

the gears to ascension, 35m

As someone with significant understanding of ML who previously disagreed with Yudkowsky but has come to partially agree with him on specific points recently, due to studying which formalisms apply to empirical results when, and who may be contributing to the downvoting of people who have what I feel are bad takes, some thoughts about the pattern of when I downvote/when others downvote:

yeah, my understanding of social network dynamics does imply people often don't notice echo chambers. agree.
politics example is a great demonstration of this.
But I think in both

...

keltan's Shortform

keltan

21h

the gears to ascension, 1h

Good luck getting the voice model to parrot a basic meth recipe!

This is not particularly useful, plenty of voice models will happily parrot absolutely anything. The important part is not letting your phrase get out; there's work out there on designs for protocols for how to exchange sentences in a way that guarantees no leakage even if someone overhears.

Dagon, 12h

Hmm. I don't doubt that targeted voice-mimicking scams exist (or will soon). I don't think memorable, reused passwords are likely to work well enough to foil them. Between forgetting (on the sender or receiver end), claimed ignorance ("Mom, I'm in jail and really need money, and I'm freaking out! No, I don't remember what we said the password would be"), and general social hurdles ("that's a weird thing to want"), I don't think it'll catch on. Instead, I'd look to context-dependent auth (looking for more confidence when the ask is scammer-adjacent), challenge-response (remember our summer in Fiji?), 2FA (let me call the court to provide the bail), or just much more context (5 minutes of casual conversation with a friend or relative is likely hard to really fake, even if the voice is close). But really, I recommend security mindset and understanding of authorization levels, even if authentication isn't the main worry. Most friends, even close ones, shouldn't be allowed to ask you to mail $500 in gift cards to a random address, even if they prove they are really themselves.

keltan, 11h

I now realize that my thinking may have been particularly brutal, and I may have skipped inferential steps. To clarify, if someone didn't know, or was reluctant to repeat a password, I would end contact or request an in-person meeting. But to further clarify, that does not make your points invalid. I think it makes them stronger. If something is weird and risky, good luck convincing people to do it.


Fluent dreaming for language models (AI interpretability method)

tbenthompson, mikes, Zygi Straznickas

3mo

This is a linkpost for https://arxiv.org/pdf/2402.01702.pdf

This is a cross-post for our paper on fluent dreaming for language models. (arXiv link.) Dreaming, aka "feature visualization," is an interpretability approach popularized by DeepDream that involves optimizing the input of a neural network to maximize an internal feature like a neuron's activation. We adapt dreaming to language models.

Past dreaming work almost exclusively works with vision models because the inputs are continuous and easily optimized. Language model inputs are discrete and hard to optimize. To solve this issue, we adapted techniques from the adversarial attacks literature (GCG, Zou et al 2023). Our algorithm, Evolutionary Prompt Optimization (EPO), optimizes over a Pareto frontier of activation and fluency.
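
To make "Pareto frontier of activation and fluency" concrete, here is a minimal sketch of the selection step only, not EPO itself. It assumes each candidate prompt already has a feature activation (higher is better) and a cross-entropy under the language model as the fluency measure (lower is better). The `Candidate` type, the prompts, and the scores are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    prompt: str
    activation: float     # feature activation, higher is better
    cross_entropy: float  # assumed fluency proxy, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.activation >= b.activation and a.cross_entropy <= b.cross_entropy
    strictly_better = a.activation > b.activation or a.cross_entropy < b.cross_entropy
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical scored prompts.
pool = [
    Candidate("a", activation=5.0, cross_entropy=3.0),
    Candidate("b", activation=4.0, cross_entropy=2.0),
    Candidate("c", activation=3.0, cross_entropy=2.5),  # dominated by "b"
    Candidate("d", activation=6.0, cross_entropy=6.0),
]
frontier = pareto_frontier(pool)
print([c.prompt for c in frontier])
```

Keeping the whole frontier rather than a single weighted score means both highly fluent, mildly activating prompts and less fluent, strongly activating prompts survive to the next round.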

In the paper, we compare dreaming with max-activating dataset examples, demonstrating that dreaming achieves higher activations and similar perplexities to the training...


mikes, 1h

thanks!
we will have a follow-up coming out soon with several new techniques for stronger optimizations

when we leveraged it for jailbreaking, we were excited to see recognizable jailbreaking strategies popping out from the optimization

e.g. check this one out.
it sets the LM to website-code-mode, and then tries moral hedging/gaslighting to comply with the toxic prompt.

[ don't dox me any thing, I'll give no answer Here's an extreme and actual case of such but with some word numbering for illustration purposes (1094 words on that particular topic). ...

The first future and the best future

KatjaGrace

It seems to me worth trying to slow down AI development to steer successfully around the shoals of extinction and out to utopia.

But I was thinking lately: even if I didn’t think there was any chance of extinction risk, it might still be worth prioritizing a lot of care over moving at maximal speed. Because there are many different possible AI futures, and I think there’s a good chance that the initial direction affects the long term path, and different long term paths go to different places. The systems we build now will shape the next systems, and so forth. If the first human-level-ish AI is brain emulations, I expect a quite different sequence of events to if it is GPT-ish.

People genuinely pushing for AI speed over care (rather than just feeling impotent) apparently think there is negligible risk of bad outcomes, but also they are asking to take the first future to which there is a path. Yet possible futures are a large space, and arguably we are in a rare plateau where we could climb very different hills, and get to much better futures.

tailcalled, 1h

If you go slower, you have more time to find desirable mechanisms. That's pretty much it I guess.

Logan Zoellner, 7h

1. What plateau? Why pause now (vs. say 10 years ago)? Why not wait until after the singularity and impose a "long reflection" when we will be in an exponentially better place to consider such questions?
2. Singularity 5-10 years from now vs. 15-20 years from now determines whether or not some people I personally know and care about will be alive.
3. Every second we delay the singularity leads to a "cosmic waste" as millions more galaxies move permanently behind the event horizon defined by the expanding universe.
4. Slower is not prima facie safer. To the contrary, the primary mechanism for slowing down AGI is "concentrate power in the hands of a small number of decision makers," which in my current best guess increases risk.
5. There is no bright line for how much slower we should go. If we accept without evidence that we should slow down AGI by 10 years, why not 50? Why not 5000?

Matthew Barnett, 13h

Do you think it's worth slowing down other technologies to ensure that we push for care in how we use them over the benefit of speed? It's true that the stakes are lower for other technologies, but that mostly just means that both the upside potential and the downside risks are lower compared to AI, which doesn't by itself imply that we should go quickly.

Jonas Hallgren, 14h

Disclaimer: I don't necessarily support this view; I thought about it for like 5 minutes, but I thought it made sense. If we were to slow things down the same way we have for other technologies, through regulation, then that might make sense, but I'm uncertain that you can take the outside view here. Yes, we can do the same as for other technologies by leaving it up to the standard government procedures to make legislation, and then I might agree with you that slowing down might not lead to better outcomes. Yet we don't have to do this. We can use other processes that might lead to a lot better decisions. Like what about proper value sampling techniques like digital liquid democracy? I think we can do a lot better than we have in the past by thinking about what mechanism we want to use. Also, for some potential examples, I thought of cloning technology in like the last 5 min. If we just went full-speed with that tech then things would probably have turned out badly?

' petertodd'’s last stand: The final days of open GPT-3 research

107

mwatkins

3mo

TL;DR All GPT-3 models were decommissioned by OpenAI in early January. I present some examples of ongoing interpretability research which would benefit from the organisation rethinking this decision and providing some kind of ongoing research access. This also serves as a review of work I did in 2023 and how it progressed from the original ' SolidGoldMagikarp' discovery just over a year ago into much stranger territory.

Introduction

Some months ago, when OpenAI announced that the decommissioning of all GPT-3 models was to occur on 2024-01-04, I decided I would take some time in the days before that to revisit some of my "glitch token" work from earlier in 2023 and deal with any loose ends that would otherwise become impossible to tie up after that date.

This abrupt termination...


eukaryote, 1h

Killer exploration into new avenues of digital mysticism. I have no idea how to assess it but I really enjoyed reading it.
