This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:
I ask people not to create top-level comments here, but feel free to reply to comments as you would to a FB post.
In my experience, constant-sum games are considered to provide "maximally unaligned" incentives, and common-payoff games are considered to provide "maximally aligned" incentives. How do we quantitatively interpolate between these two extremes? That is, given an arbitrary payoff table representing a two-player normal-form game (like Prisoner's Dilemma), what extra information do we need in order to produce a real number quantifying agent alignment?
If this question is ill-posed, why is it ill-posed? And if it's not, we should probably understand how to quantify such a basic aspect of multi-agent interactions, if we want to reason about complicated multi-agent situations whose outcomes determine the value of humanity's future. (I started considering this question with Jacob Stavrianos over the last few months, while supervising his SERI project.)
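To make the question concrete, here is one crude candidate metric — my illustration, not something the post proposes: the Pearson correlation between the two players' payoffs across all outcomes of the table. It recovers the two extremes, though it ignores which outcomes the agents' policies actually reach, which is part of why the question is hard.

```python
import numpy as np

# Payoff tables for two-player normal-form games: entry [i][j] = (p1, p2).
# A Prisoner's Dilemma, a constant-sum game (matching pennies),
# and a common-payoff coordination game.
prisoners_dilemma = [[(3, 3), (0, 5)], [(5, 0), (1, 1)]]
matching_pennies  = [[(1, -1), (-1, 1)], [(-1, 1), (1, -1)]]
common_payoff     = [[(2, 2), (0, 0)], [(0, 0), (1, 1)]]

def payoff_correlation(table):
    """Pearson correlation between the players' payoffs over all outcomes.

    -1 for constant-sum games, +1 for common-payoff games, in between
    otherwise. A crude candidate: it weights every cell equally, with no
    regard for the agents' strategies.
    """
    p1 = [a for row in table for (a, _) in row]
    p2 = [b for row in table for (_, b) in row]
    return float(np.corrcoef(p1, p2)[0, 1])

print(payoff_correlation(matching_pennies))    # ≈ -1 (maximally unaligned)
print(payoff_correlation(common_payoff))       # ≈ +1 (maximally aligned)
print(payoff_correlation(prisoners_dilemma))   # strictly between -1 and 0
```

That the Prisoner's Dilemma lands strictly between the extremes is suggestive, but a payoff-only statistic like this cannot be the whole story — which is one way of restating the question of what extra information is needed.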
You’re a deity, tasked with designing a bird brain. You want the bird to get good at singing, as judged by a black-box hardcoded song-assessing algorithm that you already built into the brain last week. The bird chooses actions based in part on within-lifetime reinforcement learning involving dopamine. What reward signal do you use?
Well, we want to train the bird to sing the song correctly. So it’s easy: the bird practices singing, and it listens to its own song using the song-assessing black box, and it does RL using the rule:
The better the song sounds, the higher the reward.
Oh wait. The bird is also deciding how much time to spend practicing singing, versus foraging...
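The failure mode can be made concrete with a toy model — my illustration, with all the parameters made up: the bird repeatedly chooses between practicing and foraging, and the reward rule is exactly "the better the song sounds, the higher the reward", with foraging earning nothing.

```python
import random

def train(steps=1000, lr=0.1, eps=0.1, seed=0):
    """Toy within-lifetime RL: epsilon-greedy choice between two actions,
    with each action's preference updated toward the reward it earns."""
    rng = random.Random(seed)
    prefs = {"practice": 0.0, "forage": 0.0}  # learned action preferences
    quality = 0.0                             # song quality, improves with practice
    for _ in range(steps):
        if rng.random() < eps:
            action = rng.choice(["practice", "forage"])
        else:
            action = max(prefs, key=prefs.get)
        if action == "practice":
            quality = min(1.0, quality + 0.01)
        # The deity's naive rule: reward tracks song quality, nothing else.
        reward = quality if action == "practice" else 0.0
        prefs[action] += lr * (reward - prefs[action])
    return prefs

prefs = train()
print(prefs)  # "practice" dominates; foraging is invisible to this signal
```

Under this reward signal the bird learns to practice essentially all the time: the need to forage never enters the reward, so nothing pushes back. That is the design problem the deity faces.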
Tracey Davis went to the second-floor girls' lavatory for her regular practice. Tracey drew her Gibson Flying V2 out of its case. Tracey inserted her Gibson Flying V2 back in its case. What was the point?
If you looked through the bars of the wrought iron cage by Luna Lovegood's bed you wouldn't see anything inside. The cage's door was sealed with a heavy padlock. Luna didn't remember locking it.
Tracey liked that there was a staircase from the Slytherin dormitories to Ravenclaw. It meant she didn't have to answer the raven's riddles. Tracey could answer the raven's riddles. If she wanted to. (Not that she had ever tried.) It was just demeaning.
The circular Ravenclaw common room was brightly lit by the tall windows all around it. A...
"Your fingers are all wrong. So is your posture. And how you hold it," said Myrtle.
Tracey tried again. Her fingers hurt.
"Still wrong. Sit like this," Myrtle demonstrated.
Tracey mirrored Myrtle.
"No. Sit literally right here where I'm sitting," said Myrtle.
Tracey reached into Myrtle. It felt like ice water. Myrtle sat still. Tracey gritted her teeth, took a deep breath and superimposed herself. Tracey's skin rippled grey where bits of Myrtle protruded.
"You feel cold," said Tracey.
"Ghosts can't feel temperature. We can't smell. We can't taste. We see in shades of grey. But we can hear," said Myrtle.
Myrtle moved her hands into position. Tracey followed.
Nearly Headless Nick's deathday anniversary was October 31st. All the Hogwarts ghosts had attended. Many wore formal white sheets.
Myrtle was already on stage. Tracey had transfigured herself...
A friend of mine has been bootstrapping a business-to-business software-as-a-service startup that's seeing serious growth. It needs someone who can put dedicated effort into scaling it, but my friend is near the end of their career and looking to retire. What do people do in this situation?
More details: they were running a traditional labor-limited small business and they automated some of the work. This automation was a huge improvement and they realized it could be useful to other companies. In early 2019 they had a web app ready and started taking external customers. In late 2020 they started to see serious growth, which has continued. They let me share some numbers:
This is a run rate of ~$340k/y, up from ~$100k/y a quarter ago, ~$52k/y a quarter before that, ~$12k/y a quarter before...
As the world knows, the FDA approved Biogen’s anti-amyloid antibody today, surely the first marketed drug whose Phase III trial was stopped for futility. I think this is one of the worst FDA decisions I have ever seen, because – like the advisory committee that reviewed the application, and like the FDA’s own statisticians – I don’t believe that Biogen really demonstrated efficacy. No problem, apparently. The agency seems to have approved it based on its demonstrated ability to clear beta-amyloid, and is asking Biogen to run a confirmatory trial to show efficacy.
So the FDA has, for expediency’s sake, bought into the amyloid hypothesis although every single attempt to translate that into a beneficial clinical effect has failed. I really, really don’t like the precedent that this
(originally posted at Secretum Secretorum)
I think there is something fascinating and useful about many of the observations, adages, and aphorisms that we (often sarcastically) designate as eponymous laws, effects, or principles in the same way we might for a scientific law. They are often funny (and memorable because of it), but many of them do speak to very fundamental aspects of human psychology and the human condition more generally. Murphy’s Law is probably the most well known example.
Murphy’s Law – “Anything that can go wrong will go wrong”
There are also a few lesser-known corollaries.
Murphy's Second Law – “Nothing is as easy as it looks”
Murphy's Third Law – “Everything takes longer than you think it will (even when you account for Murphy’s Third Law)”1
Murphy's Fourth Law – “If there is...