Arthur Conmy

Interpretability

Views my own

Comments

+1 to Neel. We just fixed a release bug and now pip install transformer-lens should install 1.16.0 (worked in a colab for me)
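(Not part of the fix itself, but a quick way to check which release a Colab actually picked up:)

```python
# Print the installed transformer-lens version; after the fixed release this
# should report 1.16.0.
from importlib.metadata import version

print(version("transformer-lens"))
```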

I think this discussion is sad, since each side seems to assume bad faith from the other. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work and by signal-boosting the bypassing refusal result, so they should (IMO) be acknowledged in the post rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. "Citing others" from this Chris Olah blog post), but the authors resolved this when pushed.

But on the other hand, "Section 6.2 of the RepE paper shows exactly this" and the accusations of plagiarism seem wrong, @Dan H. Changing experimental setups and scaling them to larger models is valuable original work.

(Disclosure: I know all authors of the post, but wasn't involved in this project)

(ETA: I added the word "bypassing". Typo.)

We use a learning rate of 0.0003 for all Gated SAE experiments, and also for the GELU-1L baseline SAE experiment. We arrived at this value by sweeping for the optimal baseline learning rate on GELU-1L.

For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by , motivated by wanting better hyperparameter transfer, and so changed the learning rate to 0.001 or 0.00075 for all runs (currently in Figure 1, only the attention output pre-linear site uses 0.00075; in the re-release we'll state all the values used). We didn't see a noticeable difference in the Pareto frontier between 0.001 and 0.00075, so we did not sweep the baseline hyperparameter further than this.
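For illustration, a rough sketch of where a normalizer like this slots into a standard SAE loss; the mean squared input-activation norm below is only an illustrative stand-in (an assumption), not necessarily the divisor we used:

```python
import torch

def baseline_sae_loss(x, x_hat, feature_acts, l1_coeff):
    # x, x_hat: (batch, d_model) input activations and SAE reconstructions;
    # feature_acts: (batch, d_sae) SAE feature activations.
    l2 = (x_hat - x).pow(2).sum(dim=-1)
    # Dividing the L2 term by a statistic of the input activations keeps the
    # loss scale roughly comparable across models/sites, which is the
    # hyperparameter-transfer motivation above. The mean squared activation
    # norm is only an illustrative choice of divisor here.
    normalizer = x.pow(2).sum(dim=-1).mean().detach()
    l1 = feature_acts.abs().sum(dim=-1)
    return (l2 / normalizer + l1_coeff * l1).mean()
```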

Oh oops, thanks so much. We'll update the paper accordingly. Nit: it's actually 



(it's just minimizing a quadratic)
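(Writing the generic form of that quadratic step with placeholder symbols $a$, $b$, $c$, rather than the specific tracked quantities:)

$$f(\alpha) = a\,\alpha^2 - 2b\,\alpha + c,\quad a > 0 \;\Longrightarrow\; \alpha^\star = \frac{b}{a}, \qquad f(\alpha^\star) = c - \frac{b^2}{a}.$$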

ETA: the reason we have complicated equations is that we didn't compute  during training (this quantity is kinda weird). However, you can compute  from quantities that are usually tracked in SAE training. Specifically,  and all terms here are clearly helpful to track in SAE training.

We haven't tried this yet. Thanks, that's a good hypothesis.

I suspect that the mean-centering paper (https://arxiv.org/abs/2312.03813) is just cancelling the high-frequency features, and if so, this is a good explanation for why taking differences is important in activation steering.

(Though it doesn't explain why the SAEs learn several high-frequency features when trained on the residual stream)
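To spell out the cancellation intuition, a toy sketch (not code from either paper; names and shapes are illustrative):

```python
def diff_of_means_steering(acts_pos, acts_neg):
    # acts_pos, acts_neg: (n_prompts, d_model) residual-stream activations for
    # two contrasting prompt sets. Any component present with similar magnitude
    # in both sets (e.g. an always-on / high-frequency feature) cancels in the
    # subtraction, so only the contrast direction survives.
    return acts_pos.mean(dim=0) - acts_neg.mean(dim=0)

def single_mean_steering(acts_pos):
    # Steering with a raw mean keeps the shared component instead of cancelling it.
    return acts_pos.mean(dim=0)
```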

Yes, pretty much.

There's some work on transferring steering vectors, e.g. the Llama-2 steering paper (https://arxiv.org/abs/2312.06681) shows that you can transfer steering vectors from the base model to the chat model, and I saw results at a hackathon once suggesting that you can train residual-stream SAEs on early layers and transfer them to some later layers, too. But retraining is likely what our follow-up work will do (this post only used two different SAEs).
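Mechanically, transferring a steering vector just means adding the base-model direction into the other model's residual stream. A rough sketch with TransformerLens hooks, where the model name, layer, scale, and vector are placeholders rather than the settings from the Llama-2 paper or this post:

```python
import torch
from transformer_lens import HookedTransformer

# Placeholder setup: in practice the steering vector would be extracted from
# the base model (e.g. as a difference of means) and then added into the chat
# variant's residual stream at the same layer.
chat_model = HookedTransformer.from_pretrained("gpt2")  # stand-in for a chat model
layer, scale = 6, 4.0
hook_name = f"blocks.{layer}.hook_resid_post"

steering_vec = torch.randn(chat_model.cfg.d_model)  # placeholder for a base-model vector

def add_steering(resid, hook):
    # resid: (batch, pos, d_model); add the base-model direction at every position.
    return resid + scale * steering_vec

tokens = chat_model.to_tokens("I think the best thing about dogs is")
logits = chat_model.run_with_hooks(tokens, fwd_hooks=[(hook_name, add_steering)])
```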

Why is CE loss >= 5.0 everywhere? Looking briefly at GELU-1L over 128 positions (a short sequence length!) I see our models get 4.3 CE loss. 5.0 seems really high?

Ah, I see your section on this, but I doubt that bad data explains all of this. Are you using a very small sequence length, or an odd dataset?
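For reference, a rough sketch of one way to measure this with TransformerLens (assuming the gelu-1l checkpoint there is the right model; the filler text below is just a placeholder, not a proper eval set):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gelu-1l")
# 128-token context, matching the short sequence length mentioned above.
text = "The quick brown fox jumps over the lazy dog. " * 30
tokens = model.to_tokens(text)[:, :128]
loss = model(tokens, return_type="loss")  # mean next-token cross-entropy
print(loss.item())
```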

Answer by Arthur Conmy · Mar 08, 2024

> From my perspective this term appeared around 2021 and became basically ubiquitous by 2022


I don't think this is correct. To add to Steven's answer, the abstract of the "GPT-1" paper from 2018 discusses:

> ...generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task

and the assumption at the time was that the finetuning step was necessary for the models to be good at a given task. This assumption persisted for a long time, with academics finetuning BERT on tasks that GPT-3 would eventually significantly outperform them on. You can tell this from how cautious the GPT-1 authors are about claiming the base model could do anything; they sound very quaint:

> We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability

The fact that Pythia generalizes to longer sequences but GPT-2 doesn't isn't very surprising to me -- getting long-context generalization to work is a key motivation for rotary positional embeddings, e.g. see the original paper https://arxiv.org/abs/2104.09864
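For reference, a minimal sketch of the rotary rotation itself (rotate-half convention; this glosses over details like GPT-NeoX-style models applying rotary to only part of each head):

```python
import torch

def apply_rope(x, base=10000.0):
    # x: (seq, n_heads, d_head) queries or keys; d_head must be even.
    seq, n_heads, d_head = x.shape
    half = d_head // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)          # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]              # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle; relative position
    # then only enters q·k through angle differences, rather than through a
    # learned absolute position embedding as in GPT-2.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```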

Do you apply LR warmup immediately after doing resampling (i.e. immediately reducing the LR, and then slowly increasing it back to the normal value)? In my GELU-1L blog post I found this pretty helpful (in addition to doing LR warmup at the start of training)
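To be concrete, a small sketch of the kind of schedule I mean (the warmup length and resampling interval here are placeholders):

```python
def lr_at_step(step, base_lr, warmup_steps, resample_steps):
    # Linear warmup from 0 to base_lr at the start of training, and again for
    # warmup_steps after each resampling step.
    # resample_steps: iterable of steps at which dead features were resampled.
    last_reset = 0
    for r in resample_steps:
        if r <= step:
            last_reset = r
    since_reset = step - last_reset
    if since_reset >= warmup_steps:
        return base_lr
    return base_lr * since_reset / warmup_steps

# e.g. resampling every 25k steps with a 1k-step warmup:
# lrs = [lr_at_step(s, 3e-4, 1_000, range(25_000, 200_000, 25_000)) for s in range(200_000)]
```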
