Interpretability
Views my own
I would say a better reference for the limitations of ROME is this paper: https://aclanthology.org/2023.findings-acl.733
Short explanation, per Neel's summary: editing in the Rome fact will also make loosely related prompts, e.g. "The Louvre is cool. Obama was born in" ..., be completed with " Rome" too.
I agree that Twitter is a worse use of time.
Going to posters for works you already know, to talk to the authors, seems like a great idea, and I do it. Re-reading your OP, you suggest things like checking whether papers are fake or not in poster sessions. Maybe you just meant papers that you already knew about? It sounded as if you were suggesting doing this for random papers, which I'm more skeptical about.
My opinion is that going to poster sessions and orals, pre-researching papers, etc. at ICML/ICLR/NeurIPS is pretty valuable for new researchers, and I wish I had done this before having any papers (you don't need to have any papers to go to a conference). See also Thomas Kwa's comment about random intuitions learnt from going to a conference.
After this point, I agree with Leo that it would be a waste of my time to go to posters/orals or to pre-research papers. Maybe there's some value in this for conceptual research, but for most empirical work I'm very skeptical (most papers are not good, it takes my time to figure out whether a paper is good or not, etc.).
If there are some very common features in particular layers (e.g. an 'attend to BOS' feature), then restricting one expert to be active at a time will potentially force SAEs to learn common features in every expert.
+1 to similar concerns -- I would probably have left one expert always on, which should remove some redundant features.
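To illustrate what I mean, here is a minimal sketch of a Switch-style top-1 routed SAE with one always-active shared expert; the class and variable names are illustrative, not from the paper's code:

```python
import torch
import torch.nn as nn

class SharedExpertSAE(nn.Module):
    """Sketch: top-1 routed SAE plus one always-on shared expert."""
    def __init__(self, d_model: int, d_expert: int, n_experts: int):
        super().__init__()
        # Always-active expert, intended to absorb common features
        # (e.g. an 'attend to BOS' feature) exactly once.
        self.shared_enc = nn.Linear(d_model, d_expert)
        self.shared_dec = nn.Linear(d_expert, d_model, bias=False)
        self.router = nn.Linear(d_model, n_experts)
        self.expert_encs = nn.ModuleList(
            nn.Linear(d_model, d_expert) for _ in range(n_experts)
        )
        self.expert_decs = nn.ModuleList(
            nn.Linear(d_expert, d_model, bias=False) for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Top-1 routing as in a Switch SAE, plus the shared expert,
        # so routed experts need not each relearn common features.
        expert_idx = self.router(x).argmax(dim=-1)          # [batch]
        recon = self.shared_dec(torch.relu(self.shared_enc(x)))
        for i in range(len(self.expert_encs)):
            mask = (expert_idx == i).unsqueeze(-1).float()  # [batch, 1]
            acts = torch.relu(self.expert_encs[i](x))
            recon = recon + mask * self.expert_decs[i](acts)
        return recon
```

The point of the shared expert is that features active on most inputs get one canonical home, instead of a redundant copy inside every routed expert.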
Relevant further context: Gray Swan's Cygnet-8B Llama finetune (which uses circuit breakers and probably other safety training too, and had impressive-seeming 0.0 scores in some red-teaming evals in the paper) was jailbroken in 3 hours: https://x.com/elder_plinius/status/1813670661270786137
My takeaway from the blog post was that circuit breakers have fairly simple vulnerabilities. Since circuit breakers are an adversarial robustness method (not a capabilities method), I think you can update on the results of single case studies (i.e. worst-case evaluations rather than average-case evaluations).
Gemini models generally search the internet, which is why they show up on LMSYS without a knowledge cutoff date. Even when there's no source attached, the model still knows information from 4 days ago via the internet (image attached). But I think the [1] in your response shows the model did find an internet source for the GUID anyway??
Unless you're using the API here and the model is being weird? Without internet access, I expect it's possible to coax the string out, but the model refuses requests a lot, so I think it would require a bit of elbow grease.
Mistral and Pythia use rotary embeddings and don't have a positional embedding matrix. Which matrix are you looking at for those two models?
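As a quick sanity check (model names here are just examples), you can list a model's parameter names with HuggingFace transformers; GPT-2 exposes a learned positional matrix while Pythia does not:

```python
from transformers import AutoModelForCausalLM

# GPT-2 has a learned positional embedding matrix (transformer.wpe),
# but Pythia (GPTNeoX) and Mistral apply rotary embeddings inside
# attention, so no positional matrix shows up in their parameters.
for name in ["gpt2", "EleutherAI/pythia-70m"]:
    model = AutoModelForCausalLM.from_pretrained(name)
    pos_params = [n for n, _ in model.named_parameters()
                  if "wpe" in n or "position_embedding" in n]
    print(name, pos_params)  # empty list for Pythia
```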
They emailed some people about this: https://x.com/brianryhuang/status/1763438814515843119
The reason is that it may allow unembedding matrix weight stealing: https://arxiv.org/abs/2403.06634
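The core trick in that paper: full logit vectors all lie in a subspace of rank at most the hidden dimension, so stacking enough of them reveals the unembedding matrix's shape (and span). A toy sketch with synthetic stand-in matrices, not a real API call:

```python
import numpy as np

# logits = W_unembed @ h, so a stack of full logit vectors from many
# prompts has rank <= d_hidden << vocab_size; SVD recovers d_hidden
# (and the column span of W_unembed, up to rotation).
vocab_size, d_hidden, n_queries = 1000, 64, 256
W_unembed = np.random.randn(vocab_size, d_hidden)  # stand-in weights
H = np.random.randn(d_hidden, n_queries)           # hidden states per query
logits = W_unembed @ H                             # what a full-logprob API leaks
singular_values = np.linalg.svd(logits, compute_uv=False)
print((singular_values > 1e-6).sum())              # ~64: the hidden dimension
```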
Yes. On the AGI safety and alignment team we are working on activation steering - e.g. Alex Turner, who invented the technique with collaborators, is working on this, and the first author of "a few tokens deep" is currently interning on the Gemini Safety team mentioned in this post. We don't have hard and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see "safety practices we can test right now" as a research goal.
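For readers unfamiliar with activation steering, here is a minimal ActAdd-style sketch assuming a HuggingFace-style decoder; the hook and all names are illustrative, not our internal code:

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 5.0):
    """Returns a forward hook that adds alpha * direction to the
    residual stream output of a transformer block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # steer every position
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch: derive `direction` from a contrast pair of prompts
# (e.g. activations on "love" minus "hate" at some layer), then:
# handle = model.transformer.h[6].register_forward_hook(
#     make_steering_hook(direction))
# ...generate as usual, then handle.remove()
```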