Sequences

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Mechanistic Interpretability Puzzles
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability
My Overview of the AI Alignment Landscape

Comments

> 1. Global Threshold - Let's treat all features the same. Set all feature activations less than [0.1] to 0 (**this is equivalent to adding a constant to the encoder bias**).

The bolded part seems false? Thresholding maps an original activation of 0.2 to a new activation of 0.2, while adding 0.1 to the encoder bias maps an original activation of 0.2 to a new activation of 0.1. I.e., changing the encoder bias changes the value of every activation, while thresholding only affects small ones.
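To make the distinction concrete, here is a minimal numpy sketch (the 0.05/0.2/1.0 activation values are made up for illustration):

```python
import numpy as np

# Hypothetical post-ReLU feature activations, for illustration only.
acts = np.array([0.05, 0.2, 1.0])

# Global threshold: zero out activations below 0.1, leave the rest untouched.
thresholded = np.where(acts < 0.1, 0.0, acts)
print(thresholded)  # [0.  0.2 1. ]

# Subtracting 0.1 via the encoder bias (and re-applying ReLU) shifts everything down.
bias_shifted = np.maximum(acts - 0.1, 0.0)
print(bias_shifted)  # [0.  0.1 0.9]
```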

+1 that I'm still fairly confused about in-context learning; induction heads seem like a big part of the story, but we're still confused about those too!

This is not a LessWrong dynamic I've particularly noticed, and describing it as invisible helicopter blades seems inaccurate to me.

We've found slightly worse results for MLPs, but nowhere near 40%; I expect you're training your SAEs badly. What exact metric equals 40% here?

Thanks for the post; I found it moving. You might want to add a timestamp at the top saying "written in Nov 2023" or something, since otherwise the OpenAI board stuff is jarring.

Thanks! This inspired me to buy multiple things that I've been vaguely annoyed to lack.

Thanks for writing this up; I found it useful to have some of the maths spelled out! In particular, I think the equation constraining *l*, the number of simultaneously active features, is likely crucial for bounding the number of features in superposition.
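For reference, one standard version of this kind of bound (a sketch of the usual almost-orthogonal-vectors argument, which may not be the exact equation from the post): if features are stored as unit vectors in $\mathbb{R}^d$ with pairwise interference at most $\varepsilon$, reading off one feature while $l$ others are active picks up noise of order $l\varepsilon$, so we need $\varepsilon \lesssim 1/l$, and the number of such directions is exponential in $\varepsilon^2 d$:

$$N(\varepsilon, d) \approx e^{\Theta(\varepsilon^2 d)}, \qquad \text{noise} \sim l\varepsilon \;\Rightarrow\; \varepsilon \lesssim \frac{1}{l} \;\Rightarrow\; N \lesssim e^{O(d/l^2)}.$$

So the sparsity *l* enters quadratically in the exponent, which is why it dominates the feature-count budget.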

This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks; my guess is many readers won't click through to the link, and are more likely to if they see the different names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list.
