+1, I'm still fairly confused about in-context learning too; induction heads seem like a big part of the story, but we're still confused about those as well!
This is not a LessWrong dynamic I've particularly noticed, and describing it as invisible helicopter blades seems inaccurate to me.
We've found slightly worse results for MLPs, but nowhere near 40%, so I expect you're training your SAEs badly. What exact metric equals 40% here?
Thanks for the post, I found it moving. You might want to add a timestamp at the top saying "written in Nov 2023" or something; otherwise the OpenAI board stuff is jarring.
Thanks for writing this up, I found it useful to have some of the maths spelled out! In particular, I think that the equation constraining l, the number of simultaneously active features, is likely crucial for constraining the number of features in superposition
This seems like a useful resource, thanks for making it! I think it would be more useful if you enumerated the different ARENA notebooks; my guess is many readers won't click through to the link, and they're more likely to if they see the individual names. And IMO the ARENA tutorials are much higher production quality than the other notebooks on that list.
The bolded part seems false? Thresholding maps a 0.2 original act -> 0.2 new act, while adding 0.1 to the encoder bias maps a 0.2 original act -> 0.1 new act. I.e., changing the encoder bias changes the value of all activations, while thresholding only affects small ones.
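To make the distinction concrete, here's a minimal sketch (my own illustration, not code from the post): the activation values and the 0.1 shift are the hypothetical numbers from the comment, and I'm assuming the bias change lowers every pre-ReLU activation by 0.1.

```python
import numpy as np

# Hypothetical SAE feature activations (post-ReLU)
acts = np.array([0.05, 0.2, 0.8])

# Thresholding at 0.1: small activations are zeroed,
# larger ones pass through UNCHANGED (0.2 stays 0.2)
thresholded = np.where(acts > 0.1, acts, 0.0)

# Shifting the encoder bias so every pre-ReLU activation drops by 0.1:
# ALL activations change (0.2 becomes 0.1, 0.8 becomes 0.7)
bias_shifted = np.maximum(acts - 0.1, 0.0)
```

Both zero out the 0.05 activation, but only the bias shift distorts the surviving activations.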