Interpretability (ML & AI)

• Applied to The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks by Marius Hobbhahn 1h ago
• Applied to Interpretability: Integrated Gradients is a decent attribution method by StefanHex 9h ago
• Applied to Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning by Dan Braun 3d ago
• Applied to How To Do Patching Fast by Joseph Miller 9d ago
• Applied to Visualizing neural network planning by Nevan Wichers 12d ago
• Applied to Mechanistic Interpretability Workshop Happening at ICML 2024! by Neel Nanda 18d ago
• Applied to KAN: Kolmogorov-Arnold Networks by Gunnar_Zarncke 19d ago
• Applied to Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky 20d ago
• Applied to Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by Vanessa Kosoy 20d ago
• Applied to Refusal in LLMs is mediated by a single direction by Neel Nanda 22d ago
• Applied to Why I stopped being into basin broadness by Gunnar_Zarncke 24d ago
• Applied to Superposition is not "just" neuron polysemanticity by LawrenceC 25d ago
• Applied to Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda 25d ago
• Applied to ProLU: A Nonlinearity for Sparse Autoencoders by Noa Nabeshima 1mo ago
• Applied to How to use and interpret activation patching by StefanHex 1mo ago
• Applied to Past Tense Features by Can 1mo ago
• Applied to [Full Post] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda 1mo ago
• Applied to [Summary] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda 1mo ago
• Applied to Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Raemon 1mo ago
• Applied to Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai 1mo ago