LESSWRONG
Interpretability (ML & AI)
• Applied to The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks by Marius Hobbhahn 1h ago
• Applied to Interpretability: Integrated Gradients is a decent attribution method by StefanHex 9h ago
• Applied to Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning by Dan Braun 3d ago
• Applied to How To Do Patching Fast by Joseph Miller 9d ago
• Applied to Visualizing neural network planning by Nevan Wichers 12d ago
• Applied to Mechanistic Interpretability Workshop Happening at ICML 2024! by Neel Nanda 18d ago
• Applied to KAN: Kolmogorov-Arnold Networks by Gunnar_Zarncke 19d ago
• Applied to Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky 20d ago
• Applied to Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by Vanessa Kosoy 20d ago
• Applied to Refusal in LLMs is mediated by a single direction by Neel Nanda 22d ago
• Applied to Why I stopped being into basin broadness by Gunnar_Zarncke 24d ago
• Applied to Superposition is not "just" neuron polysemanticity by LawrenceC 25d ago
• Applied to Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda 25d ago
• Applied to ProLU: A Nonlinearity for Sparse Autoencoders by Noa Nabeshima 1mo ago
• Applied to How to use and interpret activation patching by StefanHex 1mo ago
• Applied to Past Tense Features by Can 1mo ago
• Applied to [Full Post] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda 1mo ago
• Applied to [Summary] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda 1mo ago
• Applied to Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Raemon 1mo ago
• Applied to Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai 1mo ago