Interpretability (ML & AI)

Edited by niplav, Multicore, et al.; last updated 22nd Jan 2025

Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.

Present-day machine learning systems are typically not very transparent or interpretable: you can observe a model's outputs, but the model cannot explain why it produced them. This makes it hard to determine the causes of biases in ML models.

A prominent subfield of neural network interpretability is mechanistic interpretability, which attempts to understand how neural networks implement the computations they perform, for example by finding circuits in transformer models. This can be contrasted with subfields of interpretability that seek to attribute an output to parts of a specific input, such as clarifying which pixels in an input image caused a computer vision model to output the classification "horse".
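
As a rough illustration of the input-attribution approach mentioned above, the sketch below computes a simple gradient saliency map for an image classifier: the gradient of the predicted class's score with respect to the input pixels, which highlights the pixels that most influenced that score. The torchvision ResNet-18 model and the random stand-in image are illustrative assumptions, not part of this entry.

```python
# A minimal gradient-saliency sketch: attribute a classifier's output to its input pixels.
# The model (torchvision ResNet-18) and the random stand-in image are placeholder
# assumptions; any differentiable image classifier would work the same way.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Stand-in for a real preprocessed image, shape (batch, channels, height, width).
image = torch.rand(1, 3, 224, 224, requires_grad=True)

logits = model(image)
predicted_class = logits.argmax(dim=1).item()

# Backpropagate the predicted class's score to get d(score)/d(pixel).
logits[0, predicted_class].backward()

# Saliency map: per-pixel gradient magnitude, taking the max over colour channels.
saliency = image.grad.abs().max(dim=1).values  # shape (1, 224, 224)
print(saliency.shape)
```

In practice one would visualise the saliency map as a heatmap over the original image; more careful attribution methods such as integrated gradients refine this basic idea, while mechanistic interpretability instead looks inside the network at its weights and activations rather than only at input-output gradients.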

See Also

  • Explainable Artificial Intelligence on Wikipedia
  • Transformer Circuits
  • Interpretable Machine Learning, textbook

Research

  • Circuits Thread
  • Transformer Circuits Thread
Posts tagged Interpretability (ML & AI)
  • A small update to the Sparse Coding interim research report (Lee Sharkey, Dan Braun, beren)
  • Interpretability in ML: A Broad Overview ([anonymous])
  • Timaeus's First Four Months (Jesse Hoogland, Daniel Murfet, Stan van Wingerden, Alexander Gietelink Oldenziel)
  • A Mechanistic Interpretability Analysis of Grokking (Neel Nanda, Tom Lieberum)
  • Toward A Mathematical Framework for Computation in Superposition (Dmitry Vaintrob, jake_mendel, Kaarel)
  • [Interim research report] Taking features out of superposition with sparse autoencoders (Lee Sharkey, Dan Braun, beren)
  • Chris Olah’s views on AGI safety (evhub)
  • A Longlist of Theories of Impact for Interpretability (Neel Nanda)
  • Re-Examining LayerNorm (Eric Winsor)
  • 200 Concrete Open Problems in Mechanistic Interpretability: Introduction (Neel Nanda)
  • A Problem to Solve Before Building a Deception Detector (Eleni Angelou, lewis smith)
  • Finding Neurons in a Haystack: Case Studies with Sparse Probing (wesg, Neel Nanda)
  • Tracing the Thoughts of a Large Language Model (Adam Jermyn)
  • How To Go From Interpretability To Alignment: Just Retarget The Search (johnswentworth)
  • The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (beren, Sid Black)