LESSWRONG
Interpretability (ML & AI)
• Applied to LLM/AI hype by Student192837465, 3h ago
• Applied to Rational Animations' intro to mechanistic interpretability by Writer, 2d ago
• Applied to Introducing SARA: a new activation steering technique by Alejandro Tlaie, 6d ago
• Applied to Exploring Llama-3-8B MLP Neurons by ntt123, 6d ago
• Applied to "What the hell is a representation, anyway?" | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents by IwanWilliams, 7d ago
• Applied to Closed-Source Evaluations by Jono, 7d ago
• Applied to Alignment Gaps by kcyras, 8d ago
• Applied to Relationships among words, metalingual definition, and interpretability by Bill Benzon, 8d ago
• Applied to SAEs Discover Meaningful Features in the IOI Task by Neel Nanda, 10d ago
• Applied to graphpatch: a Python Library for Activation Patching by Occam's Laser, 11d ago
• Applied to Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity. by Josh Levy, 11d ago
• Applied to Comments on Anthropic's Scaling Monosemanticity by Robert_AIZI, 12d ago
• Applied to Evidence of Learned Look-Ahead in a Chess-Playing Neural Network by Erik Jenner, 15d ago
• Applied to Apollo Research 1-year update by Marius Hobbhahn, 17d ago
• Applied to Finding Backward Chaining Circuits in Transformers Trained on Tree Search by abhayesian, 19d ago
• Applied to SAE sparse feature graph using only residual layers by crayhippo, 23d ago
• Applied to Announcing Human-aligned AI Summer School by Jan_Kulveit, 25d ago
• Applied to Anthropic announces interpretability advances. How much does this advance alignment? by Seth Herd, 25d ago