LESSWRONG
Interpretability (ML & AI)
• Applied to Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm by Spencer Becker-Kahn 5d ago
• Applied to Announcing Apollo Research by Marius Hobbhahn 7d ago
• Applied to Aligning an H-JEPA agent via training on the outputs of an LLM-based "exemplary actor" by Roman Leventov 7d ago
• Applied to The king token by p.b. 9d ago
• Applied to Why and When Interpretability Work is Dangerous by NicholasKross 10d ago
• Applied to Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2 by StefanHex 12d ago
• Applied to [Linkpost] Interpretability Dreams by DanielFilan 13d ago
• Applied to 'Fundamental' vs 'applied' mechanistic interpretability research by Lee Sharkey 14d ago
• Applied to Activation additions in a small residual network by Raemon 15d ago
• Applied to Gender Vectors in ROME’s Latent Space by Xodarap 16d ago
• Applied to A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N) by Joseph Bloom 21d ago
• Applied to My current workflow to study the internal mechanisms of LLM by Yulu Pi 21d ago
• Applied to Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) by Scott Emmons 23d ago
• Applied to Input Swap Graphs: Discovering the role of neural network components at scale by Alexandre Variengien 25d ago
• Applied to AI interpretability could be harmful? by Roman Leventov 1mo ago
• Applied to New OpenAI Paper - Language models can explain neurons in language models by Raemon 1mo ago
• Applied to AGI-Automated Interpretability is Suicide by __RicG__ 1mo ago
• Applied to Have you heard about MIT's "liquid neural networks"? What do you think about them? by Ppau 1mo ago