LESSWRONG
Interpretability (ML & AI)
• Applied to Inference-Time Intervention: Eliciting Truthful Answers from a Language Model by likenneth 7d ago
• Applied to Exploring Concept-Specific Slices in Weight Matrices for Network Interpretability by Raemon 7d ago
• Applied to Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm by Spencer Becker-Kahn 15d ago
• Applied to Announcing Apollo Research by Marius Hobbhahn 17d ago
• Applied to Aligning an H-JEPA agent via training on the outputs of an LLM-based "exemplary actor" by Roman Leventov 17d ago
• Applied to The king token by p.b. 19d ago
• Applied to Why and When Interpretability Work is Dangerous by NicholasKross 20d ago
• Applied to Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2 by StefanHex 22d ago
• Applied to [Linkpost] Interpretability Dreams by DanielFilan 23d ago
• Applied to 'Fundamental' vs 'applied' mechanistic interpretability research by Lee Sharkey 24d ago
• Applied to Activation additions in a small residual network by Raemon 25d ago
• Applied to Gender Vectors in ROME’s Latent Space by Xodarap 1mo ago
• Applied to A Mechanistic Interpretability Analysis of a GridWorld Agent-Simulator (Part 1 of N) by Joseph Bloom 1mo ago
• Applied to My current workflow to study the internal mechanisms of LLM by Yulu Pi 1mo ago
• Applied to Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS) by Scott Emmons 1mo ago
• Applied to Input Swap Graphs: Discovering the role of neural network components at scale by Alexandre Variengien 1mo ago
• Applied to AI interpretability could be harmful? by Roman Leventov 1mo ago
• Applied to New OpenAI Paper - Language models can explain neurons in language models by Raemon 1mo ago