Interpretability (ML & AI)

• Applied to LLM/AI hype by Student192837465 3h ago
• Applied to Rational Animations' intro to mechanistic interpretability by Writer 2d ago
• Applied to Introducing SARA: a new activation steering technique by Alejandro Tlaie 6d ago
• Applied to Exploring Llama-3-8B MLP Neurons by ntt123 6d ago
• Applied to "What the hell is a representation, anyway?" | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents by IwanWilliams 7d ago
• Applied to Closed-Source Evaluations by Jono 7d ago
• Applied to Alignment Gaps by kcyras 8d ago
• Applied to Relationships among words, metalingual definition, and interpretability by Bill Benzon 8d ago
• Applied to SAEs Discover Meaningful Features in the IOI Task by Neel Nanda 10d ago
• Applied to graphpatch: a Python Library for Activation Patching by Occam's Laser 11d ago
• Applied to Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity. by Josh Levy 11d ago
• Applied to Comments on Anthropic's Scaling Monosemanticity by Robert_AIZI 12d ago
• Applied to Evidence of Learned Look-Ahead in a Chess-Playing Neural Network by Erik Jenner 15d ago
• Applied to Apollo Research 1-year update by Marius Hobbhahn 17d ago
• Applied to Finding Backward Chaining Circuits in Transformers Trained on Tree Search by abhayesian 19d ago
• Applied to SAE sparse feature graph using only residual layers by crayhippo 23d ago
• Applied to Announcing Human-aligned AI Summer School by Jan_Kulveit 25d ago
• Applied to Anthropic announces interpretability advances. How much does this advance alignment? by Seth Herd 25d ago