Interpretability (ML & AI)
• Applied to Adam Optimizer Causes Privileged Basis in Transformer Language Models by Diego Caples 2d ago
• Applied to Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs by Daniel Lee 3d ago
• Applied to Automating LLM Auditing with Developmental Interpretability by Raemon 4d ago
• Applied to Redundant Attention Heads in Large Language Models For In Context Learning by skunnavakkam 7d ago
• Applied to Can Large Language Models effectively identify cybersecurity risks? by emile delcourt 9d ago
• Applied to Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM by Winnie Yang 9d ago
• Applied to [Paper] Measuring Visual Sycophancy in Multimodal Models by Jaehyuk Lim 12d ago
• Applied to AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization by DanielFilan 15d ago
• Applied to Understanding Hidden Computations in Chain-of-Thought Reasoning by rokosbasilisk 15d ago
• Applied to Showing SAE Latents Are Not Atomic Using Meta-SAEs by Bart Bussmann 16d ago
• Applied to Deception and Jailbreak Sequence: 1. Iterative Refinement Stages of Deception in LLMs by Winnie Yang 16d ago
• Applied to Establishing a Connection (Ch. 5-8) by a littoral wizard 16d ago
• Applied to Crafting Polysemantic Transformer Benchmarks with Known Circuits by Evan Anders 16d ago
• Applied to Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs by Kola Ayonrinde 16d ago
• Applied to Measuring Structure Development in Algorithmic Transformers by Raemon 17d ago
• Applied to What's going on with Per-Component Weight Updates? by 4gate 18d ago
• Applied to Finding Deception in Language Models by Esben Kran 19d ago
• Applied to Biases in Biases, or Critique of the Critique by ThePathYouWillChoose 21d ago
• Applied to Calendar feature geometry in GPT-2 layer 8 residual stream SAEs by Patrick Leask 23d ago