enricobottazzi — LessWrong

Researcher - I write at leku.ink

Not all features are created equal

TL;DR Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. State-of-the-art auto-interpreters, however, do not account for this variety. In this article, I propose AIR (Auto-Interpretability Router), a new protocol that uses a sentence embedder to identify...

How a failed experiment broke (and fixed) my view on feature labels

> TL;DR
>
> In this document, I propose baez, a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here.
>
> In the experiment, labels generated via both baez and eleuther_acts_top5 are compared across various evals. The...