brian kitano

Reminds me of [eigenwords](https://www.cis.upenn.edu/~ungar/eigenwords/) from way back when research was on word embeddings, which makes me consider how word embedding work (Mikolov etc.) was a useful precursor for interpretability in the only mildly related seq2seq models (RNNs, GPTs, etc.).

We're in a similar stage of research with interpretability - many mechinterp methods are variations on "take a word sequence, see what activates, do it at scale, cluster the activations;" this feels somewhat analogous to BoW/Skipgram approaches for word embeddings.
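As a rough illustration of that "collect activations, then cluster" loop, here's a minimal sketch. The `get_activations` helper is hypothetical (it stands in for hooking a chosen layer of a real model) and just returns random vectors so the snippet runs on its own.

```python
# Minimal sketch of the "run sequences, collect activations, cluster" loop.
# `get_activations` is a hypothetical stand-in for reading out a model layer;
# here it returns random vectors so the script runs end to end.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def get_activations(sequence: str, dim: int = 64) -> np.ndarray:
    """Placeholder: in practice, run `sequence` through the model and
    read out a chosen layer's hidden state (e.g. via a forward hook)."""
    return rng.normal(size=dim)

sequences = ["the cat sat", "stock prices fell", "the dog ran", "markets rallied"]
acts = np.stack([get_activations(s) for s in sequences])  # (n_sequences, dim)

# Cluster the activation vectors; each cluster is a candidate "feature".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(acts)
for seq, label in zip(sequences, kmeans.labels_):
    print(label, seq)
```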

seq2seq models ended up prevailing over word embedding models because they learned latent representations of words implicitly (or you might say that seq2seq was a generalization of word embeddings), which raises the question: what will the generalizations be over current activation-based mech-interp, e.g. probes? I think [Anthropic's replacement models](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) are a move in that direction.

More explicitly, the structural mapping is: (learn word embeddings explicitly) : (learn word embeddings latently from seq2seq) :: (learn activations explicitly) : (learn activations latently from replacement models).
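For concreteness on the "learn activations explicitly" side of that mapping, here is a minimal linear-probe sketch. The activation data and concept label are synthetic (they stand in for hidden states read out of a real model), so the snippet runs on its own.

```python
# Minimal linear-probe sketch: fit a logistic regression on activation
# vectors to predict a binary concept label. Synthetic data stands in
# for real hidden states read out of a model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 200, 64

# Pretend each row is one input's activation at some layer; the label is
# whether some concept (e.g. "is about finance") applies to that input.
labels = rng.integers(0, 2, size=n)
concept_direction = rng.normal(size=dim)
acts = rng.normal(size=(n, dim)) + np.outer(labels, concept_direction)

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```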

First, as already mentioned, formal processes are, more often than not, good and useful. They increase efficiency and safety. They act as a store of tacit organizational knowledge. Getting rid of them would make society collapse.


You could argue that the judicial branch in the Constitution was designed to limit accountability (tenuring Supreme Court justices for life) so that justices could judge freely, and that societal amnesia is why we are skeptical of that policy again.