Comments

Really nice summary of the confusion. Re: your point 3, this makes "induction heads" as a class of things feel a lot less coherent :( I had also not considered reading the behaviour on random sequences as induction being a fallback--do you think there may be induction-y heads that simply don't activate on random sequences because of how out-of-distribution they are?
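One way to poke at that last question (a rough sketch, not something from the post; the model name, token range, and 0.2 threshold are arbitrary placeholders) would be to compare per-head induction scores on random repeated tokens versus repeated natural text:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder; swap in the model under discussion

def induction_scores(tokens, seq_len):
    """Mean attention from each query to the token right after its previous occurrence."""
    _, cache = model.run_with_cache(tokens.to(model.cfg.device))
    scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
    for layer in range(model.cfg.n_layers):
        pattern = cache["pattern", layer]  # [batch, head, query, key]
        stripe = pattern.diagonal(offset=1 - seq_len, dim1=-2, dim2=-1)
        scores[layer] = stripe.mean(dim=(0, -1)).cpu()
    return scores

seq_len, batch = 50, 8
bos = torch.full((batch, 1), model.tokenizer.bos_token_id)

# random repeated tokens: BOS + seq + seq
rand = torch.randint(1000, 20000, (batch, seq_len))
rand_tokens = torch.cat([bos, rand, rand], dim=1)

# repeated natural text of the same shape
text = model.to_tokens("The quick brown fox jumps over the lazy dog. " * 6,
                       prepend_bos=False)[:, :seq_len].cpu()
nat_tokens = torch.cat([bos[:1], text, text], dim=1)

rand_scores = induction_scores(rand_tokens, seq_len)
nat_scores = induction_scores(nat_tokens, seq_len)

# heads that look induction-y on natural text but much less so on random tokens
# (the 0.2 gap is an arbitrary threshold)
print(((nat_scores - rand_scores) > 0.2).nonzero().tolist())
```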

I'll just preregister that I bet these weird tokens have very large norms in the embedding space.
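In case anyone wants to check: a quick look at the embedding-norm extremes, with the model name as a placeholder for whichever checkpoint the weird tokens come from.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder: use whichever checkpoint the anomalous tokens come from
tok = AutoTokenizer.from_pretrained(name)
W_E = AutoModelForCausalLM.from_pretrained(name).get_input_embeddings().weight.detach()

norms = W_E.norm(dim=-1)
for idx in norms.topk(10).indices.tolist():  # tokens with the largest embedding norms
    print(f"{norms[idx].item():.2f}  {tok.decode([idx])!r}")
```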

Cool that you figured that out; that easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since it means GPT-Neo's later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.
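A quick way to test the constant-offset picture (just a sketch I'd run, assuming GPT-Neo 125M is the relevant checkpoint): measure the norm of the mean embedding vector and how much of each embedding lies along it.

```python
import torch
from transformers import AutoModelForCausalLM

# assuming "EleutherAI/gpt-neo-125M" is the checkpoint under discussion
W_E = (AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
       .get_input_embeddings().weight.detach().float())

mean_vec = W_E.mean(dim=0)
print("norm of mean embedding vector:", mean_vec.norm().item())
print("average embedding row norm:   ", W_E.norm(dim=-1).mean().item())

# how much of each embedding lies along the shared mean direction,
# and how big the embeddings are once that shared component is removed
cos_with_mean = torch.nn.functional.cosine_similarity(W_E, mean_vec.unsqueeze(0), dim=-1)
print("avg cosine with the mean direction:", cos_with_mean.mean().item())
print("avg row norm after centering:      ", (W_E - mean_vec).norm(dim=-1).mean().item())
```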

Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

I'm pretty sure! I don't think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying the logit lens to the embedding matrix (i.e. $W_E W_U$) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude, since the cosine-similarity term in the dot product is essentially useless when all the embeddings point in nearly the same direction.
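Something like this should reproduce the check (a sketch in TransformerLens notation, batched so we never materialise the full vocab-by-vocab logit matrix):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # or "EleutherAI/gpt-neo-125M" for the contrast
W_E, W_U = model.W_E, model.W_U  # [d_vocab, d_model], [d_model, d_vocab]

top = torch.empty(model.cfg.d_vocab, dtype=torch.long)
for start in range(0, model.cfg.d_vocab, 1024):
    chunk = W_E[start:start + 1024] @ W_U  # logit lens applied directly to the static embeddings
    top[start:start + 1024] = chunk.argmax(dim=-1).cpu()

acc = (top == torch.arange(model.cfg.d_vocab)).float().mean()
print(f"fraction of tokens whose top logit-lens output is themselves: {acc.item():.3f}")
```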

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first attention layer right?

Huh, interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have those--some ideas to consider (a rough ablation check is sketched after the list):

  • the backup heads could have other main functions but incidentally are useful for the specific task we're looking at, so they end up taking the place of the main heads
  • thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren't interpretable in big models due to superposition
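The rough ablation check mentioned above: zero out one head and see which other heads' direct contribution to the answer logit increases. A sketch in TransformerLens; the IOI prompt and the choice of head 9.6 (one of GPT-2 small's name-mover heads) are just illustrative stand-ins for whatever heads you're looking at in GPT-Neo.

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")  # stand-in; swap in the GPT-Neo checkpoint
prompt = "When Mary and John went to the store, John gave a drink to"
answer_id = model.to_single_token(" Mary")
tokens = model.to_tokens(prompt)

def head_contributions(cache):
    # per-head outputs at the final position, projected onto the answer token's unembed direction
    stack, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
    stack = cache.apply_ln_to_stack(stack, layer=-1, pos_slice=-1)
    return (stack @ model.W_U[:, answer_id]).squeeze(-1), labels

_, clean_cache = model.run_with_cache(tokens)
clean_contrib, labels = head_contributions(clean_cache)

LAYER, HEAD = 9, 6  # e.g. name-mover head 9.6 in GPT-2 small; pick whichever head you're ablating

def ablate(z, hook):
    z[:, :, HEAD, :] = 0.0  # zero out this head's output at every position
    return z

with model.hooks(fwd_hooks=[(get_act_name("z", LAYER), ablate)]):
    _, abl_cache = model.run_with_cache(tokens)
abl_contrib, _ = head_contributions(abl_cache)

# heads whose direct contribution to the answer logit grows the most after the ablation
delta = abl_contrib - clean_contrib
for i in delta.topk(5).indices.tolist():
    print(labels[i], f"{clean_contrib[i].item():+.2f} -> {abl_contrib[i].item():+.2f}")
```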

Re: GPT-Neo being weird, one of the colabs in the original logit lens post shows that the logit lens is pretty decent for standard GPT-2 at varying sizes but basically useless for GPT-Neo, i.e. it outputs some extremely unlikely tokens for every layer before the last one. The bigger GPT-Neos are a bit better (some layers are kinda interpretable with the logit lens) but still bad. Basically, the residual stream is in a totally wacky basis until the last layer's computations, unlike GPT-2, which shows more stability (the whole reason the logit lens works).
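For anyone who wants to see this without digging up the colab, a bare-bones logit-lens loop looks something like this (prompt and model are placeholders; swap in a GPT-Neo checkpoint for the contrast):

```python
import torch
from transformer_lens import HookedTransformer

# swap "gpt2" for "EleutherAI/gpt-neo-125M" to see the contrast; the prompt is arbitrary
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:]      # [batch=1, pos=1, d_model] after this layer
    logits = model.unembed(model.ln_final(resid))   # decode through the final LayerNorm + unembed
    top_token = logits[0, -1].argmax().item()
    print(f"layer {layer:2d}: {model.tokenizer.decode([top_token])!r}")
```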

One weird thing I noticed with GPT-Neo 125M's embedding matrix is that the input static embeddings are super concentrated in vector space: the avg. pairwise cosine similarity is 0.960, compared to GPT-2 small's 0.225.
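Something along these lines should roughly reproduce those two numbers (a sketch; it samples a few thousand rows rather than the full vocab to keep the similarity matrix small):

```python
import torch
from transformers import AutoModelForCausalLM

def avg_pairwise_cos(name, n_sample=4000, seed=0):
    """Average pairwise cosine similarity over a random sample of input-embedding rows."""
    W_E = AutoModelForCausalLM.from_pretrained(name).get_input_embeddings().weight.detach().float()
    idx = torch.randperm(W_E.shape[0], generator=torch.Generator().manual_seed(seed))[:n_sample]
    M = torch.nn.functional.normalize(W_E[idx], dim=-1)
    sims = M @ M.T
    return ((sims.sum() - n_sample) / (n_sample * (n_sample - 1))).item()  # drop the diagonal of ones

print("gpt2:        ", avg_pairwise_cos("gpt2"))
print("gpt-neo-125M:", avg_pairwise_cos("EleutherAI/gpt-neo-125M"))
```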

On the later layers not doing much, I saw some discussion on the EleutherAI discord that probes can recover really good logit distributions from the middle layers of the big GPT-Neo models. I haven't looked into this more myself so I don't know how it compares to GPT-2. Just seems to be an overall profoundly strange model.
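I haven't tried to replicate that, but for concreteness, a toy version of such a probe might look like this (GPT-2 small as a stand-in, fit in-sample on a single repeated sentence, so it illustrates the setup rather than the EleutherAI result):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in; the claim was about the big GPT-Neos
LAYER = model.cfg.n_layers // 2

# toy data: in practice you'd fit the probe on a held-out corpus, not one repeated sentence
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog. " * 100)[:, :1024]
_, cache = model.run_with_cache(tokens)

X = cache["resid_post", LAYER][0].float()                   # [pos, d_model] mid-layer residuals
Y = cache["resid_post", model.cfg.n_layers - 1][0].float()  # [pos, d_model] final residuals

# ridge-regression linear probe: find A with X @ A ≈ Y
lam = 1e-2
d = X.shape[1]
A = torch.linalg.solve(X.T @ X + lam * torch.eye(d, device=X.device), X.T @ Y)

# decode both the probed and the true final residuals through the final LayerNorm + unembed
probe_logits = model.unembed(model.ln_final((X @ A).unsqueeze(0)))
model_logits = model.unembed(model.ln_final(Y.unsqueeze(0)))
agree = (probe_logits.argmax(-1) == model_logits.argmax(-1)).float().mean()
print(f"top-1 agreement with the model's own predictions: {agree.item():.2f}")
```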

Understand IOI in GPT-Neo: it's a same-size model, but it does IOI via composition of MLPs.

GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.