Detecting out-of-distribution text with surprisal and entropy
When large language models (LLMs) refuse to help with harmful tasks, attackers sometimes try to confuse them by adding bizarre strings of text called "adversarial suffixes" to their prompts. These suffixes look weird to humans, which raises the question: do they also look weird to the model? Alon & Kamfonas (2023) explored this by measuring perplexity, which is how "surprised" a language model is by text. They found that adversarial suffixes have extremely high perplexity compared to normal text. However, the relationship between perplexity and sequence length made the pattern tricky to detect in longer prompts.

We set out to reproduce their results, and found that the relationship they observed emerges naturally from how perplexity is calculated. This led us to look at token-level metrics instead of whole sequences. Using a novel visualization, we discovered an interesting pattern: adversarial tokens aren't just surprising, they're surprisingly surprising given the context.

This post has five parts:

1. First, we reproduce A&K's perplexity analysis using GPT-2.
2. We argue that the observed relationship emerges naturally from how perplexity is calculated when you combine normal text with adversarial suffixes.
3. We explore how others visualize token-level metrics, e.g. with color.
4. We propose a novel sparkline visualization that reveals temporal patterns.
5. Finally, we introduce S2 — an interpretable metric that captures how surprisingly surprising tokens are.

A Hugging Face Space is available to test S2 and the visualization.

This work is the deliverable of my project for the BlueDot AI Safety Fundamentals course[1]. In this post, "we" means "me and Claude[2]", with whom I collaborated.

1. Reproducing PPL results with GPT-2

Alon & Kamfonas noticed that when they plotted perplexity against sequence length in log-log space, the points had a striking pattern: a straight line with negative slope, meaning that shorter sequences had higher perplexity scores.
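To make the setup concrete, here is a minimal sketch of how per-token surprisal and sequence perplexity can be computed with GPT-2 through the Hugging Face transformers API. This is an illustration under our assumptions (model checkpoint `gpt2`, surprisal in nats), not A&K's original code, and the helper names are ours.

```python
# Minimal sketch: per-token surprisal and sequence perplexity with GPT-2.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def token_surprisals(text: str) -> torch.Tensor:
    """Negative log-probability (nats) of each token given its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    return -log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

def perplexity(text: str) -> float:
    """Sequence perplexity = exp of the mean per-token surprisal."""
    return token_surprisals(text).mean().exp().item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Because perplexity is the exponentiated *mean* surprisal, a fixed high-surprisal suffix contributes less and less to the average as the surrounding prompt grows, which is the length effect we return to below.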