joseph_c's Shortform

by joseph_c
23rd Aug 2024
1 comment
joseph_c · 1y

I recently came across Backpack Language Models and wanted to share it in case any AI interpretability people have not seen it. (I have yet to see this posted on LessWrong.)

The main difference between a backpack model and a standard LLM is that the backpack model enforces a much stricter rule for mapping input embeddings to output logits. Most LLMs allow the output logits to be an arbitrary function of the input embeddings; a backpack model requires the output logits to be a linear transformation of a linear combination of the input embeddings. The weights of this linear combination are computed by a transformer.
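
For concreteness, here is a minimal sketch of that prediction rule. The dimensions, variable names, and the uniform weights at the end are made up for illustration; in the actual model each vocabulary item's vectors are the paper's "sense vectors", and the combination weights come from a transformer run over the input sequence rather than being supplied by hand.

```python
import torch

# Hypothetical dimensions, for illustration only.
vocab, d, n_senses, seq_len = 1000, 64, 4, 8

# Each vocabulary item gets n_senses vectors (the paper's "sense vectors").
sense_vectors = torch.randn(vocab, n_senses, d)

# Fixed output matrix used for the final linear map to logits.
unembed = torch.randn(vocab, d)

def backpack_logits(token_ids, contextualization_weights):
    """Backpack prediction rule (sketch):
    the hidden state at each position is a linear combination of the
    input tokens' sense vectors, and the logits are a fixed linear
    transformation of that combination.

    token_ids: (seq_len,) input token indices
    contextualization_weights: (seq_len, seq_len, n_senses), where
        alpha[i, j, l] says how much sense l of input token j contributes
        at position i (in the real model a transformer produces these).
    """
    senses = sense_vectors[token_ids]                       # (seq_len, n_senses, d)
    hidden = torch.einsum('ijl,jld->id',
                          contextualization_weights, senses)  # (seq_len, d)
    return hidden @ unembed.T                                # (seq_len, vocab)

# Example call with uniform weights standing in for the transformer's output.
tokens = torch.randint(0, vocab, (seq_len,))
alpha = torch.full((seq_len, seq_len, n_senses), 1.0 / (seq_len * n_senses))
logits = backpack_logits(tokens, alpha)
```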

The nice thing about backpack models is that they are somewhat easier to interpret/edit/control: the output logits are a linear function of the input embeddings, so you can directly observe how changing an embedding changes the outputs.
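
To make the editing point concrete, here is a toy sketch (all names and numbers are hypothetical, not from the paper) of why the linearity helps: an edit to one of a token's vectors shifts the logits by an exactly predictable amount.

```python
import torch

# Toy setup: a small unembedding matrix and one position's contributions.
vocab, d = 50, 16
unembed = torch.randn(vocab, d)

# Suppose position i reads one vector v with combination weight alpha,
# plus some fixed contribution from everything else in the sequence.
alpha, v = 0.3, torch.randn(d)
other_contributions = torch.randn(d)

logits_before = (other_contributions + alpha * v) @ unembed.T

# Edit the vector (e.g. to remove an unwanted association).
delta = torch.randn(d)
logits_after = (other_contributions + alpha * (v + delta)) @ unembed.T

# Because the map to logits is linear, the change is exactly predictable.
predicted_change = alpha * (delta @ unembed.T)
assert torch.allclose(logits_after - logits_before, predicted_change, atol=1e-4)
```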
