This is my project proposal for Pivotal. Apply as a mentee by May 3rd.
The field has accumulated a vocabulary of computational primitives (induction heads, skip-trigrams) through post-hoc analysis. We propose building a toy language from these known primitives and training tensor-transformers on it (see an early trained example in the last section).
This lets us study fundamental problems (suppression & error correction, compositionality/circuits, dev-interp, etc) with the odds stacked in our favor:
We know the data-generating process (DGP) - what the bigram statistics, skip-trigrams, and induction patterns are, and how they interfere with each other (a minimal sampler sketch follows this list).
Tensor-transformers make compositionality clear as day (ie you can find relationships between any model components solely from the weights, whereas normal NNs require running data through them).
This is still a transformer on a language task - results learned here should straightforwardly apply to real LLMs.
Modifiable complexity - we can change the complexity of the data, number of layers, width of model, etc (in general, we can easily train a bespoke model to verify a target hypothesis).
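To make this concrete, here is a minimal sketch of the kind of DGP I have in mind (Python; the vocabulary, rule tables, and probabilities are illustrative placeholders, not the actual toy language): sample from bigram statistics by default, and let skip-trigram rules override them whenever their trigger token appears earlier in the context.

```python
import random

# Illustrative toy DGP: bigram base distribution + skip-trigram overrides.
# All vocab, rules, and probabilities below are placeholders, not the real toy language.
BIGRAMS = {
    "alice": [("sees", 0.7), ("helps", 0.2), ("finds", 0.1)],
    "bob":   [("knows", 0.7), ("likes", 0.2), ("meets", 0.1)],
}
SKIP_TRIGRAMS = [  # (A ... B -> C): if A appeared earlier and the current token is B, emit C
    ("beach", "big", "at"),
    ("garden", "old", "and"),
]
VOCAB = list(BIGRAMS) + ["beach", "garden", "big", "old", "at", "and"]

def sample_next(context: list[str]) -> str:
    cur = context[-1]
    # Skip-trigram rules take priority over the bigram statistics.
    for a, b, c in SKIP_TRIGRAMS:
        if cur == b and a in context[:-1]:
            return c
    if cur in BIGRAMS:
        words, probs = zip(*BIGRAMS[cur])
        return random.choices(words, weights=probs)[0]
    return random.choice(VOCAB)  # tokens without a rule fall back to uniform

def sample_sequence(length: int = 16) -> list[str]:
    seq = [random.choice(VOCAB)]
    while len(seq) < length:
        seq.append(sample_next(seq))
    return seq

print(" ".join(sample_sequence()))
```

Because every rule is written down explicitly, anything the trained model does can be checked against this table rather than against our intuitions about what the data "probably" contains.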
Specific Research Directions
Improve the DGP - extend it to include computational patterns beyond n-grams (eg nested structure (brackets, quotes), long-range dependencies, context-sensitive transitions, etc) learned from existing datasets.
Interp-across-time - are there dependent structures during training (eg must the model learn X before it can learn Y)? (Most similar to work by Naomi Saphra; a checkpoint-tracking sketch follows this list.)
Building interp tools - what techniques (existing or novel) can be used to find these ground-truth features?
Phenomenon studies - use the controlled setup to characterize specific computational phenomena (suppression, error correction, compositional reuse) with ground-truth verification.
Tensor Interp - because we're using a tensor-transformer, there may be new techniques available to us (prior familiarity with tensor networks is a prerequisite for this direction)
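For the interp-across-time direction, a minimal sketch of what I mean (the probe set, checkpoint predictors, and names here are hypothetical placeholders): score each ground-truth rule separately at every training checkpoint and look for ordering constraints in when they get learned.

```python
from typing import Callable

# Hypothetical probe sets: for each named ground-truth rule, (context, expected next token)
# pairs generated straight from the DGP's rule tables.
PROBES = {
    "bigram:alice->sees":  [(["alice"], "sees")],
    "skip:beach..big->at": [(["beach", "the", "very", "big"], "at")],
}

def rule_accuracies(predict: Callable[[list[str]], str]) -> dict[str, float]:
    """Score each known rule separately for one checkpoint's greedy predictions."""
    scores = {}
    for rule, cases in PROBES.items():
        hits = sum(predict(ctx) == target for ctx, target in cases)
        scores[rule] = hits / len(cases)
    return scores

def learning_curves(checkpoints: list[Callable[[list[str]], str]]) -> list[dict[str, float]]:
    # One dict of per-rule accuracies per checkpoint; ordering constraints ("learn X
    # before Y") show up as one curve reliably rising before another across seeds.
    return [rule_accuracies(predict) for predict in checkpoints]

# Toy usage with a dummy predictor standing in for a loaded checkpoint:
dummy = lambda ctx: "sees" if ctx[-1] == "alice" else "?"
print(learning_curves([dummy]))  # [{'bigram:alice->sees': 1.0, 'skip:beach..big->at': 0.0}]
```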
High Level View
I'm shooting for a healthy feedback loop of:
1. Use existing computational vocab (eg induction) to make a toy LLM
2. Use (1) to improve our basic knowledge of models (eg suppression) and learn new computational vocab
3. Repeat
4. ...
5. Profit
If we complete enough loops of this process, it could serve as a foundation for LLMs automating ambitious mech interp. In a sense, mech interp is already a verifiable task (ie find *simple* descriptions that replicate model behavior), but we need to resolve enough of our own confusions (& build better tools) first.
If this interests you, do apply to my (& Thomas') research stream (by May 3rd).
Current Trained Model
As an example, I've trained a 2-layer attn-only model. Looking at the embed -> unembed circuit:
There’s lots of apparent structure. Zooming into the Verb_T/NOUN square, you can see the bigram statistics for:
alice → sees(70%), helps(20%), finds(10%)
bob → knows(70%), likes(20%), meets(10%)
carol → calls(70%), tells(20%), sees(10%)
Etc
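This is what the "read it off the weights" claim from earlier looks like in practice. A rough sketch (the weight names W_E/W_U, the token index lists, and the tied-embedding assumption are placeholders for my setup, not exact code):

```python
import torch

def direct_path_bigrams(W_E: torch.Tensor, W_U: torch.Tensor,
                        nouns: list[int], verbs: list[int]) -> torch.Tensor:
    # W_E: (vocab, d_model), W_U: (d_model, vocab). The direct path is just W_E @ W_U;
    # each noun row, softmaxed over the transitive-verb columns, should roughly
    # reproduce the 70/20/10 bigram table above if the model has learned it cleanly.
    logits = W_E @ W_U                # (vocab, vocab) current-token -> next-token scores
    block = logits[nouns][:, verbs]   # the Verb_T/NOUN square from the plot
    return block.softmax(dim=-1)

# eg direct_path_bigrams(model.W_E, model.W_U,
#                        [tok["alice"], tok["bob"], tok["carol"]],
#                        [tok["sees"], tok["helps"], tok["finds"]])
```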
We can also look at the slice of the QK circuit:
For the skip-bigrams (4 rules, max_skip=8):
beach ... big → at
garden ... old → and
lake ... new → or
office ... small → to
Zooming in, you can clearly see two here:
But the other two are in the top-left & top-right boxes (they're negative, yes, but this is bilinear attn: a negative QK entry times a negative OV entry ends up contributing positively).
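A rough sketch of that sign check in code (weight names and shapes are placeholders for a single head; this only tracks signs of the bilinear QK/OV terms and ignores the softmax over attention scores):

```python
import torch

def head_circuits(W_E, W_Q, W_K, W_V, W_O, W_U):
    """Token-space QK and OV circuits for one attention head (weight names are placeholders)."""
    qk = W_E @ W_Q @ W_K.T @ W_E.T  # (vocab, vocab): how strongly each query token scores each key token
    ov = W_E @ W_V @ W_O @ W_U      # (vocab, vocab): what attending to a token writes to the logits
    return qk, ov

def rule_sign(qk: torch.Tensor, ov: torch.Tensor, src: int, dst: int, out: int) -> float:
    # For a rule "src ... dst -> out" (eg "beach ... big -> at"), ignoring the softmax,
    # the product of signs is what matters: negative QK times negative OV still boosts "out".
    return qk[dst, src].item() * ov[src, out].item()
```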