How Much Internal Structure Leaks Through a Language Model's Outputs?
TL;DR: I ran a simple experiment. I took GPT-2, fed it 100 diverse texts, collected only the output logits (no access to weights or activations), and tried to predict the model's internal activation structure from those outputs alone. Result: the principal...
Mar 15
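The teaser above is cut off before it names the metric, so the following is only an illustrative sketch of the general idea, not the post's actual method. It assumes a toy linear "model" instead of GPT-2 (random hidden activations and a random unembedding matrix) and uses linear CKA as one possible way to score how much activation structure survives in the logits. The names `linear_cka`, `hidden`, `unembed`, and all sizes except the 100 texts are assumptions made for the example.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear centered kernel alignment between two row-aligned feature matrices.

    Returns a value in [0, 1]; 1 means the two representations have the same
    similarity structure across examples (up to rotation and isotropic scale).
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)


rng = np.random.default_rng(0)
n_texts, d_hidden, vocab = 100, 16, 50  # 100 texts as in the post; other sizes are toy values

# Toy stand-in for internal activations: a few directions carry most of the variance.
hidden = rng.normal(size=(n_texts, d_hidden)) * np.linspace(3.0, 0.5, d_hidden)

# Toy stand-in for the unembedding: logits are a fixed linear readout of the activations.
unembed = rng.normal(size=(d_hidden, vocab))
logits = hidden @ unembed  # the only thing an outside observer gets to see

# Baseline: outputs with no relationship to the activations.
unrelated = rng.normal(size=(n_texts, vocab))

cka_logits = linear_cka(hidden, logits)
cka_baseline = linear_cka(hidden, unrelated)

print(f"CKA(hidden, logits)    = {cka_logits:.3f}")
print(f"CKA(hidden, unrelated) = {cka_baseline:.3f}")
```

In this toy setup the logits are an exact linear function of the activations, so a high score is expected by construction. With a real model the readout passes through layer norm and many nonlinear layers, so the interesting question is how far the score drops for earlier layers, and that would need the actual activations as ground truth.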