My current view is that alignment of advanced future AI systems will need to be approached from a large number of angles simultaneously: how public perception of AI is managed, how regulatory bodies set incentives for research, how investors direct funds, and how researchers build thoughtful systems and anticipate changes in model behavior. I believe I can best contribute by gaining a deep technical understanding of AI systems, such that I can better anticipate how changes to data/architecture/compute will impact behavior. Right now I find that exploring capabilities gives the strongest feedback signal for building that intuition, because the system immediately tells you when your next idea sucks or when your intuition is off.
Yes. To clarify further, dimension 447 is only scaled at the first position, since that is the only position where the massive activation occurs. The original line of reasoning was that for the activation to reach a value of 3000, at some point it was at 2950 and the gradient pushed it higher. I wanted to better understand why the gradient would keep pushing it higher.
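For concreteness, here is a rough sketch of that kind of first-position scaling, assuming a Hugging Face GPT-2 checkpoint; the layer index and scale factor below are placeholders, not the values from my actual run:

```python
# Sketch: scale residual-stream dimension 447 at position 0 only, via a forward hook.
# LAYER and SCALE are illustrative placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

LAYER, DIM, SCALE = 2, 447, 0.5

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

def scale_first_position(module, inputs, output):
    hidden = output[0].clone()        # block output hidden states: (batch, seq, d_model)
    hidden[:, 0, DIM] *= SCALE        # touch dim 447 at the first position only
    return (hidden,) + output[1:]     # returning a value replaces the block's output

handle = model.transformer.h[LAYER].register_forward_hook(scale_first_position)
with torch.no_grad():
    out = model(**tok("The quick brown fox", return_tensors="pt"))
handle.remove()
```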
The chain of reasoning goes:
One interesting follow-on: when I downloaded the pretrained GPT-2 model and continued training, the massive activation dropped from 3000 to 1000. Perhaps the momentum terms in the optimizer are a factor in causing the massive activation, since the optimizer state was reset when I started training. Open question: could interventions on the momentum terms in the optimizer lead to improved training dynamics?
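One way I could imagine poking at that question is to look directly at the Adam state for the weights that write into dimension 447. This is only a sketch: the helper, the choice of block, and the assumption that `mlp.c_proj` is the main contributor are all mine, not established facts.

```python
# Sketch: inspect (and optionally reset) AdamW momentum for the c_proj column
# that writes to residual dimension 447. Block index and the c_proj assumption
# are illustrative, not conclusions from the post.
import torch

def inspect_and_zero_momentum(model, optimizer, layer=2, dim=447, zero=False):
    param = model.transformer.h[layer].mlp.c_proj.weight  # Conv1D weight: (d_ff, d_model)
    state = optimizer.state.get(param, {})
    if not state:
        return None  # optimizer has not stepped on this parameter yet
    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
    stats = {
        "exp_avg_norm_into_dim": exp_avg[:, dim].norm().item(),      # first moment feeding dim 447
        "exp_avg_sq_norm_into_dim": exp_avg_sq[:, dim].norm().item(),  # second moment
    }
    if zero:  # crude intervention: reset momentum for just this column
        exp_avg[:, dim].zero_()
        exp_avg_sq[:, dim].zero_()
    return stats

# usage: call inside the training loop after optimizer.step(), e.g.
#   stats = inspect_and_zero_momentum(model, optimizer)
```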
Thanks for sharing! I found it interesting that the first-layer attention was identifying single-token sequences and giving false positives on repeated sequences. It's as if an employee's job were to determine whether he is the only person in the bar, and his heuristic misfires when everyone in the bar looks a lot like him. The clustering extension here was neat. Since gpt2-small uses absolute positional embeddings, I did not observe this here, though I also did not specifically look for it.
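If I do go looking, something like the following should be enough to eyeball it in gpt2-small; the prompts and the "attention to position 0" summary are just my own choices for a quick check:

```python
# Sketch: compare first-layer attention to position 0 for a normal prompt
# vs. a repeated-token prompt in gpt2-small.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

def first_layer_attention_to_pos0(text):
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    attn = out.attentions[0]          # layer 0: (batch, heads, seq, seq)
    return attn[0, :, -1, 0]          # last token's attention to position 0, per head

print(first_layer_attention_to_pos0("The cat sat on the mat"))
print(first_layer_attention_to_pos0("cat cat cat cat cat cat"))  # repeated-token case
```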
The observation you made on key sink neurons aligns with the MLP plots above, where the major contributions to the massive activation come from a small number of weights.
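For reference, here is a sketch of the kind of per-neuron decomposition behind those plots; the block index and the assumption that the contribution flows through `mlp.c_proj` into dimension 447 are my guesses at the setup, not a statement of what the plots actually computed:

```python
# Sketch: decompose the MLP's write into residual dim 447 as (post-GELU activation)
# times the corresponding c_proj column, and rank the per-neuron contributions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

LAYER, DIM = 2, 447  # illustrative block index and target dimension
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

acts = {}
def grab(module, inputs, output):
    acts["post_gelu"] = inputs[0].detach()   # input to c_proj = post-GELU activations

handle = model.transformer.h[LAYER].mlp.c_proj.register_forward_hook(grab)
with torch.no_grad():
    model(**tok("The quick brown fox", return_tensors="pt"))
handle.remove()

w_col = model.transformer.h[LAYER].mlp.c_proj.weight[:, DIM]  # (d_ff,) column writing to dim 447
contrib = acts["post_gelu"][0, 0] * w_col                     # per-neuron contribution at position 0
print(contrib.abs().topk(5))                                  # a handful of neurons should dominate
```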
The results from the simple patching were also neat.
My main conclusion from both your paper and this analysis is that the model is bending over backwards to implement special handling for an attention sink mechanism, which makes it brittle in contexts like repeated tokens. It is unclear to me why LLMs are not given a dedicated set of parameters for modeling attention sinks.
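To make that concrete, one shape such dedicated parameters could take is a learned sink embedding prepended to every sequence, similar in spirit to the learnable sink/register tokens that have been proposed elsewhere. This is only a sketch of the idea, not a claim about how any production model handles it:

```python
# Sketch: wrap GPT-2 so a single learned "sink" embedding is prepended to every
# sequence, giving the sink its own parameters instead of a hijacked dimension.
import torch
import torch.nn as nn

class SinkTokenWrapper(nn.Module):
    def __init__(self, model, d_model=768):  # 768 assumes gpt2-small
        super().__init__()
        self.model = model
        self.sink = nn.Parameter(torch.zeros(1, 1, d_model))  # dedicated sink parameters

    def forward(self, input_ids):
        embeds = self.model.transformer.wte(input_ids)                      # (batch, seq, d_model)
        embeds = torch.cat([self.sink.expand(embeds.size(0), -1, -1), embeds], dim=1)
        return self.model(inputs_embeds=embeds)
```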
I think we have different viewpoints of what the frontier is. The majority of the 20% improvements mentioned in this post are things I came up with myself, and they are pretty surface level. I have only been looking at LLMs for six months, in free time outside work, as something to tinker with, and I obviously don't consider myself an expert. I would anticipate that the actual research frontier at labs is substantially ahead, such that any moral discussions around this post are akin to debating whether an 11th-grade chemistry lab will encourage the creation of nuclear weapons.
Part of my hope in posting was to get technical feedback from a crowd that is knowledgeable about AI systems. I'm curious whether you can be more specific about why you believe this.