Finding the uncertainty vector in GPT2-scale transformers
In this post I explore a phenomenon in LLMs where training naturally consolidates information into a highly interpretable structure in the residual stream, via a positive feedback loop seeded by a small variation at initialization. I start with a toy example and work up to GPT2 scale, showing animations along the way.