What o3 Becomes by 2028
152334H · 10mo · 30

[minor technical disputes below; ignore if uninterested]

> This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.

> Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it's trivial to make it so, so they are probably already too large.

I'm a bit confused by this part. I believe the Llama 3 paper indicates that the training sequence length was increased mid-training.

In general, I don't understand linking scaling difficulties to the maximum scale-up world size. I don't believe the bandwidth/latency of InfiniBand H100 clusters presents a hard problem for current hyperscalers on the other parallelisms that run across nodes.
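
(As a back-of-the-envelope check on the quoted minibatch claim, here's the arithmetic as I understand it; the assumption that each 8-GPU scale-up domain contributes one sequence per optimizer step is mine, purely for illustration.)

```python
# Rough arithmetic behind the quoted minibatch claim (illustrative assumptions only).
cluster_gpus = 16_384       # 16K H100 cluster
scaleup_domain = 8          # GPUs per NVLink scale-up domain in the standard H100 config
seq_len = 8_192             # 8K-token sequences

# If each scale-up domain contributes one sequence per optimizer step,
# the data-parallel width sets the smallest possible minibatch.
dp_replicas = cluster_gpus // scaleup_domain          # 2,048
min_minibatch_seqs = dp_replicas                      # 2K sequences
min_minibatch_tokens = min_minibatch_seqs * seq_len   # ~16.8M tokens

print(dp_replicas, min_minibatch_seqs, f"{min_minibatch_tokens:,}")
# 2048 2048 16,777,216
```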

> For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's either 16 or 64 GPUs in either standard or Ultra variants.

I think it's plausible that the combination of torus topology + poor PCIe 5.0 bandwidth/latency will make a full TP=64 Trn2 configuration underperform your expectations, but we may have to wait for SemiAnalysis to provide good numbers on this.
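
(To make the concern concrete, here's a toy ring all-reduce cost model; the message size, bandwidth, and latency figures are placeholders I'm assuming for illustration, not measured NVLink or Trn2/NeuronLink numbers.)

```python
# Toy ring all-reduce cost model: why scale-up bandwidth/latency matters more
# as the tensor-parallel degree grows. All figures below are assumptions for
# illustration, not measurements of real hardware.

def ring_allreduce_seconds(bytes_per_rank: int, n: int, link_GBps: float, latency_us: float) -> float:
    """Standard ring all-reduce cost: each rank moves 2*(n-1)/n of its data
    over its link (in GB/s), and the operation takes 2*(n-1) latency hops."""
    bw_term = (2 * (n - 1) / n) * bytes_per_rank / (link_GBps * 1e9)
    lat_term = 2 * (n - 1) * latency_us * 1e-6
    return bw_term + lat_term

msg = 64 * 1024 * 1024  # 64 MiB of activations per rank per all-reduce (assumed)

# TP=8 over a fast intra-node link vs TP=64 over a slower, higher-latency path.
print(f"TP=8:  {ring_allreduce_seconds(msg, n=8,  link_GBps=400, latency_us=3)*1e3:.2f} ms")
print(f"TP=64: {ring_allreduce_seconds(msg, n=64, link_GBps=50,  latency_us=10)*1e3:.2f} ms")
# Roughly 0.3 ms vs 3.9 ms per all-reduce under these assumed numbers.
```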
