What o3 Becomes by 2028
152334H · 10mo · 30

[minor technical disputes below; ignore if uninterested]

> This might be an issue when training on H100 at this scale[1] and explain some scaling difficulties for labs that are not Google, or Anthropic later in 2025 once the Trn2 cluster becomes useful.

> Llama 3 405B was trained in minibatches with 2K sequences of 8K tokens, the smallest that 8-GPU scale-up domains of a 16K H100 cluster enable. If it was clearly optimal for minibatches to be larger, it's trivial to make it so, so they are probably already too large.

I'm a bit confused by this part. I believe the Llama 3 paper indicates that the training sequence length was increased mid-training.

In general, I don't understand linking scaling difficulties to the maximum scale-up world size. I don't believe the bandwidth/latency of InfiniBand H100 clusters presents a hard problem for current hyperscalers on the other parallelisms that run across nodes.
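
(As a back-of-the-envelope check on the quoted minibatch claim, here's the arithmetic as I understand it; the assumption that each 8-GPU scale-up domain contributes one sequence per optimizer step is mine, purely for illustration.)

```python
# Rough arithmetic behind the quoted minibatch claim (illustrative assumptions only).
cluster_gpus = 16_384       # 16K H100 cluster
scaleup_domain = 8          # GPUs per NVLink scale-up domain in the standard H100 config
seq_len = 8_192             # 8K-token sequences

# If each scale-up domain contributes one sequence per optimizer step,
# the data-parallel width sets the smallest possible minibatch.
dp_replicas = cluster_gpus // scaleup_domain          # 2,048
min_minibatch_seqs = dp_replicas                      # 2K sequences
min_minibatch_tokens = min_minibatch_seqs * seq_len   # ~16.8M tokens

print(dp_replicas, min_minibatch_seqs, f"{min_minibatch_tokens:,}")
# 2048 2048 16,777,216
```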

> For H100, that's only 8 GPUs in the standard configuration that seems to be used everywhere. For TPUv6e, that's a whole 256-chip pod, and this wasn't a constraint in older TPUs either. For Trn2, that's either 16 or 64 GPUs in either standard or Ultra variants.

I think it's plausible that the combination of torus topology + poor PCIe 5.0 bandwidth/latency will make a full TP=64 Trn2 configuration underperform your expectations, but we may have to wait for SemiAnalysis to provide good numbers on this.
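
(To make the concern concrete, here's a toy ring all-reduce cost model; the message size, bandwidth, and latency figures are placeholders I'm assuming for illustration, not measured NVLink or Trn2/NeuronLink numbers.)

```python
# Toy ring all-reduce cost model: why scale-up bandwidth/latency matters more
# as the tensor-parallel degree grows. All figures below are assumptions for
# illustration, not measurements of real hardware.

def ring_allreduce_seconds(bytes_per_rank: int, n: int, link_GBps: float, latency_us: float) -> float:
    """Standard ring all-reduce cost: each rank moves 2*(n-1)/n of its data
    over its link (in GB/s), and the operation takes 2*(n-1) latency hops."""
    bw_term = (2 * (n - 1) / n) * bytes_per_rank / (link_GBps * 1e9)
    lat_term = 2 * (n - 1) * latency_us * 1e-6
    return bw_term + lat_term

msg = 64 * 1024 * 1024  # 64 MiB of activations per rank per all-reduce (assumed)

# TP=8 over a fast intra-node link vs TP=64 over a slower, higher-latency path.
print(f"TP=8:  {ring_allreduce_seconds(msg, n=8,  link_GBps=400, latency_us=3)*1e3:.2f} ms")
print(f"TP=64: {ring_allreduce_seconds(msg, n=64, link_GBps=50,  latency_us=10)*1e3:.2f} ms")
# Roughly 0.3 ms vs 3.9 ms per all-reduce under these assumed numbers.
```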
