[minor technical disputes below; ignore if uninterested]
This might be an issue when training on H100s at this scale[1], and could explain some of the scaling difficulties for labs that are not Google (or Anthropic later in 2025, once the Trn2 cluster becomes useful).
Llama 3 405B was trained in minibatches of 2K sequences of 8K tokens, the smallest that the 8-GPU scale-up domains of a 16K H100 cluster enable. If it were clearly optimal for minibatches to be larger, it would be trivial to make them so, so they are probably already too large.
I'm a bit confused by this part. I believe the Llama 3 paper indicates that the training sequence length was increased mid-training.
In general, I don't understand linking scaling difficulties...
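For concreteness, here is a minimal sketch of the minibatch arithmetic behind the quoted claim. It assumes tensor parallelism stays inside one 8-GPU scale-up domain and that the overall parallelism layout ends up requiring at least one sequence per node per optimizer step; the variable names and the one-sequence-per-node framing are illustrative assumptions, not details taken from the Llama 3 paper.

```python
# Hypothetical back-of-the-envelope check of the quoted minibatch claim.
# Assumption: tensor parallelism is confined to an 8-GPU node (NVLink scale-up
# domain), and the data/pipeline-parallel layout across nodes ends up needing
# at least one sequence per node per optimizer step.

cluster_gpus = 16_384   # 16K H100s, as in the quoted comment
scaleup_domain = 8      # GPUs per node
seq_len = 8_192         # tokens per sequence (final Llama 3 405B config)

nodes = cluster_gpus // scaleup_domain           # 2,048 nodes
min_sequences = nodes                            # >= 1 sequence per node (assumed)
min_minibatch_tokens = min_sequences * seq_len   # lower bound on minibatch size

print(f"nodes: {nodes}")                                   # 2048
print(f"min sequences per minibatch: {min_sequences}")     # 2048 (~2K)
print(f"min minibatch tokens: {min_minibatch_tokens:,}")   # 16,777,216 (~16M)
```

Under these assumptions the lower bound works out to roughly 2K sequences, i.e. about 16M tokens per minibatch, matching the figure in the quoted passage.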