I'm low confidence on this, but I'd be pretty skeptical of a 10x from optimizations. FlashAttention gives roughly a 3x on attention itself (and it's less of a big deal on older hardware), but the rest of a transformer is already fairly efficient large matmuls. In general, modern GPUs see lower utilization with each passing generation, and GPT-2 was trained on TPUs, which famously have much higher utilization (~95% vs. ~60% on a pure matmul). That said, it's possible the whole thing was done so inefficiently that a 10x was there for the taking.
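To make the intuition concrete, here's an Amdahl's-law back-of-envelope. The numbers are illustrative assumptions, not measurements: suppose attention is ~25% of training FLOPs at GPT-2-style context lengths, and FlashAttention gives ~3x on just that portion.

```python
# Back-of-envelope: overall speedup when only the attention portion is accelerated.
# attn_frac (~0.25) and attn_speedup (~3x) are assumed, illustrative numbers.

def overall_speedup(attn_frac: float, attn_speedup: float) -> float:
    """Amdahl's law: the non-attention fraction runs at the old speed,
    the attention fraction runs attn_speedup times faster."""
    return 1.0 / ((1.0 - attn_frac) + attn_frac / attn_speedup)

print(f"{overall_speedup(0.25, 3.0):.2f}x")  # ~1.2x end-to-end, nowhere near 10x
```

Even a 3x attention kernel only buys ~1.2x end-to-end under these assumptions, which is why a 10x would have to come from the whole pipeline being inefficient, not from any single kernel.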