Has anyone reproduced double descent [] in the transformers they train (or, better, in GPT-like transformers)? Since grokking can be partly understood through transformer interpretability [], this seems like a possibly tractable direction.