This is a special post for quick takes by Arthur Conmy. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Has anyone done any reproduction of double descent [] on the transformers they train (or better, on GPT-like transformers)? Since grokking can be somewhat understood via transformer interpretability [], this seems like a possibly tractable direction.