I would quibble with the framing of this piece - as you note, the problem is not imitation learning itself (in principle, it can work!) but the limits of the standard transformer architecture.
A specific limit you could point to, to make this argument stronger, is that a depth-D transformer can implement at most O(D) steps of gradient descent, no matter how long its context window is. I think this is underappreciated. By design, transformers cannot perform sequential computation along the input dimension, only along the depth dimension. This is their main tradeoff.
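To make the depth bound concrete, here is a minimal NumPy sketch of the standard in-context-learning construction (in the spirit of results showing a transformer layer can emulate one gradient-descent step on a least-squares objective). Every detail — the dimensions, the learning rate `eta`, the linear-regression task — is an illustrative assumption, not a claim about any particular model; the point is only that the loop runs once per layer, so D layers yield exactly D optimization steps regardless of the context length n.

```python
import numpy as np

# Sketch: a depth-D stack where each "layer" performs one gradient descent
# step on an in-context least-squares problem. The number of GD steps is
# bounded by the depth D, not by the context length n.

rng = np.random.default_rng(0)
n, d, D = 256, 4, 8            # context length, feature dim, depth (all illustrative)
eta = 0.1                      # assumed step size

X = rng.normal(size=(n, d))    # in-context examples
w_true = rng.normal(size=d)
y = X @ w_true                 # noiseless targets for simplicity

w = np.zeros(d)                # state carried through the "forward pass"
for layer in range(D):         # one sequential update per layer
    grad = X.T @ (X @ w - y) / n   # full-batch least-squares gradient
    w = w - eta * grad

# After the forward pass we have performed exactly D optimization steps,
# however large n (the context) was. Doubling the context gives the
# gradient more data per step, but never more steps.
```

Growing n only changes what each step computes over; only growing D (or re-running the model, as chain-of-thought effectively does) buys more sequential computation.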