I would quibble with the framing of this piece - as you note, the problem is not imitation learning itself (in principle, it can work!) but the limits of the standard transformer architecture.
A specific limit you could point to, to make this argument stronger, is that a depth-D transformer can implement at most O(D) steps of gradient descent, no matter how long its context window is. I think this is underappreciated. By design, transformers cannot perform sequential computation along the input dimension, only along the depth dimension. This is their main tradeoff.
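To make the depth bound concrete, here is a minimal NumPy sketch of the standard in-context-learning construction (in the spirit of results showing a transformer layer can emulate one gradient-descent step on a least-squares objective). Every detail — the dimensions, the learning rate `eta`, the linear-regression task — is an illustrative assumption, not a claim about any particular model; the point is only that the loop runs once per layer, so D layers yield exactly D optimization steps regardless of the context length n.

```python
import numpy as np

# Sketch: a depth-D stack where each "layer" performs one gradient descent
# step on an in-context least-squares problem. The number of GD steps is
# bounded by the depth D, not by the context length n.

rng = np.random.default_rng(0)
n, d, D = 256, 4, 8            # context length, feature dim, depth (all illustrative)
eta = 0.1                      # assumed step size

X = rng.normal(size=(n, d))    # in-context examples
w_true = rng.normal(size=d)
y = X @ w_true                 # noiseless targets for simplicity

w = np.zeros(d)                # state carried through the "forward pass"
for layer in range(D):         # one sequential update per layer
    grad = X.T @ (X @ w - y) / n   # full-batch least-squares gradient
    w = w - eta * grad

# After the forward pass we have performed exactly D optimization steps,
# however large n (the context) was. Doubling the context gives the
# gradient more data per step, but never more steps.
```

Growing n only changes what each step computes over; only growing D (or re-running the model, as chain-of-thought effectively does) buys more sequential computation.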