Use more text than one token to avoid neuralese
You want to relay the contents of a transformer's output vector to the next input: next_input = encode(decode(output)). You're currently using next_input = embed(sample_token(output)) to do this. This compresses the output to one row of a lookup table. That's a pretty big squash. Too big, surely -- there must be...
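A minimal sketch of the single-token relay described above, in plain Python. It treats `output` as a vector of logits over a toy vocabulary, which is an assumption; `VOCAB_SIZE`, `D_MODEL`, `EMBED`, and `max_bits` are hypothetical names introduced only for illustration. It shows that `embed(sample_token(output))` can only ever return one of `VOCAB_SIZE` table rows, so one relayed token carries at most log2(VOCAB_SIZE) bits.

```python
import math
import random

VOCAB_SIZE = 8  # toy vocabulary size (assumption, not from the post)
D_MODEL = 4     # toy embedding width (assumption, not from the post)

rng = random.Random(0)

# Hypothetical embedding lookup table: one row of D_MODEL floats per token id.
EMBED = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, rng):
    """Sample one token id from the distribution implied by the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def embed(token_id):
    """Look up the embedding row for a token id."""
    return EMBED[token_id]


def max_bits(num_tokens, vocab_size=VOCAB_SIZE):
    """Upper bound on the information carried by num_tokens sampled tokens."""
    return num_tokens * math.log2(vocab_size)


# Stand-in for a transformer output, assumed here to be logits over the vocabulary.
output = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]

# The relay from the post: the whole output collapses to one row of EMBED.
next_input = embed(sample_token(output, rng))
```

Under these assumptions, `max_bits(1)` is 3 bits for the 8-token toy vocabulary, and relaying `k` tokens raises the bound to `k * log2(VOCAB_SIZE)`, which is the sense in which using more than one token widens the channel.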
Feb 1310