You want to relay the contents of a transformer's output vector to the next input: next_input = encode(decode(output)).
You're currently using next_input = embed(sample_token(output)) to do this.
This compresses the output to one row of a lookup table. That's a pretty big squash. Too big, surely -- there must be a better way.
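To make the contrast concrete, here is a toy version of that loop (PyTorch, with illustrative names and sizes rather than any particular model's API): sampling a token and re-embedding it is exactly the encode(decode(output)) round trip, restricted to one row of the embedding table.

```python
import torch

vocab_size, d_model = 50_000, 512
embedding = torch.nn.Embedding(vocab_size, d_model)  # the lookup table ("encode")
unembedding = torch.nn.Linear(d_model, vocab_size)   # vector -> token logits ("decode")

def next_input_from(output: torch.Tensor) -> torch.Tensor:
    logits = unembedding(output)                              # decode(output)
    token = torch.multinomial(torch.softmax(logits, -1), 1)   # sample_token(...)
    return embedding(token).squeeze(0)                        # embed(...): one row only

output = torch.randn(d_model)         # stand-in for the transformer's output vector
next_input = next_input_from(output)  # everything not captured by that one row is gone
```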
Enter neuralese. If you make next_input = output (or something like that), you lose no bandwidth at all. The bad news is that these vectors no longer correspond to natural language.
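In code, the neuralese variant just skips the vocabulary bottleneck entirely (again a sketch; the optional learned projection is the "or something like that"):

```python
import torch

d_model = 512
recurrence = torch.nn.Linear(d_model, d_model)  # optional learned map; identity also works

def next_input_neuralese(output: torch.Tensor) -> torch.Tensor:
    # No decode/encode round trip: the full vector is passed along, so no
    # bandwidth is lost, but there is also no token sequence for a human to read.
    return recurrence(output)
```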
But you can add more bandwidth by funneling each vector through more text. That way, encode(decode(output)) doesn't lose too much information.
You could have each vector decode to multiple tokens. Or even a cleverly chosen...
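Some back-of-the-envelope numbers for the bandwidth point (illustrative sizes, and raw capacity rather than the vector's actual information content):

```python
import math

d_model, bits_per_float = 4096, 16   # e.g. a float16 residual-stream vector
vocab_size = 50_000

vector_capacity = d_model * bits_per_float  # 65,536 bits of raw capacity
bits_per_token = math.log2(vocab_size)      # ~15.6 bits per sampled token

print(vector_capacity / bits_per_token)     # ~4,200 tokens to match raw capacity
# The usable information in the vector is far below this ceiling, but the gap
# still suggests that decoding each vector to several tokens, rather than one,
# preserves much more of encode(decode(output)).
```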
It shouldn't be any different from a normal LLM, because only the architecture changes, not the objective.
I think you're proposing a system where the model spits out several independently sampled/unordered tokens at a time. I agree that this would likely not converge to something human readable.
What I'm suggesting is that the main model's output vectors be given to a much more lightweight decoder, which converts them into natural language (probably autoregressively), as in the two papers linked. These architectures should be able to slot in wherever normal token-based models are used today.
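Something like the following sketch (hypothetical module names and sizes; the linked papers differ in the details): the main model emits one vector per step, and a small decoder autoregressively turns it into a few natural-language tokens.

```python
import torch
import torch.nn as nn

d_model, d_small, vocab_size, tokens_per_vector = 4096, 512, 50_000, 8

class LightweightDecoder(nn.Module):
    """Small autoregressive head: one main-model vector -> a short token sequence."""
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(d_model, d_small)  # condition on the big vector
        self.embed = nn.Embedding(vocab_size, d_small)
        self.rnn = nn.GRUCell(d_small, d_small)     # stand-in for a tiny transformer
        self.unembed = nn.Linear(d_small, vocab_size)

    def forward(self, output_vector: torch.Tensor) -> torch.Tensor:
        state = self.project(output_vector)
        token = torch.zeros(1, dtype=torch.long)    # hypothetical BOS id 0
        tokens = []
        for _ in range(tokens_per_vector):          # autoregressive decoding
            state = self.rnn(self.embed(token), state.unsqueeze(0)).squeeze(0)
            logits = self.unembed(state)
            token = torch.multinomial(torch.softmax(logits, -1), 1)
            tokens.append(token)
        return torch.cat(tokens)                    # a readable trace per vector

decoder = LightweightDecoder()
main_output = torch.randn(d_model)  # one output vector from the main model
trace = decoder(main_output)        # several natural-language tokens per vector
```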
On a separate note: I mentioned essays-per-vector to illustrate that bandwidth concerns don't make sense to me, but I don't...