Do you have an intuition for how hard it would be to keep the multiple-token outputs human-understandable? For sequences of individual tokens (i.e. sentences) we can train on human text, and the chains of thought of LLMs look vaguely like human writing.
For sequences of groups of tokens (and eventually sequences of essays) I'm uncertain how human-understandable the results would be. An argument against it might be: sequences of groups of tokens (parallel sentences?) are a novel modality, so LLMs will make up some language for them, but it may not be very human-like because there is no human "parallel sentence" language to train on.
It shouldn't be any different from a normal LLM, because only the architecture changes and not the objective.
I think you're proposing a system where the model spits out several independently sampled/unordered tokens at a time. I agree that this would likely not converge to something human-readable.
What I'm suggesting is that the main model's output vectors be given to a much more lightweight decoder, which converts them into natural language (probably autoregressively), as in the two papers linked. These architectures should be able to slot in wherever normal token-based models are used today.
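To make the wiring concrete, here's a toy sketch of that shape of system. Everything here is a made-up stand-in (random weights, tiny dimensions, a mean-pool "encoder"), not the actual architectures from the linked papers; the point is only that text sits between one vector out and the next vector in.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, K = 16, 32, 4  # vector dim, vocab size, tokens decoded per vector (toy sizes)

# Hypothetical weights standing in for a trained lightweight decoder.
W_out = rng.normal(size=(D, V))  # (vector + running state) -> token logits
W_emb = rng.normal(size=(V, D))  # token -> embedding, reused by the encoder

def decode(vector):
    """Autoregressively emit K tokens conditioned on one main-model output vector."""
    state = np.zeros(D)
    tokens = []
    for _ in range(K):
        logits = (vector + state) @ W_out
        tok = int(np.argmax(logits))  # greedy for determinism in this toy
        tokens.append(tok)
        state = state + W_emb[tok]    # condition later tokens on earlier ones
    return tokens

def encode(tokens):
    """Pool token embeddings back into one input vector (stand-in encoder)."""
    return np.mean([W_emb[t] for t in tokens], axis=0)

output = rng.normal(size=D)          # one output vector from the main model
next_input = encode(decode(output))  # natural language is the intermediate
```

The key property is that `decode` produces ordinary ordered tokens (readable, in a trained system), rather than an unordered bag sampled independently.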
On a separate note: I mentioned essays-per-vector to illustrate that bandwidth concerns don't make sense to me, but I don't find that scenario plausible. My intuition is that attention is probably better than an MLP at learning symmetries, in language or otherwise, so an essay would always be best represented by more than one vector, holding total dimensionality or compute constant.
I think it's plausible that this effect is strong enough to keep the optimal amount of text per vector really quite small, especially given that Byte Latent Transformers and their ilk don't seem like magic bullets, but then we wouldn't need to worry about the bandwidth thing anyway. Of course, empirical evidence will be needed here, like a scaling law for patch size.
You want to relay the contents of a transformer's output vector to the next input: next_input = encode(decode(output)).
You're currently using next_input = embed(sample_token(output)) to do this.
This compresses the output to one row of a lookup table. That's a pretty big squash. Too big, surely -- there must be a better way.
Enter neuralese. If you make next_input = output (or something like that), you lose no bandwidth at all. The bad news is that these vectors no longer correspond to natural language.
But you can add more bandwidth by funneling each vector through more text. That way, encode(decode(output)) doesn't lose too much information.
You could have each vector decode to multiple tokens. Or even a cleverly chosen patch of bytes. Perhaps whole sentences, paragraphs, or essays one day.
I don't know which is best, but you do have options here, so it's not obvious to me that "you need more bandwidth" implies "you must abandon natural language" -- unless you're forcing it through a text intermediate as impoverished as, well, a lookup table.
Edit: Nothing new under the sun, it seems. Please look here for prior discussion of this topic, and consider this post a bump for it.