This is really interesting!
I'm curious what happens if you relax the accumulating sum, allowing the model to "reposition" the tokens as it desires. Due to the causal mask, the model already has access to the ordering (which is why NoPE is an effective positional encoding), but this might allow it to move related words near each other, even if they are not adjacent in the sentence.
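To make the idea concrete, here's a rough sketch of the difference, not the OP's actual code. `accumulated_positions`, `relaxed_positions` and `relative_offsets` are names I made up, and the numbers are invented. With an accumulating sum of non-negative increments the positions can only move forward; if each token just gets a free position, two distant tokens can land on the same spot and see a relative offset of zero.

```python
from itertools import accumulate

def accumulated_positions(increments):
    """Position of token t is the running sum of increments up to and including t."""
    return list(accumulate(increments))

def relaxed_positions(raw_positions):
    """Hypothetical relaxation: each token gets a freely chosen position, with no ordering constraint."""
    return list(raw_positions)

def relative_offsets(positions):
    """Offsets positions[q] - positions[k] for every causal pair (k <= q).

    These differences are what a rotary-style encoding would actually see.
    """
    return [[positions[q] - positions[k] for k in range(q + 1)]
            for q in range(len(positions))]

# Accumulating sum: non-negative increments force a monotone ordering.
acc = accumulated_positions([1.0, 0.5, 1.0, 0.25])

# Relaxed: tokens 1 and 3 are placed at the same position even though
# they are two steps apart in the sentence.
rel = relaxed_positions([0.0, 3.0, 0.25, 3.0])
offsets = relative_offsets(rel)
```

In the relaxed case `offsets[3][1]` is `0.0`, so token 3 would attend to token 1 as if it were right on top of it, while the causal mask still tells the model which one came first.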
Moreover, you could provide multiple dimensions of "position" with which to do this by, say, having a rotary encoding on the first half of the latent vectors and then a separate in...
Did you use a learning rate schedule? Cosine anneal? If not: probably should. If so: the loss/perplexity/bpc will always appear to plateau as you finish the schedule, so that doesn't imply more training wouldn't be beneficial...
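For reference, a minimal cosine anneal looks something like this. `cosine_lr` and its default values are placeholders of mine, not anything from the OP's setup. The point is that the learning rate is already tiny over the last stretch of the schedule, so a flat loss curve there mostly reflects the schedule.

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps.

    base_lr and min_lr defaults are illustrative assumptions.
    """
    progress = min(1.0, max(0.0, step / max(1, total_steps)))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at a few points of a 100-step schedule with peak 1.0.
schedule = [cosine_lr(s, 100, base_lr=1.0) for s in (0, 50, 90, 100)]
```

With 10% of the steps still to go, the rate is already below 3% of its peak, so very little learning happens in that tail no matter how much headroom the model has.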
Interesting! Currently, you are deciding the increment to the next token... using which activations? Post transformer? Post MLP?
It seems like maybe what you're looking for is, for each layer, to determine the increment from the previous token using the output of the previous layer's MLP (for the current token). Since layer 0 has no previous layer to reference, maybe it just uses standard RoPE, or maybe you do something similar to what you're doing now and determine the layer 0 increment from the end representation of the previous token.
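Roughly what I have in mind, as a toy sketch only. Everything here is an assumption on my part: the sigmoid gate, the weight vectors in `gates`, and the names `increments_from_hidden` and `layer_positions` are all hypothetical, and I'm reading "increment" as the step from the previous token to the current one, computed from the current token's hidden state. Layer 0 falls back to plain integer positions (standard RoPE).

```python
import math
from itertools import accumulate

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def increments_from_hidden(hidden, w):
    """One increment per token: sigmoid of a dot product with a gate vector w (hypothetical gate)."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h))) for h in hidden]

def layer_positions(layer_outputs, gates):
    """Positions used by each layer.

    layer_outputs[l] holds the per-token vectors after layer l's MLP.
    gates[l] is the gate vector for layer l (gates[0] is unused).
    Layer 0 uses integer positions; layer l > 0 accumulates increments
    computed from layer l-1's outputs.
    """
    n_tokens = len(layer_outputs[0])
    positions = [list(range(n_tokens))]
    for l in range(1, len(layer_outputs)):
        inc = increments_from_hidden(layer_outputs[l - 1], gates[l])
        positions.append(list(accumulate(inc)))
    return positions

# Toy run: 3 tokens, 2 layers, 2-dim hidden states that are all zero,
# so every layer-1 increment is sigmoid(0) = 0.5.
outputs = [[[0.0, 0.0]] * 3, [[0.0, 0.0]] * 3]
pos = layer_positions(outputs, gates=[None, [1.0, 1.0]])
```

Here `pos[0]` is the usual `[0, 1, 2]`, and `pos[1]` is whatever the gate decides from the layer 0 outputs, so each layer gets its own notion of distance.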