How Far Apart Does a Model Think Its Tokens Are?

Brendan Long

Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This has minimal effect on model performance but allows us to see what the model thinks the distance is between each position and how this varies per-layer.

Six rows, one per layer (L0–L5), showing the Marie Curie sentence with characters spaced by each layer's learned increments, each row scaled independently. Faint lines connect each character across layers; gaps at spaces, the hyphen, and commas shift from layer to layer. — Example sentence with each character plotted based on per-layer learned position increments. Note the clear punctuation-based boundaries in L0 and what looks like concept-based grouping in L3.

I think this might be useful as another technique to inspect "where the model is looking" in addition to plotting attention patterns (and with similar limitations). The patterns can also hint at what the model is looking for at each layer (when position increments match different kinds of boundaries).

Note: This is still partially a solution in search of a problem. I'm hoping to help with the "searching under lamp posts" problem by finding more lamp posts, but there's additional work to be done here to see if this is actually useful or just a novelty.

AI disclaimer: The Architecture, Learned Position Increments, and Related Work sections were originally drafted by Claude before being (heavily) human-edited.

Introduction

Standard LLMs use Rotary Position Embeddings (RoPE) to encode position: each query and key vector is rotated by an angle proportional to its token's position, so the attention score between two tokens depends on the number of tokens between them.

Standard RoPE assumes that each token advances the position counter by +1, but we can train a model to advance the position counter by a learned increment per-token. Going further, we can learn a per-layer position increment vector, allowing us to calculate content-based position increments at any layer of the model.

Method

Architecture

The models are small decoder-only transformers (256-dimensional, 8 heads, 6 layers, ~6.4M parameters, with RMSNorm, SwiGLU MLPs, and RoPE with θ = 10,000) that operate directly on raw UTF-8 bytes rather than BPE tokens. The vocabulary is 257 symbols: 256 byte values plus a document separator.
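As a rough illustration of the byte-level input, the sketch below turns documents into token ids. The id assignment (0–255 for byte values, 256 for the separator) and the choice to append the separator after each document are assumptions; the post only states that the vocabulary has 257 symbols.

```python
# Assumption: ids 0-255 are raw byte values and id 256 is the document
# separator. The post only says the vocabulary has 257 symbols.
DOC_SEP_ID = 256
VOCAB_SIZE = 257


def encode_documents(docs):
    """Concatenate documents as UTF-8 byte ids, ending each with a separator."""
    ids = []
    for doc in docs:
        ids.extend(doc.encode("utf-8"))  # each byte becomes one token id
        ids.append(DOC_SEP_ID)
    return ids


ids = encode_documents(["Hi", "人"])
print(ids)  # [72, 105, 256, 228, 186, 186, 256]
```

Note that the single Chinese character becomes three tokens, which is why the model has to discover character and word boundaries on its own.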

I focus on byte-level transformers because they need to find their own word boundaries, which makes the early-layer behavior more interesting. This technique also works on BPE models, but the per-token position increments aren't as interesting since some aggregation has already been done by the tokenizer.

Learned position increments

Standard RoPE advances the position counter by +1 per token and rotates each query and key by an angle proportional to that position. I replace the fixed +1 with a learned, per-token increment. A small MLP (Linear → GELU → Linear → softplus) reads a token's hidden state and emits a strictly positive increment δ. I call this DeltaMLP.
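A minimal pure-Python sketch of such an increment network is shown below. The hidden width, the random weight scale, and the zero-initialized output weights are assumptions for illustration; only the Linear → GELU → Linear → softplus shape and the δ ≈ 1 starting point come from the post.

```python
import math
import random


def gelu(x):
    # Exact GELU using the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class DeltaMLP:
    """Linear -> GELU -> Linear -> softplus, giving one scalar per token."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                   for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        # Assumption: zero output weights, so the bias alone sets the
        # initial increment.
        self.w2 = [0.0] * d_hidden
        # softplus(b) = 1  <=>  b = log(e - 1)
        self.b2 = math.log(math.e - 1.0)

    def __call__(self, h):
        hidden = [gelu(sum(w * x for w, x in zip(row, h)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return softplus(sum(w * a for w, a in zip(self.w2, hidden)) + self.b2)


mlp = DeltaMLP(d_model=8, d_hidden=16)
delta = mlp([0.3] * 8)  # increment for one hypothetical hidden state
```

With these choices every token starts with an increment of 1 regardless of its hidden state, and training moves the increments away from 1 only where that helps.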

A token's position is the running sum of the increments up to and including it, and I apply the ordinary RoPE rotation using the calculated position.

I initialize the MLP's output bias so that δ ≈ 1 everywhere, so each model starts as (approximately) standard integer-position RoPE and any deviation is learned. Because positions are still a cumulative sum, the rotation between a query and a key continues to only depend on the difference between their learned positions.
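The relative-position property can be checked numerically. The sketch below uses hypothetical increments and vectors, and rotates consecutive pairs of dimensions (implementations differ in how they pair dimensions, so treat the layout as an assumption).

```python
import math
from itertools import accumulate


def rope_rotate(vec, position, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)  # one frequency per 2-D pair
        angle = position * freq   # position may be any real number
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


deltas = [1.1, 0.8, 0.7, 1.05, 1.2]   # hypothetical learned increments
positions = list(accumulate(deltas))  # running sum, inclusive of each token

q = [0.5, -1.0, 0.25, 2.0]  # hypothetical query at token 4
k = [1.5, 0.3, -0.7, 0.9]   # hypothetical key at token 1

score = dot(rope_rotate(q, positions[4]), rope_rotate(k, positions[1]))
# Shifting both positions by the same offset leaves the score unchanged,
# so only the difference between learned positions matters.
shifted = dot(rope_rotate(q, positions[4] + 3.7),
              rope_rotate(k, positions[1] + 3.7))
assert abs(score - shifted) < 1e-9
```

Nothing in the rotation requires integer positions, which is what makes real-valued cumulative increments a drop-in replacement.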

Flowchart of a transformer language model with learned position increments. Token IDs flow through a token embedding into a stack of 6 transformer layers, then a final RMSNorm and tied-weight unembedding to produce logits. Inside each layer, the input forks: the main path goes through a standard pre-norm attention block and SwiGLU FFN with residual connections, while a side branch feeds a Delta MLP that outputs one positive scalar per token. These scalars are cumulatively summed into real-valued positions, converted to RoPE angles, and used to rotate Q and K in that layer's attention — replacing the integer position index. Each layer has its own Delta MLP, and increments initialize near 1 so training starts as standard RoPE.

The idea of learning positional increments isn't unique or novel. See Related Work for other papers which have tried similar things (generally for capabilities reasons).

I study two variants:

Shared: one DeltaMLP reads the token embeddings, so δ depends only on the token and is identical at every layer.
Per-layer: each layer has its own DeltaMLP that reads that layer's hidden state, so δ varies per-layer and takes the full residual into account. Hidden-state norms grow with depth, so for stability I RMSNorm the input and use a sigmoid to bound the max increment to max_delta = 10.
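One way the bounded per-layer increment could look is sketched below. The gain-free RMSNorm, the form max_delta · sigmoid(z), and the bias that makes the initial increment equal to 1 are all assumptions; the post only says the input is RMSNormed and a sigmoid bounds the increment to max_delta = 10.

```python
import math

MAX_DELTA = 10.0


def rms_norm(h, eps=1e-6):
    """RMSNorm without a learned gain (assumption for this sketch)."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bounded_delta(logit, max_delta=MAX_DELTA):
    """Map an unbounded logit to an increment in [0, max_delta]."""
    return max_delta * sigmoid(logit)


# Bias that gives an initial increment of 1:
# max_delta * sigmoid(b) = 1  <=>  b = -log(max_delta - 1)
init_bias = -math.log(MAX_DELTA - 1.0)

small = rms_norm([3.0, 4.0])
large = rms_norm([300.0, 400.0])  # same direction, 100x the norm
```

After normalization `small` and `large` are nearly identical, so growth in hidden-state norm with depth no longer pushes the logit (and therefore the increment) toward the bound.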

Data and training

I train on one epoch of an even mix of English and Chinese Wikipedia (wikimedia/wikipedia configs 20231101.en and 20231101.zh) at a 512-byte context length, with a held-out validation split drawn from disjoint documents. Each model trains for 50k steps with AdamW (learning rate 1e-3, weight decay 0.01, cosine schedule, gradient clipping) in bf16. For the loss comparison I train standard RoPE and both shared and per-layer learned increment RoPE, under identical settings.

Chinese characters are represented in UTF-8 as a lead byte (0xE4–0xE9) followed by two continuation bytes, so I predicted that English capital letters and Chinese lead bytes would be treated similarly by the models.
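The byte structure described above is easy to inspect directly. This sketch classifies each UTF-8 byte of a string; the role names are labels chosen for illustration.

```python
def byte_roles(text):
    """Label each UTF-8 byte as ASCII, a lead byte, or a continuation byte."""
    roles = []
    for b in text.encode("utf-8"):
        if b < 0x80:
            roles.append((b, "ascii"))
        elif b <= 0xBF:
            roles.append((b, "continuation"))  # 0x80-0xBF
        else:
            roles.append((b, "lead"))          # starts a multi-byte character
    return roles


for b, role in byte_roles("人工"):
    print(f"0x{b:02X} {role}")
```

For the characters in the example sentence used later in the post, every lead byte falls in 0xE4–0xE9 (the full-width period 。 is the exception, with lead byte 0xE3).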

Results

Per-Token Increments

On the bilingual English and Chinese language model, I found that the model learned smaller increments for lowercase characters and word-internal bytes and larger increments for uppercase letters, start-of-word bytes, punctuation and other boundaries.

Category	Examples	Learned Increment δ
English (lowercase)	a-z	0.68–0.96 (mean 0.79)
Chinese (continuation byte)	`0x80–0xBF`	0.73–0.86 (mean 0.80)
Chinese (lead byte)	`0xE4–0xE9`	0.84–0.98 (mean 0.92)
Word boundary	space	1.05
English (uppercase)	A-Z	1.01–1.29 (mean 1.10)
Punctuation	. , ; ! ?	1.10–1.29 (mean 1.18)
Line boundary	newline	2.12
Other boundaries	EOS	2.90

English uppercase letters and Chinese lead bytes both show larger gaps than lowercase and continuation bytes. Since Chinese lead bytes are significantly more common than uppercase letters, it makes sense that the model seems to consider uppercase to be a stronger signal of a boundary.

If we plot each character spaced by their relative position increments, we can visually see how close the model thinks characters are together:

The sentence "Marie Curie, a Polish-born physicist, won two Nobel Prizes." shown in a single row, with each character placed so that the gap to its left is proportional to its learned position increment. Lowercase letters within a word sit close together; capital letters and punctuation sit a little farther from the character before them; and the spaces between words are the widest gaps.

In Chinese, we (unfortunately) can't display individual bytes so we sum the increments for each character, causing the average character spacing to be very uniform with no obvious word boundaries.
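A sketch of that per-character summing is below. The function name and the example increments are hypothetical; the increments use the table means (0.92 for a lead byte, 0.80 for a continuation byte, 0.79 for a lowercase letter).

```python
def per_character_increments(text, byte_deltas):
    """Sum byte-level increments over the bytes of each character."""
    if len(text.encode("utf-8")) != len(byte_deltas):
        raise ValueError("need exactly one increment per UTF-8 byte")
    totals = []
    start = 0
    for ch in text:
        width = len(ch.encode("utf-8"))  # 1 for ASCII, 3 for these CJK chars
        totals.append((ch, sum(byte_deltas[start:start + width])))
        start += width
    return totals


# One Chinese character (lead + two continuation bytes) and one ASCII letter.
example = per_character_increments("人a", [0.92, 0.80, 0.80, 0.79])
```

With the mean values, every Chinese character sums to roughly 0.92 + 0.80 + 0.80 = 2.52, which is consistent with the near-uniform spacing in the plot.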

The Chinese sentence '人工智能是计算机科学的一个分支。' with each character placed at its learned position; the sixteen characters are spaced almost perfectly evenly. — According to Claude, this sentence translates to, "Artificial intelligence is a branch of computer science."

First Layer of Per-Layer Model

On the per-layer model, I found that the learned increments tended to explode by default, so I bounded each increment to max_delta = 10.

The model trained with that architecture found larger increments but shows the same pattern as the shared-MLP model for the first layer.

Category	Examples	Learned Increment δ (L0)
English (lowercase)	a-z	1.21–2.53 (mean 1.64)
Chinese (continuation byte)	`0x80–0xBF`	1.57–2.08 (mean 1.79)
Chinese (lead byte)	`0xE4–0xE9`	2.04–2.72 (mean 2.43)
English (uppercase)	A-Z	2.87–9.98^[1] (mean 9.52)
Punctuation	. , ; ! ?	9.80–9.98 (mean 9.90)
Other boundaries	EOS	9.82
Word boundary	space	9.99
Line boundary	newline	9.99

Chinese Word Boundaries

Since Chinese doesn't have spaces between words, I was interested to see if the model would learn word boundaries from Chinese text without punctuation, so I ran my per-layer model on held-out text from Chinese Wikipedia and compared my learned increments to word boundaries detected by jieba (a Chinese word segmenter).

I measured how well the learned increment at each layer separates true word boundaries from non-boundaries, as an ROC-AUC (0.5 = chance; 1.0 = boundaries always get the larger increment; 0.0 = boundaries always get the smaller increment, which separates them equally well). I score only the gaps between two Chinese characters (no space or punctuation), using the increment at the next character's leading byte.
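For reference, ROC-AUC can be computed as the probability that a randomly chosen boundary gap has a larger increment than a randomly chosen non-boundary gap. The sketch below is a plain pairwise implementation, not the evaluation code from the repository; the boundary labels are assumed to come from a segmenter, and the example numbers are made up.

```python
def roc_auc(scores, labels):
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one boundary and one non-boundary")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical increments at lead bytes and whether a word starts there.
increments = [2.4, 0.9, 2.1, 1.0, 0.8]
is_boundary = [1, 0, 1, 0, 0]
auc = roc_auc(increments, is_boundary)
```

A layer that consistently shrinks the increment at boundaries scores below 0.5 under this definition, so a value such as 0.37 still carries boundary information, just with the opposite sign.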

Layer (increment computed from)	Chinese word-boundary AUC
L0 (byte identity)	0.50 (chance)
L1	0.54
L2	0.68
L3	0.37
L4	0.63
L5	0.47

The first layer is unable to detect word boundaries since it only sees the byte's embedding and has no contextual information, but the middle layers (L2–L4) are able to distinguish word boundaries (although L3 seems to be compressing boundaries rather than expanding them).

Per-Layer Plots

We plot the same sentences from above but using per-layer position increments. Each layer is scaled independently to make the results legible.

Six rows (L0–L5) of the Chinese sentence. Layer 0 is nearly uniform; the middle layers pack the multi-character words 人工智能 and 计算机科学 closer together and open small gaps around the single-character words 是 and 的 and before the period. — The structure is hard to see, but jieba segments this as 人工智能 / 是 / 计算机科学 / 的 / 一个 / 分支 / 。, and the model seems to be recovering some of the gaps well (especially in L2 and later).

If we remove the per-layer normalization, we can also see that later layers want smaller position increments.

The same six-layer plot of the Marie Curie sentence with all layers on one shared scale. Early layers (L0–L2) spread the characters across a wide span while later layers (L3–L5) compress them into a much shorter one. — The same Marie Curie sentence above with all increments displayed on the same scale.
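The two scalings can be sketched as follows. This is an assumption about how the plots might be laid out (dividing by each layer's own total span versus the largest span across layers), not the plotting code from the repository, and the increments are made up.

```python
from itertools import accumulate


def layout(deltas_per_layer, shared_scale=False):
    """Turn per-layer increments into x-coordinates in (0, 1]."""
    positions = [list(accumulate(d)) for d in deltas_per_layer]
    global_span = max(p[-1] for p in positions)
    rows = []
    for p in positions:
        # Independent scaling stretches every layer to full width;
        # shared scaling keeps the layers comparable to each other.
        span = global_span if shared_scale else p[-1]
        rows.append([x / span for x in p])
    return rows


layers = [[2.0, 2.0], [0.5, 0.5]]  # hypothetical early and late layer
independent = layout(layers)
shared = layout(layers, shared_scale=True)
```

Under independent scaling both hypothetical layers look identical; only the shared scale reveals that the second layer uses much smaller increments.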

Grouping Multi-word Entities

The plots above made me wonder if the model groups multi-word entities like "Marie Curie" or "New York". To test this, I ran inference on a set of prompts with either a multi-word entity or the reversed version (i.e. "New York" or "York New") and compared the learned increment at the space token. The prompts were "A B", "the A B", "I visited A B", "near A B", and "they went to A B".

The results show that there was no difference in spacing in L0 (as expected) but the spacing is significantly smaller in the other layers for the real direction ("New York") vs the reversed direction ("York New").

Layer (increment from)	δ real order	δ reversed	% smaller space for real order	p (two-sided)
L0 (byte identity)	9.99^[1]	9.99	0%	1.0
L1	1.42	1.43	51%	0.28 (n.s.)
L2	1.43	1.54	71%	3e-5
L3	0.06	0.10	66%	6e-5
L4	0.86	1.21	77%	3e-8
L5	0.47	0.64	78%	3e-7

Since the model is predicting spacing before seeing the second word, this only works if the model can predict that the word will be continued ("New [York]") and didn't work with fake multi-word entities like "Zorblax [Quimby]".
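The comparison can be sketched as below. The prompt templates are the ones listed above; the helper names are hypothetical, and the sign test is an assumption, since the post does not say which statistical test produced the p-values.

```python
import math

TEMPLATES = ["{a} {b}", "the {a} {b}", "I visited {a} {b}",
             "near {a} {b}", "they went to {a} {b}"]


def make_prompt_pairs(entities):
    """Build (real order, reversed order) prompts for each entity."""
    return [(t.format(a=first, b=second), t.format(a=second, b=first))
            for first, second in entities for t in TEMPLATES]


def space_byte_index(template, first):
    """Byte index of the space that follows the first word."""
    prefix = template.split("{a}")[0]
    return len((prefix + first).encode("utf-8"))


def compare_orders(real, reversed_):
    """Fraction of pairs with a smaller real-order increment, and a
    two-sided sign-test p-value (assumed test; ties are dropped)."""
    if len(real) != len(reversed_) or not real:
        raise ValueError("need two equal-length, non-empty lists")
    smaller = sum(r < v for r, v in zip(real, reversed_))
    larger = sum(r > v for r, v in zip(real, reversed_))
    n = smaller + larger
    k = min(smaller, larger)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return smaller / len(real), min(1.0, 2.0 * tail)


pairs = make_prompt_pairs([("New", "York")])
```

When every pair is tied, as in L0 where both orders sit at the bound, this yields 0% and p = 1.0, matching the first row of the table.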

Mostly Loss Neutral

I consistently found that the learned position increments have a barely detectable effect on loss and perplexity.

Training loss curve from Weights and Biases showing 7 different training runs with nearly-identical loss dropping from 3.5 to approximately 1.13 over the course of 50,000 steps. — Training loss for 7 different architectures including a baseline (byte_rope_bilingual) and some additional versions not described here, showing no visible loss difference except for a few spikes where learned positional increments are briefly worse.

Since the models do learn meaningful position increments, this implies that they must provide some benefit (or else there would be no gradient pressure), but I suspect that positional encoding is just not the bottleneck for LM performance.

Supporting evidence for this is that LMs can work around a complete lack of positional information (Haviv et al., 2022).

Limitations

I only trained a small number of models and with very little variation between architectures.
Because the learned position increments didn't meaningfully improve loss, the gradient signal for them to be useful is very weak. In practice, they seemed to be consistent and meaningful, but I only inspected a small number of models and layers.
I never trained a large model from scratch and it's unclear if the models learn the same position increments during fine-tuning as they would when learning from scratch.
I didn't train per-layer position increment vectors on a large model.

Future Work

The method appears to work, but the real test will be if we can find anything interesting from this data. Some things I think it might be useful for are:

Finding summary positions, where inspecting the model with other tools would be particularly useful. For example, the last token before a large positional increment may be interesting.
Understanding what a model is looking for at each layer, especially open-ended investigation of larger models.
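As a rough sketch of the first idea, candidate summary positions could be picked by thresholding the increments. The function name, threshold, and example values are hypothetical.

```python
def candidate_summary_positions(deltas, threshold):
    """Indices of tokens that sit immediately before a large increment."""
    return [i - 1 for i in range(1, len(deltas)) if deltas[i] >= threshold]


# Hypothetical increments: large jumps at tokens 2 and 4.
candidates = candidate_summary_positions([0.8, 0.7, 2.1, 0.8, 2.9], 2.0)
print(candidates)  # [1, 3]
```

The returned indices would then be the places to point other tools at, such as attention plots or probes.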

I also think the structure may be more interesting with different data sets. For example, I found that a model trained on code detected different kinds of structure in each layer.

There are also improvements that could be made to the method:

Determining the best way to train the per-layer position increment vectors. Per-token increments trained easily, but per-layer vectors required additional oversight and I doubt that my method and hyperparameters were the best way to do this. I just used the first method that worked.
Investigating a version of ALiBi with a learned per-token penalty — the forget gate from Selective RoPE (Movahedi et al., 2025). I was able to train models with this architecture but haven't tried to interpret the results yet.
Figuring out a way to learn more forward-looking position increments. Right now, when generating the increment for "New ", the model needs to decide on the space increment before it sees "York". BPE helps with this somewhat since spaces usually get collapsed. I wonder if we could allow a model to retroactively change the increments on seeing later words, but I'm not sure if this can be done without making training unstable.

I also fine-tuned an existing model with learned per-token position increments to see if this could be added after pretraining. The increments changed in the expected directions, but very slowly. I haven't tried the per-layer version or inspected the results yet, and getting increments on the scale of my other results would require either tuning or a much longer run.

Weights and Biases screenshot showing learned increments (deltas) over time, with the standard deviation increasing from 0 to 0.006, and the min and max changing by about 0.001 each. — Learned position increment stats for a fine-tuning run on SmolLM2-1.7B

I'm always happy to discuss this further if anyone's interested. I'm working independently, so it's very difficult for me to keep track of what's going on in the mech interp world on my own.

Related Work

Learned, input-dependent positions have been proposed several times; I came to most of this after running the experiments.

CARoPE (Veisi et al., 2025) accumulates per-token, per-head, per-frequency-band rotation frequencies; my scalar increment is a strict special case (one value shared across all bands and heads), so I claim no mechanical novelty for the scalar variant — the contribution here is the interpretability angle.
CoPE (Golovneva et al., 2024) advances position by a contextual gate (a sigmoid of query–key interactions), intended as a soft counter of salient tokens; mine is a per-token increment that can run the position clock faster or slower than one-per-token.
Selective RoPE (Movahedi et al., 2025) is closest to my per-layer variant — input-dependent arbitrary rotation angles, mostly on gated/linear-attention models — and explicitly leaves analysis of the learned phase gate to future work, which I do here.
Layer-specific RoPE scaling (Wang et al., 2025) applies a fixed, input-independent per-layer frequency rescale; my per-layer increments are learned and input-dependent.

Code

All code is available on GitHub at brendanlong/learned-position-increments-experiment.

^[1] Our per-layer model is bounded with max_delta = 10, so interpret any value of ~10 as an increment "as high as the model is allowed to set it".

Joseph Eisner (1mo)

This is really interesting!

I'm curious what happens if you relax the accumulating sum, allowing the model to "reposition" the tokens as it desires. Due to the causal mask the model already has access to the ordering (which is why NoPE is an effective positional encoding), but this might allow it to move related words near each other, even if they are not adjacent in the sentence.

Moreover, you could provide multiple dimensions of "position" with which to do this, by say having a rotary encoding on the first half of the latent vectors and then a separate independently learned encoding on the second half.

I wonder how (or if) the model would use this to group tokens at various layers.

Brendan Long (1mo)

I'm not sure how interesting this is, but I relaxed the positive-increment constraint and removed the sigmoid, and the model only learned one negative increment, specifically the last byte of the Chinese possessive modifier (的), although 's in English sometimes got slightly-negative increments.

There are definitely aspects of this that make me think it's constrained by the inability to change its mind, since s' in English (plural possessive) gets a larger increment, presumably because it's too late to change "s" and the model likes to separate on punctuation. I'm guessing if we can figure out a way to let the model decide on spacing after seeing the whole sentence it would do something more interesting.

Joseph Eisner (1mo)

Interesting! Currently, you are deciding the increment to the next token... using which activations? Post transformer? Post MLP?

It seems like maybe what you're looking for is, for each layer, to determine the increment from the previous token after the previous layer's MLP (for the current token). Since at layer 0 you have no previous layer to reference, maybe it just uses standard RoPE, or maybe you do something similar to what you're doing now where you determine the layer 0 increment based on the end representation of the previous token.

The DeltaMLP blocks consume the incoming residual stream (or embeddings) for the current token between the transformer blocks, so we always have one (although position 0's own increment is meaningless since only relative position matters). For layer 0, we use the token's embedding, so the increment is fully static per-token.

From the attention section's perspective, we're feeding in the embedding or residual the same way we normally would, and the only difference is that Q and K are also rotated based on the sum of all calculated position increments instead of the position count.

I think the requirement that increments be positive probably isn't necessary, although the problem where the model can't look forward prevents a lot of uses for this (it might work decently with BPE though).

For retroactively changing positions, I think training would be unstable and finicky, but it would be interesting if you could get it to work.

Something related that I'm interested in is BLT/H-Net style models that do dynamic chunking instead of tokenization. H-Net does this at multiple scales, so it could theoretically group words in the first layer and concepts in the second. I haven't actually tried to train or inspect one yet though.

Elliot Callender (1mo)

Agree with Joseph, this is really cool stuff!

Looks to me like intermediate layers are using positions in a way very alien to humans; like, there's some obvious "natural" semantic segmentation, but I'm not discerning a legible pattern to the distances between individual chars.

Brendan Long (1mo)

I'm not sure if I would read too much into the individual character variations yet, since this model is small and the current setup is hacky and not well optimized. I'm hoping to do a larger run once I can get the training to be more stable. I have some experiments running now trying to balance weight decay/penalties to find a version that's stable without just collapsing everything back to 1.