How Far Apart Does a Model Think Its Tokens Are?

Brendan Long

Instead of using static position increments (+1) per token, RoPE-based language models can learn per-token and per-layer position increments. This has no detectable effect on model performance but allows us to see what the model thinks the distance is between each position and how this varies per-layer.

Six rows, one per layer (L0–L5), showing the Marie Curie sentence with characters spaced by each layer's learned increments, each row scaled independently. Faint lines connect each character across layers; gaps at spaces, the hyphen, and commas shift from layer to layer.

I think this might be useful as another technique to inspect "where the model is looking" in addition to plotting attention patterns (and with similar limitations). The patterns can also hint at what the model is looking for at each layer (when position increments match different kinds of boundaries).

Note: This is still partially a solution in search of a problem. I'm hoping to help with the "searching under lamp posts" problem by finding more lamp posts, but there's additional work to be done here to see if this is actually useful or just a novelty.

AI disclaimer: The Architecture, Learned Position Increments, and Related Work sections were originally drafted by Claude before being (heavily) human-edited.

Introduction

Standard LLMs use Rotary Position Embeddings (RoPE) to encode position: each query and key vector is rotated by an angle proportional to its position, so the attention score between two positions depends on the number of tokens between them.

Standard RoPE assumes that each token advances the position counter by +1, but we can train a model to advance the position counter by a learned increment per-token. Going further, we can learn a per-layer position increment vector, allowing us to calculate content-based position increments at any layer of the model.

Method

Architecture

The models are small decoder-only transformers (256-dimensional, 8 heads, 6 layers, ~6.4M parameters, with RMSNorm, SwiGLU MLPs, and RoPE with θ = 10,000), trained directly on raw UTF-8 bytes rather than BPE tokens. The vocabulary is 257 symbols: 256 byte values plus a document separator.
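As a minimal sketch of what a 257-symbol byte vocabulary looks like in practice: the separator id (256) and the choice to append it after every document are assumptions made for illustration, not details taken from the repository.

```python
SEP_ID = 256       # assumed id for the document separator; byte values use ids 0-255
VOCAB_SIZE = 257   # 256 byte values plus the separator

def encode_documents(docs):
    """Flatten documents into one stream of byte ids, appending SEP_ID after each."""
    ids = []
    for doc in docs:
        ids.extend(doc.encode("utf-8"))  # iterating over bytes yields ints 0-255
        ids.append(SEP_ID)
    return ids

ids = encode_documents(["Hi", "人"])
print(ids)  # [72, 105, 256, 228, 186, 186, 256]
```

Note that the single Chinese character becomes three separate input positions, which is why the model has to work out character and word boundaries on its own.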

I focus on byte-level transformers because they need to find their own word boundaries, which makes the early-layer behavior more interesting. This technique also works on BPE models, but the per-token position increments aren't as interesting since some aggregation has already been done by the tokenizer.

Learned position increments

Standard RoPE advances the position counter by +1 per token and rotates each query and key by an angle proportional to that position. I replace the fixed +1 with a learned, per-token increment. A small MLP — DeltaMLP (Linear → GELU → Linear → softplus) — reads a token's hidden state and emits a strictly positive increment δ.

A token's position is the running sum of the increments up to and including it, and I apply the ordinary RoPE rotation using the calculated position.
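A minimal sketch of this step, assuming the interleaved (even, odd) pairing convention for RoPE; implementations differ on pairing, and the increments and vectors below are made up for illustration. It also checks that shifting both positions by the same amount leaves the attention score unchanged.

```python
import math
from itertools import accumulate

def positions_from_increments(deltas):
    """Position of token i is the sum of increments 0..i (inclusive)."""
    return list(accumulate(deltas))

def rope_rotate(vec, position, theta=10_000.0):
    """Rotate consecutive (even, odd) pairs of vec by position * frequency."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        freq = theta ** (-i / dim)       # lower frequency for later pairs
        angle = position * freq          # position may be fractional
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and key."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))

deltas = [1.10, 0.79, 0.79, 0.79, 0.79, 1.05]  # made-up increments
pos = positions_from_increments(deltas)
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

base = score(q, k, pos[5], pos[1])
shifted = score(q, k, pos[5] + 3.7, pos[1] + 3.7)
print(abs(base - shifted) < 1e-9)  # True
```

With all increments equal to 1, `positions_from_increments` returns 1, 2, 3, ..., which differs from the usual 0-based RoPE positions only by a constant offset that cancels in the score.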

I initialize the MLP's output bias so that δ ≈ 1 everywhere, so each model starts out as (approximately) standard integer-position RoPE and any deviation is learned. Because positions are still a cumulative sum, the rotation between a query and a key still depends only on the difference between their learned positions.
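One way to get δ ≈ 1 at initialization is to set the output bias to the inverse softplus of 1, which is ln(e − 1) ≈ 0.5413. The sketch below is a plain-Python stand-in for the DeltaMLP described above; the hidden size, weight scale, and zeroed output weights are assumptions chosen so the initial increment is exactly 1, and the actual implementation may differ.

```python
import math
import random

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def inverse_softplus(y):
    """Return x such that softplus(x) == y, for y > 0."""
    return math.log(math.expm1(y))

class DeltaMLP:
    """Linear -> GELU -> Linear -> softplus, producing one scalar increment."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                   for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        # Assumption: zero output weights, so the initial delta is softplus(b2).
        self.w2 = [0.0] * d_hidden
        self.b2 = inverse_softplus(1.0)  # about 0.5413

    def __call__(self, h):
        hidden = [gelu(sum(w * x for w, x in zip(row, h)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return softplus(sum(w * a for w, a in zip(self.w2, hidden)) + self.b2)

mlp = DeltaMLP(d_model=8, d_hidden=16)
delta = mlp([0.5] * 8)
print(abs(delta - 1.0) < 1e-9)  # True
```

With small random output weights instead of zeros, the initial increments would be close to 1 rather than exactly 1.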

The idea of learning positional increments isn't unique or novel. See Related Work for other papers which have tried similar things (generally for capabilities reasons).

I study two variants:

Shared: one DeltaMLP reads the token embeddings, so δ depends only on the token and is identical at every layer.
Per-layer: each layer has its own DeltaMLP that reads that layer's hidden state, so δ varies per-layer and takes the full residual into account. Hidden-state norms grow with depth, so for stability I RMSNorm the input and use a sigmoid to bound the max increment to max_delta = 10.
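A rough sketch of the two stabilizers mentioned for the per-layer variant: normalizing the hidden state and bounding the increment with a scaled sigmoid. The function names and the suggested initial bias are assumptions for illustration; the MLP weights between the two steps are omitted.

```python
import math

MAX_DELTA = 10.0

def rms_norm(h, eps=1e-6):
    """Divide by the root-mean-square so the input scale does not grow with depth."""
    rms = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / rms for x in h]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bounded_increment(pre_activation, max_delta=MAX_DELTA):
    """Map any real pre-activation to an increment between 0 and max_delta."""
    return max_delta * sigmoid(pre_activation)

# Assumed initialization: a bias of logit(1 / max_delta) gives delta = 1 at the start.
init_bias = math.log((1.0 / MAX_DELTA) / (1.0 - 1.0 / MAX_DELTA))

print(bounded_increment(0.0))                             # 5.0
print(abs(bounded_increment(init_bias) - 1.0) < 1e-9)     # True
```

Because the sigmoid saturates, an increment reported as roughly 10 means the pre-activation was large, not that the model chose exactly 10.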

Data and training

I train on one epoch of an even mix of English and Chinese Wikipedia (wikimedia/wikipedia configs 20231101.en and 20231101.zh) at a 512-byte context length, with a held-out validation split drawn from disjoint documents. Each model trains for 50k steps with AdamW (learning rate 1e-3, weight decay 0.01, cosine schedule, gradient clipping) in bf16. For the loss comparison I train standard RoPE and both shared and per-layer learned increment RoPE, under identical settings.

Chinese characters are represented in UTF-8 as a lead byte (0xE4–0xE9) followed by two continuation bytes, so I predicted that English capital letters and Chinese lead bytes would be treated similarly by the models.
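To make the byte structure concrete, the following sketch labels each UTF-8 byte of a string as ASCII, a lead byte, or a continuation byte. The helper name is hypothetical; the bit test for continuation bytes (`10xxxxxx`) comes from the UTF-8 encoding itself.

```python
def utf8_byte_roles(text):
    """Label each UTF-8 byte as 'ascii', 'lead', or 'continuation'."""
    roles = []
    for b in text.encode("utf-8"):
        if b < 0x80:
            roles.append((b, "ascii"))
        elif (b & 0xC0) == 0x80:          # bit pattern 10xxxxxx
            roles.append((b, "continuation"))
        else:
            roles.append((b, "lead"))
    return roles

for b, role in utf8_byte_roles("A人"):
    print(f"0x{b:02X} {role}")
# 0x41 ascii
# 0xE4 lead
# 0xBA continuation
# 0xBA continuation
```

Common Chinese characters (U+4E00 to U+9FFF) all start with a lead byte in 0xE4–0xE9, while CJK punctuation such as 。 (U+3002) starts with 0xE3.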

Results

Per-Token Increments

On the bilingual English and Chinese language model, I found that the models learned smaller increments for lowercase characters and word-internal bytes and larger increments for uppercase letters, start-of-word bytes, punctuation and other boundaries.

Category	Examples	Learned Increment δ
English (lowercase)	a-z	0.68–0.96 (mean 0.79)
Chinese (continuation byte)	`0x80–0xBF`	0.73–0.86 (mean 0.80)
Chinese (lead byte)	`0xE4–0xE9`	0.84–0.98 (mean 0.92)
Word boundary	space	1.05
English (uppercase)	A-Z	1.01–1.29 (mean 1.10)
Punctuation	. , ; ! ?	1.10–1.29 (mean 1.18)
Line boundary	newline	2.12
Other boundaries	EOS	2.90

English uppercase letters and Chinese lead bytes both show larger gaps than lowercase and continuation bytes. Since Chinese lead bytes are significantly more common than uppercase letters, it makes sense that the model seems to consider uppercase to be a stronger signal of a boundary.

If we plot each character spaced by their relative position increments, we can visually see how close the model thinks characters are together:

The sentence "Marie Curie, a Polish-born physicist, won two Nobel Prizes." shown in a single row, with each character placed so that the gap to its left is proportional to its learned position increment. Lowercase letters within a word sit close together; capital letters and punctuation sit a little farther from the character before them; and the spaces between words are the widest gaps.

In Chinese, we (unfortunately) can't display individual bytes so we sum the increments for each character, causing the average character spacing to be very uniform with no obvious word boundaries.
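A small sketch of that per-character summation, using the mean increments from the table above (0.92 for lead bytes, 0.80 for continuation bytes) as stand-in values rather than real model output. The function name is hypothetical.

```python
def per_character_increments(text, byte_deltas):
    """Sum per-byte increments over the UTF-8 bytes of each character."""
    totals = []
    i = 0
    for ch in text:
        n = len(ch.encode("utf-8"))       # number of bytes for this character
        totals.append((ch, sum(byte_deltas[i:i + n])))
        i += n
    if i != len(byte_deltas):
        raise ValueError("byte_deltas length does not match the UTF-8 length of text")
    return totals

# Stand-in values: one lead byte and two continuation bytes per character.
byte_deltas = [0.92, 0.80, 0.80] * 2
for ch, total in per_character_increments("人工", byte_deltas):
    print(ch, round(total, 2))
# 人 2.52
# 工 2.52
```

Since every character contributes one lead byte and two continuation bytes, and each category has a narrow range, the per-character totals end up nearly identical.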

The Chinese sentence '人工智能是计算机科学的一个分支。' with each character placed at its learned position; the sixteen characters are spaced almost perfectly evenly.

First Layer of Per-Layer Model

On the per-layer model, I found that the learned positions tended to explode by default, so I bounded them to max_delta = 10.

The model trained with that architecture found larger increments but shows the same pattern as the shared-MLP model for the first layer.

Category	Examples	Learned Increment δ (L0)
English (lowercase)	a-z	1.21–2.53 (mean 1.64)
Chinese (continuation byte)	`0x80–0xBF`	1.57–2.08 (mean 1.79)
Chinese (lead byte)	`0xE4–0xE9`	2.04–2.72 (mean 2.43)
English (uppercase)	A-Z	2.87–9.98^[1] (mean 9.52)
Punctuation	. , ; ! ?	9.80–9.98 (mean 9.90)
Other boundaries	EOS	9.82
Word boundary	space	9.99
Line boundary	newline	9.99

Chinese Word Boundaries

Since Chinese doesn't have spaces between words, I was interested to see if the model would learn word boundaries from Chinese text without punctuation, so I ran my per-layer model on held-out text from Chinese Wikipedia and compared my learned increments to word boundaries detected by jieba (a Chinese word segmenter).

I measured how well the learned increment at each layer separates true word boundaries from non-boundaries, as an ROC-AUC (0.5 = chance; 1.0 = boundaries always get the larger increment; 0.0 = boundaries always get the smaller increment, which separates them just as well). I score only the gaps between two Chinese characters (no space or punctuation), using the increment at the next character's leading byte.
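As a minimal sketch of this metric, ROC-AUC can be computed as the probability that a randomly chosen boundary gets a larger increment than a randomly chosen non-boundary, with ties counted as half. The scores and labels below are made up; the real evaluation uses jieba boundaries and model increments.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

increments = [2.0, 1.0, 3.0, 0.5]   # made-up increments at character gaps
is_boundary = [1, 0, 1, 0]          # made-up word-boundary labels

print(roc_auc(increments, is_boundary))                    # 1.0
print(roc_auc([-x for x in increments], is_boundary))      # 0.0
print(roc_auc([1.0, 1.0, 1.0, 1.0], is_boundary))          # 0.5
```

The second call shows why an AUC well below 0.5 is still informative: the layer separates boundaries from non-boundaries, but by shrinking the increment at boundaries instead of growing it.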

Layer (increment computed from)	Chinese word-boundary AUC
L0 (byte identity)	0.50 (chance)
L1	0.54
L2	0.68
L3	0.37
L4	0.63
L5	0.47

The first layer is unable to detect word boundaries since it only sees the byte's embedding and has no contextual information, but the middle layers (L2–L4) are able to distinguish word boundaries (although L3 seems to be compressing boundaries rather than expanding them).

Per-Layer Plots

We plot the same sentences from above but using per-layer position increments. Each layer is scaled independently to make the results legible.

Six rows (L0–L5) of the Chinese sentence. Layer 0 is nearly uniform; the middle layers pack the multi-character words 人工智能 and 计算机科学 closer together and open small gaps around the single-character words 是 and 的 and before the period.

If we remove the per-layer normalization, we can also see that later layers want smaller position increments.

The same six-layer plot of the Marie Curie sentence with all layers on one shared scale. Early layers (L0–L2) spread the characters across a wide span while later layers (L3–L5) compress them into a much shorter one.
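The two scalings used in these plots can be sketched as follows. This is an assumed reconstruction of the layout logic, not the plotting code from the repository, and the increments are made up.

```python
from itertools import accumulate

def layer_positions(deltas):
    """Cumulative positions for one layer."""
    return list(accumulate(deltas))

def scale_independently(layers):
    """Scale each layer so its last character sits at x = 1.0."""
    scaled = []
    for deltas in layers:
        xs = layer_positions(deltas)
        scaled.append([x / xs[-1] for x in xs])
    return scaled

def scale_shared(layers):
    """Scale all layers by the widest layer, preserving relative spans."""
    all_xs = [layer_positions(deltas) for deltas in layers]
    widest = max(xs[-1] for xs in all_xs)
    return [[x / widest for x in xs] for xs in all_xs]

layers = [[2.0, 4.0, 2.0], [0.5, 0.25, 0.25]]  # made-up early and late layer

print(scale_independently(layers))  # [[0.25, 0.75, 1.0], [0.5, 0.75, 1.0]]
print(scale_shared(layers))         # [[0.25, 0.75, 1.0], [0.0625, 0.09375, 0.125]]
```

Independent scaling makes the pattern within each layer legible, while shared scaling shows how much smaller one layer's total span is than another's.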

Grouping Multi-word Entities

The plots above made me wonder if the model groups multi-word entities like "Marie Curie" or "New York". To test this, I ran inference on a set of prompts with either a multi-word entity or the reversed version (i.e. "New York" or "York New") and compared the learned increment at the space token. The prompts were "A B", "the A B", "I visited A B", "near A B", and "they went to A B".

The results show that there was no difference in spacing in L0 (as expected) but the spacing is significantly smaller in the other layers for the real direction ("New York") vs the reversed direction ("York New").

Layer (increment from)	δ real order	δ reversed	% smaller space for real order	p (two-sided)
L0 (byte identity)	9.99^[1]	9.99	0%	1.0
L1	1.42	1.43	51%	0.28 (n.s.)
L2	1.43	1.54	71%	3e-5
L3	0.06	0.10	66%	6e-5
L4	0.86	1.21	77%	3e-8
L5	0.47	0.64	78%	3e-7
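The post does not say which statistical test produced the p-values, so the following is only one simple option: an exact two-sided sign test on paired increments, which also yields the "% smaller for real order" figure. The function name and the data are hypothetical.

```python
from math import comb

def sign_test(real, reversed_order):
    """Exact two-sided sign test on paired values; tied pairs are dropped.

    Returns (fraction of untied pairs where real < reversed, p-value).
    """
    smaller = sum(r < v for r, v in zip(real, reversed_order))
    larger = sum(r > v for r, v in zip(real, reversed_order))
    n = smaller + larger
    if n == 0:
        return 0.0, 1.0                  # all pairs tied: no evidence either way
    k = max(smaller, larger)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return smaller / n, min(1.0, 2.0 * tail)

# Made-up data: the real order gets the smaller space increment in all 10 pairs.
real = [1.0] * 10
reversed_order = [2.0] * 10
print(sign_test(real, reversed_order))   # (1.0, 0.001953125)

# Saturated increments tie everywhere, as in the L0 row.
print(sign_test([9.99] * 10, [9.99] * 10))  # (0.0, 1.0)
```

A sign test ignores how large each difference is, so a test that uses magnitudes (for example a paired t-test or Wilcoxon signed-rank test) could give different p-values on the same data.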

Since the model is predicting spacing before seeing the second word, this only works if the model can predict that the word will be continued ("New [York]") and didn't work with fake multi-word entities like "Zorblax [Quimby]".

Loss Neutral

I consistently found that the learned position increments have no detectable effect on loss or perplexity.

Training loss curve from Weights and Biases showing 7 different training runs with nearly-identical loss dropping from 3.5 to approximately 1.13 over the course of 50,000 steps.

Since the models do learn meaningful position increments, this implies that they must provide some benefit (or else there would be no gradient pressure), but I suspect that positional encoding is not the bottleneck for LM performance, so while LMs will use the easier loss landscape of learned position increments, they don't need it.

Supporting evidence for this is that LMs can work around a complete lack of positional information (Haviv et al., 2022).

Limitations

I only trained a small number of models and with very little variation between architectures.
Because the learned position increments didn't meaningfully improve loss, the gradient signal for them to be useful is very weak. In practice, they seemed to be consistent and meaningful, but I only inspected a small number of models and layers.
I never trained a large model from scratch and it's unclear if the models learn the same position increments during fine-tuning as they would when learning from scratch.
I didn't train per-layer position increment vectors on a large model.

Future Work

The method appears to work, but the real test will be if we can find anything interesting from this data. Some things I think it might be useful for are:

Finding summary positions, where inspecting the model with other tools would be particularly useful. For example, the last token before a large positional increment may be interesting.
Understanding what a model is looking for at each layer, especially open-ended investigation of larger models.
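For the first idea, a minimal sketch of how candidate summary positions might be selected: pick every token that is immediately followed by an increment above some threshold. The threshold, the function name, and the increments are all hypothetical.

```python
def positions_before_large_increments(tokens, deltas, threshold):
    """Return (index, token) for tokens immediately followed by a large increment.

    deltas[i] is the increment applied at token i, i.e. the gap to its left.
    """
    if len(tokens) != len(deltas):
        raise ValueError("tokens and deltas must have the same length")
    return [(i, tokens[i])
            for i in range(len(tokens) - 1)
            if deltas[i + 1] > threshold]

tokens = list("Hi. Yo")
deltas = [1.1, 0.8, 1.2, 2.0, 1.1, 0.8]   # made-up increments
print(positions_before_large_increments(tokens, deltas, threshold=1.5))
# [(2, '.')]
```

In a bounded per-layer model, a threshold close to max_delta would pick out the positions where the increment saturates.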

I also think the structure may be more interesting with different data sets. For example, I found that a model trained on code detected different kinds of structure in each layer.

There are also improvements that could be made to the method:

Determining the best way to train the per-layer position increment vectors. Per-token increments trained easily, but per-layer vectors required additional oversight and I doubt that my method and hyperparameters were the best way to do this. I just used the first method that worked.
Investigating a version of ALiBi with a learned per-token penalty — the forget gate from Selective RoPE (Movahedi et al., 2025). I was able to train models with this architecture but haven't tried to interpret the results yet.
Figuring out a way to learn more forward-looking position increments. Right now, when generating the increment for "New ", the model needs to decide on the space increment before it sees "York". BPE helps with this somewhat since spaces usually get collapsed. I wonder if we could allow a model to retroactively change the increments on seeing later words, although I'm not sure this can be done without making training unstable.

I also fine-tuned an existing model with learned per-token position increments to see if this could be added after pretraining. The increments changed in the expected directions, but very slowly. I haven't tried the per-layer version or inspected the results yet, and getting increments on the scale of my other results would require either tuning or a much longer run.

Weights and Biases screenshot showing learned increments (deltas) over time, with the standard deviation increasing from 0 to 0.006, and the min and max changing by about 0.001 each.

I'm always interested in discussing this further if anyone's interested. I'm working independently, so it's very difficult for me to keep track of what's going on in the mech interp world on my own.

Related Work

Learned, input-dependent positions have been proposed several times; I came to most of this after running the experiments.

CARoPE (Veisi et al., 2025) accumulates per-token, per-head, per-frequency-band rotation frequencies; my scalar increment is a strict special case (one value shared across all bands and heads), so I claim no mechanical novelty for the scalar variant — the contribution here is the interpretability angle.
CoPE (Golovneva et al., 2024) advances position by a contextual gate (a sigmoid of query–key interactions), intended as a soft counter of salient tokens; mine is a per-token increment that can run the position clock faster or slower than one-per-token.
Selective RoPE (Movahedi et al., 2025) is closest to my per-layer variant — input-dependent arbitrary rotation angles, mostly on gated/linear-attention models — and explicitly leaves analysis of the learned phase gate to future work, which I do here.
Layer-specific RoPE scaling (Wang et al., 2025) applies a fixed, input-independent per-layer frequency rescale; my per-layer increments are learned and input-dependent.

Code

All code is available on GitHub at brendanlong/learned-position-increments-experiment.

[1] My per-layer model is bounded with max_delta = 10, so interpret any value of ~10 as an increment "as high as the model is allowed to set it".

Joseph Eisner (20h)

This is really interesting!

I'm curious what happens if you relax the accumulating sum, allowing the model to "reposition" the tokens as it desires. Due to the causal mask the model already has access to the ordering (which is why NoPE is an effective positional encoding), but this might allow it to move related words near each other, even if they are not adjacent in the sentence.

Moreover, you could provide multiple dimensions of "position" with which to do this, by say having a rotary encoding on the first half of the latent vectors and then a separate independently learned encoding on the second half.

I wonder how (or if) the model would use this to group tokens at various layers.

Brendan Long (9h)

I'm not sure how interesting this is, but I relaxed the positive-increment constraint and removed the sigmoid, and the model only learned one negative increment, specifically the last byte of the Chinese possessive modifier (的), although 's in English sometimes got slightly-negative increments.

There are definitely aspects of this that make me think it's constrained by the inability to change its mind, since s' in English (plural possessive) gets a larger increment, presumably because it's too late to change "s" and the model likes to separate on punctuation. I'm guessing if we can figure out a way to let the model decide on spacing after seeing the whole sentence it would do something more interesting.

Brendan Long (17h)

I think the requirement that increments be positive probably isn't necessary, although the problem where the model can't look forward prevents a lot of uses for this (it might work decently with BPE though).

For retroactively changing positions, I think training would be unstable and finicky, but it would be interesting if you could get it to work.

Something related that I'm interested in is BLT/H-Net style models that do dynamic chunking instead of tokenization. H-Net does this at multiple scales, so it could theoretically group words in the first layer and concepts in the second. I haven't actually tried to train or inspect one yet though.

Elliot Callender (15h)

Agree with Joseph, this is really cool stuff!

Looks to me like intermediate layers are using positions in a way very alien to humans; like, there's some obvious "natural" semantic segmentation, but I'm not discerning a legible pattern to the distances between individual chars.

Brendan Long (14h)

I'm not sure if I would read too much into the individual character variations yet, since this model is small and the current setup is hacky and not well optimized. I'm hoping to do a larger run once I can get the training to be more stable. I have some experiments running now trying to balance weight decay/penalties to find a version that's stable without just collapsing everything back to 1.
