I do a quick experiment to investigate how DroPE (Dropping Positional Embeddings) models differ from standard RoPE models in their use of "massive values" (that is, concentrated large activations in Query and Key tensors) that prior work identifies as important for contextual understanding. I did this in my personal time, for fun.
Update: I added an additional experiment after the initial post, with more specific task evaluations.
Three main findings:
DroPE reduces massive value concentration significantly in Query tensors compared to RoPE.
RoPE relies way more on massive values than DroPE, and disrupting them breaks RoPE but only degrades DroPE.
When massive values are disrupted on Llama-2-7B with RoPE, contextual knowledge degrades far more (94.3%) than parametric knowledge (24.5%). DroPE behaves dramatically differently: contextual degradation is 73% lower (25% vs. 94.3%), and parametric accuracy actually improves under disruption.
Most strikingly, passkey retrieval (where you insert a passkey that the model has to retrieve in some large text) collapses completely under disruption in RoPE but is entirely unaffected in DroPE.
These findings suggest that, during recalibration, DroPE learns alternative attention mechanisms that don't depend on concentrated features.
Background
What Are Massive Values?
Massive values are unusually large activations in the Query (Q) and Key (K) tensors of transformer attention layers. They were identified by Jin et al. (2025) to have the pattern of:
Being concentrated in low-frequency RoPE dimensions
Being present in Q and K but notably absent in V
Being critical for contextual knowledge understanding tasks (passkey retrieval, sentiment analysis, mathematical reasoning) but not for parametric knowledge retrieval (factual recall)
Jin et al. provide a mechanistic explanation rooted in RoPE's frequency structure. RoPE divides the head dimension into pairs, each rotating at a frequency θ_j = 10000^(−2j/d). High-frequency components (small j) change rapidly with position, encoding fine-grained positional information. Low-frequency components (large j) change slowly, and Jin et al. argue these dimensions primarily encode semantic content rather than position. They find that disrupting these massive values devastates contextual understanding tasks, while parametric knowledge tasks show only minor degradation.
What Is DroPE?
DroPE (Gelberg et al., 2025) is a method that removes Rotary Position Embeddings (RoPE) from pretrained models and recalibrates them, which has the effect of zero-shot extending context length.
The claim, roughly: RoPE scaling methods (PI, YaRN, NTK) attempt to extend context by compressing rotation frequencies. But low-frequency RoPE components never complete a full rotation during training (ϕ_m(C_train) < 2π for small ω_m). At extended lengths these phases become out-of-distribution, so any scaling method must compress low frequencies by a factor of 1/s to keep phases in range. But this compression shifts attention weights at long distances, exactly where semantic matching matters most.
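A quick back-of-envelope check of the phase claim, assuming Llama-2's head dimension d = 128 and a training context of 4096 tokens (both assumptions, not taken from either paper's code):

```python
import math

d, C_train = 128, 4096  # assumed: Llama-2 head dim and training context
freqs = [10000 ** (-2 * j / d) for j in range(d // 2)]

# Number of full rotations each frequency pair completes during training:
turns = [w * C_train / (2 * math.pi) for w in freqs]
# turns[0] is ~650 (the fastest pair, rich positional signal), while
# turns[-1] is ~0.08: the slowest pair never completes a single rotation,
# so longer contexts push its phase outside anything seen in training.
```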
These papers make seemingly incompatible claims.
Jin et al. claim: RoPE -> massive values -> essential for contextual understanding
Gelberg et al. claim: removing RoPE -> better context extension with preserved capabilities
If massive values are caused by RoPE and critical for understanding, how does DroPE maintain performance?
So we can check a pretty neat and well-scoped research question: are massive values a cause or consequence of contextual knowledge capabilities? And the proxy test we can do cheaply here is: does DroPE, after recalibration, still have massive values?
Experiment 1: Massive Value Comparison
Methodology
Models compared:
meta-llama/Llama-2-7b-hf (standard RoPE)
SakanaAI/Llama-2-7b-hf-DroPE (RoPE removed + recalibrated)
Procedure:
Extract Q, K, V tensors from all 32 layers using forward hooks on projection outputs
Compute L2 norm matrix M[head, dim] for each tensor
Count positions where M > 5.0 × mean(M) (the definition for massive values)
Repeat across multiple samples and report mean plus minus std
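The norm-and-count step above can be sketched as follows (a minimal sketch; the tensor shapes and names are my assumptions, not the original code):

```python
import torch

LAMBDA = 5.0  # threshold: "massive" = more than 5x the mean of the norm matrix

def count_massive(proj_out, num_heads):
    """proj_out: [batch, seq, num_heads * head_dim], the output of q_proj or
    k_proj as captured by a forward hook. Returns the massive-value count."""
    b, s, d = proj_out.shape
    x = proj_out.view(b, s, num_heads, d // num_heads)
    # L2 norm over batch and sequence gives the matrix M[head, dim]
    M = x.pow(2).sum(dim=(0, 1)).sqrt()
    return int((M > LAMBDA * M.mean()).sum().item())
```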
Text samples used: 10 texts including:
Literary: Hobbit, Tale of Two Cities, Moby Dick excerpts
Technical: ML/transformer descriptions
Conversational: Dialogue snippets
Factual: Scientific descriptions
Results
| Tensor | RoPE (mean ± std) | DroPE (mean ± std) | Change |
| --- | --- | --- | --- |
| Query | 1475.5 ± 22.6 | 901.4 ± 36.0 | −38.9% |
| Key | 1496.8 ± 69.8 | 1331.5 ± 74.1 | −11.0% |
| Value | 174.0 ± 10.7 | 176.6 ± 5.7 | +1.5% |
Figure 1: Massive value counts for Query, Key, and Value tensors. Error bars show ±1 standard deviation across 10 text samples. DroPE shows 39% reduction in Query and 11% reduction in Key.
We also plot this across layers.
Figure 2: Query massive values by layer. The shaded area shows the reduction from RoPE to DroPE. DroPE consistently has ~17 fewer massive values per layer.
Interpretation
How do we interpret these results?
Query shows the largest reduction in number of massive values. Roughly, the Query tensor encodes "what to look for" in attention, which is the model's representation of what information the current position needs. DroPE models learn to distribute this information more evenly across dimensions rather than concentrating it in the low-frequency RoPE dimensions.
Key shows moderate reduction in number of massive values. Roughly, the Key tensor encodes "what information is here" at each position. The smaller reduction suggests some concentration patterns persist, possibly because Key representations must still support some semantic matching.
Value is unchanged, within error bars. Mostly just confirms the Jin et al. finding.
Low variance across text types (std ~2-5% of mean) indicates this is a robust structural property of the models, not dependent on input content.
However, a closer look at Figure 2 shows DroPE didn't uniformly reduce massive values.
Figure 3: Layer 1 is the only layer where DroPE has MORE massive values than RoPE. This suggests DroPE concentrates some position-independent processing in the first layer.
Not sure how to interpret this. Possibly, without positional embeddings, DroPE may use layer 1 to establish token relationships through content alone, then rely less on concentrated features in subsequent layers.
Experiment 2: Disruption Experiment
Motivation
Finding 1 shows DroPE has fewer massive values, but are these values still functionally important? We test this by zeroing out massive-value dimensions and measuring model degradation.
Methodology
Procedure:
Identify massive value dimensions in Q and K projections (threshold λ=5.0)
Register forward hooks that zero out these specific dimensions
Measure perplexity on held-out text before and after disruption
Compare to control: zeroing same number of random dimensions
Repeat with 10 different random seeds for control condition
Disruption implementation:
# Hook on q_proj output; `mask` is a boolean tensor over the output's last
# dimension (True = massive-value dimension), captured from the enclosing
# scope when the hook is registered.
def hook(module, inputs, output):
    zero_mask = (~mask).to(output.dtype)  # 0 where massive, 1 elsewhere
    return output * zero_mask  # the returned tensor replaces q_proj's output
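As a self-contained toy check of this mechanism, the same hook can be registered on any linear layer (the layer size and the "massive" dimension below are made up for illustration):

```python
import torch

# Pretend dimension 0 was identified as a massive-value dimension.
mask = torch.zeros(64, dtype=torch.bool)
mask[0] = True

def hook(module, inputs, output):
    return output * (~mask).to(output.dtype)  # zero the masked dimensions

layer = torch.nn.Linear(64, 64, bias=False)
handle = layer.register_forward_hook(hook)
out = layer(torch.randn(2, 8, 64))  # dimension 0 of the output is now zero
handle.remove()
```

The value returned by a forward hook replaces the module's output for everything downstream, which is what makes this a clean intervention.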
Metric: Our metric of choice here is M-R Difference = (Massive disruption PPL increase) - (Random disruption PPL increase)
and we interpret it as
Higher M-R difference = model relies more on massive values specifically
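Spelled out as code (trivial, but it pins down the sign conventions; the numbers in the usage assertions are made up):

```python
def ppl_increase(baseline, disrupted):
    """Percent perplexity increase relative to baseline."""
    return 100 * (disrupted - baseline) / baseline

def m_r_difference(baseline, massive_ppl, random_ppl):
    """M-R Difference: massive-disruption increase minus random-disruption
    increase. Large positive values mean the model relies on massive values
    specifically, not just on having all dimensions intact."""
    return ppl_increase(baseline, massive_ppl) - ppl_increase(baseline, random_ppl)
```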
Results
Raw Perplexity Values
| Model | Baseline | Massive Zeroed | Random Zeroed |
| --- | --- | --- | --- |
| RoPE | 1.30 | 1,508.5 | 1.31 |
| DroPE | 1.49 | 22.7 | 1.49 |
Percent Increase (mean ± std across 10 seeds)
| Model | Massive Disruption | Random Disruption | M-R Difference |
| --- | --- | --- | --- |
| RoPE | +115,929% ± 0.0% | +0.6% ± 0.7% | +115,929% |
| DroPE | +1,421% ± 0.0% | +0.2% ± 1.2% | +1,421% |
Figure 4: Perplexity after disruption (log scale). Zeroing massive values breaks RoPE (PPL 1 -> 1508) but only degrades DroPE (PPL 1.5 -> 23). Random controls cause negligible damage.
Statistical validation:
We do some quick statistical tests because this is so cheap to do.
Paired t-test (massive vs random): p < 10⁻⁴⁸ for RoPE, p < 10⁻²⁹ for DroPE
Independent t-test (RoPE vs DroPE): p < 10⁻⁸⁷
Cohen's d > 1000
So I feel fairly confident that these results are significant!
Key ratio: RoPE relies 82× more on massive values than DroPE
Consistency Across Text Types
| Text Type | RoPE PPL Increase | DroPE PPL Increase |
| --- | --- | --- |
| Literary | +116,000% | +1,400% |
| Technical | +115,800% | +1,450% |
| Repetitive | +116,100% | +1,380% |
Results are pretty consistent regardless of text content.
Interpretation
RoPE model: Zeroing massive values completely breaks the model. The model cannot function without these concentrated activations.
DroPE model: Zeroing massive values degrades but doesn't break the model. The model has learned alternative mechanisms that partially compensate.
Control condition: Zeroing random dimensions causes negligible damage in both models, proving massive values are specifically important, not just any high-norm dimensions.
Basically, both models have massive values, but RoPE is catastrophically dependent on them while DroPE is not.
Experiment 3: Parametric vs Contextual Knowledge
Background
Jin et al. (2025) demonstrate that massive value disruption affects tasks which use contextual knowledge far more than tasks which use parametric knowledge. We replicate their methodology on both RoPE and DroPE using some of the same task categories:
Parametric Tasks (facts stored in model weights):
Cities (n=200): Yes/no factual statements ("Paris is in France")
Sports (n=100): Yes/no sports knowledge questions
Contextual Tasks (information extracted from input context):
IMDB (n=100): Sentiment classification from movie reviews
Passkey (n=20): Retrieve 5-digit number hidden in filler text
Disruption Method: We replace the top-1 massive dimension per head with the mean value.
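A toy version of this replacement on a captured projection output (shapes assumed; this is a sketch of the idea, not the original implementation):

```python
import torch

def disrupt_top1_per_head(proj_out, num_heads):
    """Replace each head's single largest-norm dimension with that head's
    mean activation (sketch of the Experiment 3 disruption)."""
    b, s, d = proj_out.shape
    x = proj_out.view(b, s, num_heads, d // num_heads).clone()
    norms = x.pow(2).sum(dim=(0, 1)).sqrt()  # M[head, dim]
    top = norms.argmax(dim=-1)               # top-1 dimension per head
    for h in range(num_heads):
        # RHS mean is computed before the assignment overwrites the dim
        x[:, :, h, top[h]] = x[:, :, h].mean()
    return x.view(b, s, d)
```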
Results
The results can be seen below.
| Model | Category | Task | Baseline | Disrupted | Degradation |
| --- | --- | --- | --- | --- | --- |
| RoPE | parametric | cities | 89.5% | 57.5% | 35.8% |
| RoPE | parametric | sports | 60.0% | 52.0% | 13.3% |
| RoPE | contextual | imdb | 44.0% | 5.0% | 88.6% |
| RoPE | contextual | passkey | 100.0% | 0.0% | 100.0% |
| DroPE | parametric | cities | 79.0% | 88.5% | −12.0% |
| DroPE | parametric | sports | 73.0% | 67.0% | 8.2% |
| DroPE | contextual | imdb | 30.0% | 15.0% | 50.0% |
| DroPE | contextual | passkey | 60.0% | 60.0% | 0.0% |
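For reference, the degradation column is just the relative accuracy drop (negative means the disrupted model improved):

```python
def degradation(baseline, disrupted):
    """Percent accuracy drop relative to baseline (negative = improvement)."""
    return 100 * (baseline - disrupted) / baseline
```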
Aggregated by category, it looks like:
Figure 5: Degradation comparison across all tasks. RoPE shows significant contextual degradation, while DroPE is more robust.
This is basically what we said in the summary: we replicate Jin et al.'s finding that, under RoPE, massive values are critical for contextual knowledge tasks but not for parametric knowledge tasks. In DroPE, however, contextual tasks degrade far less, and parametric tasks even improve slightly.
So massive values appear non-functional for information storage. The passkey task shows this best.
Figure 6: Passkey retrieval results. RoPE completely collapses (100% degradation) while DroPE is entirely unaffected (0% degradation).
This seems to demonstrate that DroPE's contextual retrieval mechanism is entirely independent of massive values, and the model has learned alternative attention patterns that don't rely on value concentration.
Interpretation
How should we interpret these results? The DroPE improvement under disruption suggests massive values are vestigial artifacts from the original RoPE training. To recap the evidence:
During RoPE pretraining, massive values encode positional information
DroPE removes positional embeddings and recalibrates
Massive values persist structurally but lose their function
They may actually create interference, explaining the improvement when disrupted
Figure 7: Average degradation by knowledge category. RoPE's contextual knowledge degrades 3.8× more than parametric.
Figure 8: Baseline vs. disrupted accuracy for all tasks. Note DroPE's Cities accuracy actually improves under disruption.
Figure 9: Combined summary of replicating some of Jin et al.'s results.
Resolving the Contradiction
The apparent contradiction between the papers dissolves once we distinguish where massive values come from versus how they're used:
Massive values are learned into weights during RoPE training, as opposed to being created by RoPE at inference. The projection matrices W_Q and W_K develop these concentration patterns because RoPE's frequency structure during training creates gradients that favor certain dimensions.
RoPE at inference makes massive values functionally critical. The rotation operation couples these concentrated features to position-dependent attention patterns. Remove the rotation, and the model breaks because the model doesn't know how to use them without positional modulation.
DroPE enables longer contexts. Our findings suggest a mechanism:
RoPE concentrates attention in specific dimensions via massive values
This concentration may create bottlenecks at long contexts
DroPE distributes attention more evenly, potentially enabling better generalization to longer sequences
Why did I do this?
Understanding how and why large language models work in a principled way will require knowing the internal mechanisms of the transformer stack very deeply. While many components (such as attention, MLPs, and residual connections) are now relatively well studied, positional encoding remains surprisingly opaque (at least, to me). In particular, Rotary Positional Embeddings (RoPE) are both weird and brittle: they strongly shape attention behavior, impose hard-to-reason-about constraints on context length, and interact nontrivially with model scaling, quantization, and alignment.
I find RoPE is often wonky to work with, where small changes in frequency scaling or context length can produce disproportionate failures, and extending context reliably seems like it will require delicate engineering. Also, I had a free hour, Claude Code, and reread Yoav's paper while at the gym earlier this morning.
Limitations and Future Work
Models tested: We examined only Llama-2-7B, as it was the largest DroPE model mentioned in the paper. Also, I can't find the other Sakana models on HuggingFace. Larger models and different architectures may show different patterns.
Recalibration dynamics: We compared endpoints (RoPE vs. fully recalibrated DroPE). Tracking massive values during recalibration would reveal how redistribution occurs.
What is actually going on? I'm so confused. I'd like a nicer, clearer mechanistic explanation of what's actually happening, as in the actual computations.
@inproceedings{jin2025massive,
  title     = {Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author    = {Jin, Mingyu and others},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}

@article{gelberg2025drope,
  title   = {Dropping Positional Embeddings for Zero-Shot Long-Context Extension},
  author  = {Gelberg, Tal and others},
  journal = {arXiv preprint arXiv:2512.12167},
  year    = {2025}
}

@techreport{africa2026massive,
  title  = {Massive Activations in DroPE: Evidence for Attention Reorganization},
  author = {Africa, David},
  year   = {2026},
  url    = {https://github.com/DavidDemitriAfrica/drope-activations}
}
Reproducibility
Code
All experiments can be reproduced from the repository at https://github.com/DavidDemitriAfrica/drope-activations.