I ran a quick experiment to investigate how DroPE (Dropping Positional Embeddings) models differ from standard RoPE models in their use of "massive values" (concentrated, unusually large activations in the Query and Key tensors) that prior work identifies as important for contextual understanding. I did this in my personal time, for fun.
Two main findings:
DroPE significantly reduces massive-value concentration in Query tensors compared to RoPE.
RoPE relies far more heavily on massive values than DroPE: disrupting them breaks RoPE but only degrades DroPE.
These findings suggest that, during recalibration, DroPE learns alternative attention mechanisms that don't depend on concentrated features.
Background
What Are Massive Values?
Massive values are unusually large activations in the Query (Q) and Key (K) tensors of transformer attention layers. Jin et al. (2025) identify them by the following pattern:
Being concentrated in low-frequency RoPE dimensions
Being present in Q and K but notably absent in V
Being critical for contextual knowledge understanding tasks (passkey retrieval, sentiment analysis, mathematical reasoning) but not for parametric knowledge retrieval (factual recall)
Jin et al. provide a mechanistic explanation rooted in RoPE's frequency structure. RoPE divides the head dimension into pairs, each rotating at a frequency θⱼ = 10000^(−2j/d). High-frequency components (small j) change rapidly with position, encoding fine-grained positional information. Low-frequency components (large j) change slowly, and Jin et al. argue these dimensions primarily encode semantic content rather than position. They find that disrupting these massive values devastates contextual understanding tasks, while parametric knowledge tasks show only minor degradation.
What Is DroPE?
DroPE (Gelberg et al., 2025) removes Rotary Position Embeddings (RoPE) from pretrained models and recalibrates them, which extends context length zero-shot.
Their claim, roughly: RoPE scaling methods (PI, YaRN, NTK) attempt to extend context by compressing rotation frequencies. But low-frequency RoPE components never complete a full rotation during training (ϕₘ(C_train) < 2π for small ωₘ). At extended lengths, these phases fall out of distribution, so any scaling method must compress the low frequencies by a factor of 1/s to keep the phases in range. That compression shifts attention weights at long distances, which is exactly where semantic matching matters most.
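To make the incomplete-rotation point concrete, here is a minimal sketch (assuming a Llama-2-style head dimension of 128, the standard RoPE base of 10000, and a 4,096-token training context) that computes the per-pair frequencies and counts how many pairs never complete a full rotation within the training context:

# Minimal sketch: per-pair RoPE frequencies theta_j = 10000^(-2j/d) for an assumed
# Llama-2-style head (head_dim=128) and training context C_train=4096. Counts the
# low-frequency pairs whose total phase at C_train stays below 2*pi, i.e. the pairs
# that never complete a full rotation during pretraining.
import math

head_dim = 128        # assumed: Llama-2-7B head dimension
base = 10000.0        # standard RoPE base
c_train = 4096        # assumed: Llama-2 pretraining context length

thetas = [base ** (-2 * j / head_dim) for j in range(head_dim // 2)]

never_rotate = [j for j, theta in enumerate(thetas) if c_train * theta < 2 * math.pi]
print(f"fastest pair: theta_0 = {thetas[0]:.4f} rad/token")
print(f"slowest pair: theta_{len(thetas) - 1} = {thetas[-1]:.2e} rad/token")
print(f"{len(never_rotate)} of {len(thetas)} frequency pairs never complete a rotation "
      f"within {c_train} tokens; any extension must compress their phases by 1/s")

With these assumed settings, 18 of the 64 frequency pairs never complete a rotation at 4,096 tokens, and those are exactly the pairs a 1/s compression has to squeeze.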
These papers make seemingly incompatible claims.
Jin et al.: RoPE -> massive values -> essential for contextual understanding.
Gelberg et al.: remove RoPE -> better context extension with preserved capabilities.
If massive values are caused by RoPE and critical for understanding, how does DroPE maintain performance?
So there is a neat, well-scoped research question to check: are massive values a cause or a consequence of contextual knowledge capabilities? And the cheap proxy test here is: does DroPE, after recalibration, still have massive values?
Experiment 1: Massive Value Comparison
Methodology
Models compared:
meta-llama/Llama-2-7b-hf (standard RoPE)
SakanaAI/Llama-2-7b-hf-DroPE (RoPE removed + recalibrated)
Procedure:
Extract Q, K, V tensors from all 32 layers using forward hooks on the projection outputs
Compute an L2-norm matrix M[head, dim] for each tensor
Count positions where M > 5.0 × mean(M) (the definition of massive values)
Repeat across multiple samples and report mean ± std (a minimal sketch of this procedure appears below, after the text samples)
Text samples used: 10 texts including:
Literary: Hobbit, Tale of Two Cities, Moby Dick excerpts
Technical: ML/transformer descriptions
Conversational: Dialogue snippets
Factual: Scientific descriptions
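A minimal sketch of the extraction-and-counting loop, assuming standard Hugging Face Llama module names (model.model.layers[i].self_attn.{q,k,v}_proj) and that Llama-2-7B uses the same head count for Q, K, and V; the DroPE checkpoint is swapped in the same way:

# Minimal sketch (assumed module names for a Hugging Face Llama model): capture
# Q/K/V projection outputs with forward hooks, build a [head, dim] L2-norm matrix
# over tokens, and count entries above 5x the matrix mean ("massive values").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # swap in SakanaAI/Llama-2-7b-hf-DroPE for the comparison
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
).eval()

captured = {}   # (layer_idx, proj_name) -> projection output

def make_hook(key):
    def hook(module, inputs, output):
        captured[key] = output.detach().float()
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    for name in ("q_proj", "k_proj", "v_proj"):
        handles.append(getattr(layer.self_attn, name).register_forward_hook(make_hook((i, name))))

text = "In a hole in the ground there lived a hobbit."   # one of the 10 samples
with torch.no_grad():
    model(**tok(text, return_tensors="pt").to(model.device))

n_heads = model.config.num_attention_heads   # Llama-2-7B: same head count for Q, K, V (no GQA)
head_dim = model.config.hidden_size // n_heads
counts = {"q_proj": 0, "k_proj": 0, "v_proj": 0}
for (i, name), out in captured.items():
    # out: [batch, seq, n_heads * head_dim] -> per-(head, dim) L2 norm over tokens
    x = out.view(out.shape[0], out.shape[1], n_heads, head_dim)
    M = x.pow(2).sum(dim=(0, 1)).sqrt()
    counts[name] += int((M > 5.0 * M.mean()).sum())

print(counts)   # total massive-value count per projection, summed over all 32 layers
for h in handles:
    h.remove()

Running this per text sample and averaging gives the means and standard deviations reported below.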
Results
Tensor   RoPE (mean ± std)   DroPE (mean ± std)   Change
Query    1475.5 ± 22.6       901.4 ± 36.0         -38.9%
Key      1496.8 ± 69.8       1331.5 ± 74.1        -11.0%
Value    174.0 ± 10.7        176.6 ± 5.7          +1.5%
Figure 1: Massive value counts for Query, Key, and Value tensors. Error bars show ±1 standard deviation across 10 text samples. DroPE shows 39% reduction in Query and 11% reduction in Key.
We also plot this across layers.
Figure 2: Query massive values by layer. The shaded area shows the reduction from RoPE to DroPE. DroPE consistently has ~17 fewer massive values per layer.
Interpretation
How do we interpret these results?
Query shows the largest reduction in number of massive values. Roughly, the Query tensor encodes "what to look for" in attention, which is the model's representation of what information the current position needs. DroPE models learn to distribute this information more evenly across dimensions rather than concentrating it in the low-frequency RoPE dimensions.
Key shows moderate reduction in number of massive values. Roughly, the Key tensor encodes "what information is here" at each position. The smaller reduction suggests some concentration patterns persist, possibly because Key representations must still support some semantic matching.
Value is unchanged, within error bars. This mostly confirms Jin et al.'s finding that massive values are absent from Value tensors.
Low variance across text types (std ~2-5% of mean) indicates this is a robust structural property of the models, not dependent on input content.
However, a closer look at Figure 2 shows DroPE didn't uniformly reduce massive values.
Figure 3: Layer 1 is the only layer where DroPE has MORE massive values than RoPE. This suggests DroPE concentrates some position-independent processing in the first layer.
Not sure how to interpret this. Possibly, without positional embeddings, DroPE may use layer 1 to establish token relationships through content alone, then rely less on concentrated features in subsequent layers.
Experiment 2: Disruption Experiment
Motivation
Finding 1 shows DroPE has fewer massive values, but are these values still functionally important? We test this by zeroing out massive-value dimensions and measuring model degradation.
Methodology
Procedure:
Identify massive value dimensions in Q and K projections (threshold λ=5.0)
Register forward hooks that zero out these specific dimensions
Measure perplexity on held-out text before and after disruption
Compare to a control: zero out the same number of randomly chosen dimensions
Repeat with 10 different random seeds for control condition
Disruption implementation:
# Hook on q_proj / k_proj output
# mask: boolean tensor over output features, True = massive value dimension
def hook(module, input, output):
    zero_mask = (~mask).to(output.dtype)  # 0 where massive, 1 elsewhere
    return output * zero_mask             # returning a tensor replaces the module output, zeroing massive dimensions
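For completeness, a sketch of how this hook might be wired up end to end: identifying the dimensions, registering the hooks on q_proj and k_proj, and measuring perplexity. The massive_dims mapping and helper names here are illustrative, not the exact script:

# Minimal sketch: zero out previously identified massive-value dimensions in the
# Q/K projections and measure perplexity before/after. massive_dims[(layer, name)]
# is assumed to be a boolean mask over output features, built from the Experiment 1
# norm matrix (flattened [head, dim], True where norm > 5.0 * mean).
import torch

def make_zeroing_hook(mask):
    def hook(module, inputs, output):
        zero_mask = (~mask).to(output.dtype).to(output.device)
        return output * zero_mask      # returned tensor replaces the module's output
    return hook

def attach_disruption(model, massive_dims):
    handles = []
    for i, layer in enumerate(model.model.layers):
        for name in ("q_proj", "k_proj"):
            proj = getattr(layer.self_attn, name)
            handles.append(proj.register_forward_hook(make_zeroing_hook(massive_dims[(i, name)])))
    return handles

@torch.no_grad()
def perplexity(model, input_ids):
    out = model(input_ids, labels=input_ids)    # HF computes the shifted LM loss
    return torch.exp(out.loss).item()

# Usage sketch:
# ppl_base = perplexity(model, ids)
# handles = attach_disruption(model, massive_dims)
# ppl_massive = perplexity(model, ids)
# for h in handles: h.remove()
# Control: repeat with random masks of the same per-layer size, over 10 seeds.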
Metric: Our metric of choice here is the M-R difference:
M-R difference = (PPL increase from massive disruption) - (PPL increase from random disruption)
A higher M-R difference means the model relies more on massive values specifically.
Results
Raw Perplexity Values
Model   Baseline   Massive Zeroed   Random Zeroed
RoPE    1.30       1,508.5          1.31
DroPE   1.49       22.7             1.49
Percent Increase (mean ± std across 10 seeds)
Model   Massive Disruption   Random Disruption   M-R Difference
RoPE    +115,929% ± 0.0%     +0.6% ± 0.7%        +115,929%
DroPE   +1,421% ± 0.0%       +0.2% ± 1.2%        +1,421%
Figure 4: Perplexity after disruption (log scale). Zeroing massive values breaks RoPE (PPL 1.3 -> 1,508) but only degrades DroPE (PPL 1.5 -> 23). Random controls cause negligible damage.
Statistical validation:
We ran some quick statistical tests, since they are so cheap to do (a sketch of the computation follows below):
Paired t-test (massive vs random): p < 10⁻⁴⁸ for RoPE, p < 10⁻²⁹ for DroPE
Independent t-test (RoPE vs DroPE): p < 10⁻⁸⁷
Cohen's d > 1000
So I feel fairly confident that these results are significant!
Key ratio: RoPE relies roughly 82× more on massive values than DroPE (ratio of M-R differences: 115,929 / 1,421 ≈ 82).
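For reference, a sketch of how these tests might be computed with scipy, assuming the per-seed percent PPL increases are collected into arrays; the values below are placeholders reconstructed from the reported means and standard deviations, not the raw measurements:

# Minimal sketch of the reported tests. Arrays are placeholders built from the
# summary statistics above, not the actual per-seed measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rope_massive  = np.full(10, 115_929.0)      # massive disruption is deterministic (std 0.0)
rope_random   = rng.normal(0.6, 0.7, 10)    # placeholder random-control increases
drope_massive = np.full(10, 1_421.0)
drope_random  = rng.normal(0.2, 1.2, 10)

# Paired t-test: massive vs random disruption within each model
print(stats.ttest_rel(rope_massive, rope_random))
print(stats.ttest_rel(drope_massive, drope_random))

# Independent t-test on the per-seed M-R differences: RoPE vs DroPE
mr_rope  = rope_massive - rope_random
mr_drope = drope_massive - drope_random
print(stats.ttest_ind(mr_rope, mr_drope))

# Cohen's d with a pooled standard deviation
def cohens_d(a, b):
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

print(cohens_d(mr_rope, mr_drope))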
Consistency Across Text Types
Text Type    RoPE PPL Increase   DroPE PPL Increase
Literary     +116,000%           +1,400%
Technical    +115,800%           +1,450%
Repetitive   +116,100%           +1,380%
Results are pretty consistent regardless of text content.
Interpretation
RoPE model: Zeroing massive values completely breaks the model. The model cannot function without these concentrated activations.
DroPE model: Zeroing massive values degrades but doesn't break the model. The model has learned alternative mechanisms that partially compensate.
Control condition: Zeroing random dimensions causes negligible damage in both models, showing that the massive-value dimensions specifically are important; the damage does not come from zeroing just any equally sized set of dimensions.
Basically, both models have massive values, but RoPE is catastrophically dependent on them while DroPE is not.
Takes
The apparent contradiction between the papers dissolves once we distinguish where massive values come from versus how they're used:
Massive values are learned into weights during RoPE training, as opposed to being created by RoPE at inference. The projection matrices W_Q and W_K develop these concentration patterns because RoPE's frequency structure during training creates gradients that favor certain dimensions.
RoPE at inference makes massive values functionally critical. The rotation operation couples these concentrated features to position-dependent attention patterns. Remove the rotation without recalibration, and the model breaks: it doesn't know how to use these features without positional modulation.
DroPE recalibration teaches alternative usage patterns. During the brief recalibration phase, it seems the model learns to:
Reduce concentration
Distribute information more evenly across dimensions
Perform attention based on content similarity alone
Why did I do this?
Understanding how and why large language models work in a principled way will require knowing the internal mechanisms of the transformer stack very deeply. While many components (such as attention, MLPs, and residual connections) are now relatively well studied, positional encoding remains surprisingly opaque (at least, to me). In particular, Rotary Positional Embeddings (RoPE) are both weird and brittle: they strongly shape attention behavior, impose hard-to-reason-about constraints on context length, and interact nontrivially with model scaling, quantization, and alignment.
I find RoPE often wonky to work with: small changes in frequency scaling or context length can produce disproportionate failures, and extending context reliably seems to require delicate engineering. Also, I had a free hour, Claude Code, and had just reread Yoav's paper at the gym earlier this morning.
Limitations and Future Work
Models tested: We examined only Llama-2-7B, as it was the largest DroPE model mentioned in the paper. Also, I can't find the other models on HuggingFace. Larger models and different architectures may show different patterns.
Recalibration dynamics: We compared endpoints (RoPE vs. fully recalibrated DroPE). Tracking massive values during recalibration would reveal how the redistribution occurs.
Task-specific analysis: We measured perplexity. Testing on Jin et al.'s contextual vs. parametric knowledge tasks would directly validate whether DroPE's reorganization preserves contextual understanding through alternative mechanisms. I'm doing this as we speak.
Reproducibility
All experiments can be reproduced from the code at https://github.com/DavidDemitriAfrica/drope-activations.
References
@article{jin2025massive,
title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
author={Jin, Mingyu and others},
journal={ICML},
year={2025}
}
@article{gelberg2025drope,
title={Dropping Positional Embeddings for Zero-Shot Long-Context Extension},
author={Gelberg, Tal and others},
journal={arXiv preprint arXiv:2512.12167},
year={2025}
}
Citation
If you use these findings, please cite:
@techreport{africa2026massive,
title = {Massive Activations in DroPE: Evidence for Attention Reorganization},
author = {Africa, David},
year = {2026},
url = {https://github.com/DavidDemitriAfrica/drope-activations}
}