I do a quick experiment to investigate how DroPE (Dropping Positional Embeddings) models differ from standard RoPE models in their use of "massive values" (that is, concentrated large activations in Query and Key tensors) that prior work identifies as important for contextual understanding. I did this in my personal time, for fun.
Update: I added an additional experiment after the initial post, with more specific task evaluations.
Three main findings:
DroPE reduces massive value concentration significantly in Query tensors compared to RoPE.
RoPE relies way more on massive values than DroPE, and disrupting them breaks RoPE but only degrades DroPE.
When massive values are disrupted on Llama-2-7B with RoPE, contextual knowledge degrades far more (94.3%) than parametric knowledge (24.5%). DroPE behaves dramatically differently: contextual degradation is 73% lower (25% vs. 94.3%), and parametric accuracy actually improves under disruption.
Most strikingly, passkey retrieval (where you insert a passkey that the model has to retrieve in some large text) collapses completely under disruption in RoPE but is entirely unaffected in DroPE.
These findings suggest that, during recalibration, DroPE learns alternative attention mechanisms that don't depend on concentrated features.
Background
What Are Massive Values?
Massive values are unusually large activations in the Query (Q) and Key (K) tensors of transformer attention layers. They were identified by Jin et al. (2025) to have the pattern of:
Being concentrated in low-frequency RoPE dimensions
Being present in Q and K but notably absent in V
Being critical for contextual knowledge understanding tasks (passkey retrieval, sentiment analysis, mathematical reasoning) but not for parametric knowledge retrieval (factual recall)
Jin et al. provide a mechanistic explanation rooted in RoPE's frequency structure. RoPE divides the head dimension into pairs, each rotating at a frequency θ_j = 10000^(−2j/d). High-frequency components (small j) change rapidly with position, encoding fine-grained positional information. Low-frequency components (large j) change slowly, and Jin et al. argue these dimensions primarily encode semantic content rather than position. They find that disrupting these massive values devastates contextual understanding tasks, while parametric knowledge tasks show only minor degradation.
What Is DroPE?
DroPE (Gelberg et al., 2025) is a method that removes Rotary Position Embeddings (RoPE) from pretrained models and recalibrates them, which has the effect of zero-shot extending context length.
The claim, roughly: RoPE scaling methods (PI, YaRN, NTK) attempt to extend context by compressing rotation frequencies. But low-frequency RoPE components never complete a full rotation during training (ϕ_m(C_train) < 2π for small ω_m). At extended lengths these phases become out-of-distribution, so any scaling method must compress low frequencies by a factor of 1/s to keep phases in range. But this compression shifts attention weights at long distances, exactly where semantic matching matters most.
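A quick back-of-envelope check of the phase claim, assuming Llama-2's head dimension d = 128 and a training context of 4096 tokens (both assumptions, not taken from either paper's code):

```python
import math

d, C_train = 128, 4096  # assumed: Llama-2 head dim and training context
freqs = [10000 ** (-2 * j / d) for j in range(d // 2)]

# Number of full rotations each frequency pair completes during training:
turns = [w * C_train / (2 * math.pi) for w in freqs]
# turns[0] is ~650 (the fastest pair, rich positional signal), while
# turns[-1] is ~0.08: the slowest pair never completes a single rotation,
# so longer contexts push its phase outside anything seen in training.
```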
These papers make seemingly incompatible claims.
Jin et al. claim: RoPE -> massive values -> essential for contextual understanding
Gelberg et al. claim: removing RoPE -> better context extension with preserved capabilities
If massive values are caused by RoPE and critical for understanding, how does DroPE maintain performance?
So we can check a pretty neat and well-scoped research question: are massive values a cause or consequence of contextual knowledge capabilities? And the proxy test we can do cheaply here is: does DroPE, after recalibration, still have massive values?
Experiment 1: Massive Value Comparison
Methodology
Models compared:
meta-llama/Llama-2-7b-hf (standard RoPE)
SakanaAI/Llama-2-7b-hf-DroPE (RoPE removed + recalibrated)
Procedure:
Extract Q, K, V tensors from all 32 layers using forward hooks on projection outputs
Compute L2 norm matrix M[head, dim] for each tensor
Count positions where M > 5.0 × mean(M) (the definition for massive values)
Repeat across multiple samples and report mean plus minus std
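The norm-and-count step above can be sketched as follows (a minimal sketch; the tensor shapes and names are my assumptions, not the original code):

```python
import torch

LAMBDA = 5.0  # threshold: "massive" = more than 5x the mean of the norm matrix

def count_massive(proj_out, num_heads):
    """proj_out: [batch, seq, num_heads * head_dim], the output of q_proj or
    k_proj as captured by a forward hook. Returns the massive-value count."""
    b, s, d = proj_out.shape
    x = proj_out.view(b, s, num_heads, d // num_heads)
    # L2 norm over batch and sequence gives the matrix M[head, dim]
    M = x.pow(2).sum(dim=(0, 1)).sqrt()
    return int((M > LAMBDA * M.mean()).sum().item())
```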
Text samples used: 10 texts including:
Literary: Hobbit, Tale of Two Cities, Moby Dick excerpts
Technical: ML/transformer descriptions
Conversational: Dialogue snippets
Factual: Scientific descriptions
Results
| Tensor | RoPE (mean ± std) | DroPE (mean ± std) | Change |
| --- | --- | --- | --- |
| Query | 1475.5 ± 22.6 | 901.4 ± 36.0 | −38.9% |
| Key | 1496.8 ± 69.8 | 1331.5 ± 74.1 | −11.0% |
| Value | 174.0 ± 10.7 | 176.6 ± 5.7 | +1.5% |
Figure 1: Massive value counts for Query, Key, and Value tensors. Error bars show ±1 standard deviation across 10 text samples. DroPE shows 39% reduction in Query and 11% reduction in Key.
We also plot this across layers.
Figure 2: Query massive values by layer. The shaded area shows the reduction from RoPE to DroPE. DroPE consistently has ~17 fewer massive values per layer.
Interpretation
How do we interpret these results?
Query shows the largest reduction in number of massive values. Roughly, the Query tensor encodes "what to look for" in attention, which is the model's representation of what information the current position needs. DroPE models learn to distribute this information more evenly across dimensions rather than concentrating it in the low-frequency RoPE dimensions.
Key shows moderate reduction in number of massive values. Roughly, the Key tensor encodes "what information is here" at each position. The smaller reduction suggests some concentration patterns persist, possibly because Key representations must still support some semantic matching.
Value is unchanged, within error bars. Mostly just confirms the Jin et al. finding.
Low variance across text types (std ~2-5% of mean) indicates this is a robust structural property of the models, not dependent on input content.
However, a closer look at Figure 2 shows DroPE didn't uniformly reduce massive values.
Figure 3: Layer 1 is the only layer where DroPE has MORE massive values than RoPE. This suggests DroPE concentrates some position-independent processing in the first layer.
Not sure how to interpret this. Possibly, without positional embeddings, DroPE may use layer 1 to establish token relationships through content alone, then rely less on concentrated features in subsequent layers.
Experiment 2: Disruption Experiment
Motivation
Finding 1 shows DroPE has fewer massive values, but are these values still functionally important? We test this by zeroing out massive-value dimensions and measuring model degradation.
Methodology
Procedure:
Identify massive value dimensions in Q and K projections (threshold λ=5.0)
Register forward hooks that zero out these specific dimensions
Measure perplexity on held-out text before and after disruption
Compare to control: zeroing same number of random dimensions
Repeat with 10 different random seeds for control condition
Disruption implementation:
# Hook on q_proj output; `mask` is a boolean tensor over the output's last
# dimension (True = massive-value dimension), captured from the enclosing
# scope when the hook is registered.
def hook(module, inputs, output):
    zero_mask = (~mask).to(output.dtype)  # 0 where massive, 1 elsewhere
    return output * zero_mask  # the returned tensor replaces q_proj's output
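As a self-contained toy check of this mechanism, the same hook can be registered on any linear layer (the layer size and the "massive" dimension below are made up for illustration):

```python
import torch

# Pretend dimension 0 was identified as a massive-value dimension.
mask = torch.zeros(64, dtype=torch.bool)
mask[0] = True

def hook(module, inputs, output):
    return output * (~mask).to(output.dtype)  # zero the masked dimensions

layer = torch.nn.Linear(64, 64, bias=False)
handle = layer.register_forward_hook(hook)
out = layer(torch.randn(2, 8, 64))  # dimension 0 of the output is now zero
handle.remove()
```

The value returned by a forward hook replaces the module's output for everything downstream, which is what makes this a clean intervention.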
Metric: Our metric of choice here is M-R Difference = (Massive disruption PPL increase) - (Random disruption PPL increase)
and we interpret it as
Higher M-R difference = model relies more on massive values specifically
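Spelled out as code (trivial, but it pins down the sign conventions; the numbers in the usage assertions are made up):

```python
def ppl_increase(baseline, disrupted):
    """Percent perplexity increase relative to baseline."""
    return 100 * (disrupted - baseline) / baseline

def m_r_difference(baseline, massive_ppl, random_ppl):
    """M-R Difference: massive-disruption increase minus random-disruption
    increase. Large positive values mean the model relies on massive values
    specifically, not just on having all dimensions intact."""
    return ppl_increase(baseline, massive_ppl) - ppl_increase(baseline, random_ppl)
```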
Results
Raw Perplexity Values
| Model | Baseline | Massive Zeroed | Random Zeroed |
| --- | --- | --- | --- |
| RoPE | 1.30 | 1,508.5 | 1.31 |
| DroPE | 1.49 | 22.7 | 1.49 |
Percent Increase (mean ± std across 10 seeds)
| Model | Massive Disruption | Random Disruption | M-R Difference |
| --- | --- | --- | --- |
| RoPE | +115,929% ± 0.0% | +0.6% ± 0.7% | +115,929% |
| DroPE | +1,421% ± 0.0% | +0.2% ± 1.2% | +1,421% |
Figure 4: Perplexity after disruption (log scale). Zeroing massive values breaks RoPE (PPL 1 -> 1508) but only degrades DroPE (PPL 1.5 -> 23). Random controls cause negligible damage.
Statistical validation:
We do some quick statistical tests because this is so cheap to do.
Paired t-test (massive vs random): p < 10⁻⁴⁸ for RoPE, p < 10⁻²⁹ for DroPE
Independent t-test (RoPE vs DroPE): p < 10⁻⁸⁷
Cohen's d > 1000
So I feel fairly confident that these results are significant!
Key ratio: RoPE relies 82× more on massive values than DroPE
Consistency Across Text Types
| Text Type | RoPE PPL Increase | DroPE PPL Increase |
| --- | --- | --- |
| Literary | +116,000% | +1,400% |
| Technical | +115,800% | +1,450% |
| Repetitive | +116,100% | +1,380% |
Results are pretty consistent regardless of text content.
Interpretation
RoPE model: Zeroing massive values completely breaks the model. The model cannot function without these concentrated activations.
DroPE model: Zeroing massive values degrades but doesn't break the model. The model has learned alternative mechanisms that partially compensate.
Control condition: Zeroing random dimensions causes negligible damage in both models, proving massive values are specifically important, not just any high-norm dimensions.
Basically, both models have massive values, but RoPE is catastrophically dependent on them while DroPE is not.
Experiment 3: Parametric vs Contextual Knowledge
Background
Jin et al. (2025) demonstrate that massive value disruption affects tasks which use contextual knowledge far more than tasks which use parametric knowledge. We replicate their methodology on both RoPE and DroPE using some of the same task categories:
Parametric Tasks (facts stored in model weights):
Cities (n=200): Yes/no factual statements ("Paris is in France")
Sports (n=100): Yes/no sports knowledge questions
Contextual Tasks (information extracted from input context):
IMDB (n=100): Sentiment classification from movie reviews
Passkey (n=20): Retrieve 5-digit number hidden in filler text
Disruption Method: We replace the top-1 massive dimension per head with the mean value.
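A toy version of this replacement on a captured projection output (shapes assumed; this is a sketch of the idea, not the original implementation):

```python
import torch

def disrupt_top1_per_head(proj_out, num_heads):
    """Replace each head's single largest-norm dimension with that head's
    mean activation (sketch of the Experiment 3 disruption)."""
    b, s, d = proj_out.shape
    x = proj_out.view(b, s, num_heads, d // num_heads).clone()
    norms = x.pow(2).sum(dim=(0, 1)).sqrt()  # M[head, dim]
    top = norms.argmax(dim=-1)               # top-1 dimension per head
    for h in range(num_heads):
        # RHS mean is computed before the assignment overwrites the dim
        x[:, :, h, top[h]] = x[:, :, h].mean()
    return x.view(b, s, d)
```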
Results
The results can be seen below.
| Model | Category | Task | Baseline | Disrupted | Degradation |
| --- | --- | --- | --- | --- | --- |
| RoPE | parametric | cities | 89.5% | 57.5% | 35.8% |
| RoPE | parametric | sports | 60.0% | 52.0% | 13.3% |
| RoPE | contextual | imdb | 44.0% | 5.0% | 88.6% |
| RoPE | contextual | passkey | 100.0% | 0.0% | 100.0% |
| DroPE | parametric | cities | 79.0% | 88.5% | −12.0% |
| DroPE | parametric | sports | 73.0% | 67.0% | 8.2% |
| DroPE | contextual | imdb | 30.0% | 15.0% | 50.0% |
| DroPE | contextual | passkey | 60.0% | 60.0% | 0.0% |
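For reference, the degradation column is just the relative accuracy drop (negative means the disrupted model improved):

```python
def degradation(baseline, disrupted):
    """Percent accuracy drop relative to baseline (negative = improvement)."""
    return 100 * (baseline - disrupted) / baseline
```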
Aggregated by category, it looks like:
Figure 5: Degradation comparison across all tasks. RoPE shows significant contextual degradation, while DroPE is more robust.
This is basically what we said in the summary: we replicate Jin et al.'s finding that, under RoPE, massive values are critical for contextual knowledge tasks but not for parametric knowledge tasks. In DroPE, however, contextual tasks degrade far less, and parametric tasks even improve slightly.
So massive values appear non-functional for information storage. The passkey task shows this best.
Figure 6: Passkey retrieval results. RoPE completely collapses (100% degradation) while DroPE is entirely unaffected (0% degradation).
This seems to demonstrate that DroPE's contextual retrieval mechanism is entirely independent of massive values, and the model has learned alternative attention patterns that don't rely on value concentration.
Interpretation
How should we interpret these results? The DroPE improvement under disruption suggests massive values are vestigial artifacts from the original RoPE training. To recap the evidence:
During RoPE pretraining, massive values encode positional information
DroPE removes positional embeddings and recalibrates
Massive values persist structurally but lose their function
They may actually create interference, explaining the improvement when disrupted
Figure 7: Average degradation by knowledge category. RoPE's contextual knowledge degrades 3.8× more than parametric.
Figure 8: Baseline vs. disrupted accuracy for all tasks. Note DroPE's Cities accuracy actually improves under disruption.
Figure 9: Combined summary of replicating some of Jin et al.'s results.
Resolving the Contradiction
The apparent contradiction between the papers dissolves once we distinguish where massive values come from versus how they're used:
Massive values are learned into weights during RoPE training, as opposed to being created by RoPE at inference. The projection matrices W_Q and W_K develop these concentration patterns because RoPE's frequency structure during training creates gradients that favor certain dimensions.
RoPE at inference makes massive values functionally critical. The rotation operation couples these concentrated features to position-dependent attention patterns. Remove the rotation, and the model breaks because the model doesn't know how to use them without positional modulation.
DroPE enables longer contexts. Our findings suggest a mechanism:
RoPE concentrates attention in specific dimensions via massive values
This concentration may create bottlenecks at long contexts
DroPE distributes attention more evenly, potentially enabling better generalization to longer sequences
Why did I do this?
Understanding how and why large language models work in a principled way will require knowing the internal mechanisms of the transformer stack very deeply. While many components (such as attention, MLPs, and residual connections) are now relatively well studied, positional encoding remains surprisingly opaque (at least, to me). In particular, Rotary Positional Embeddings (RoPE) are both weird and brittle: they strongly shape attention behavior, impose hard-to-reason-about constraints on context length, and interact nontrivially with model scaling, quantization, and alignment.
I find RoPE is often wonky to work with, where small changes in frequency scaling or context length can produce disproportionate failures, and extending context reliably seems like it will require delicate engineering. Also, I had a free hour, Claude Code, and reread Yoav's paper while at the gym earlier this morning.
Limitations and Future Work
Models tested: We examined only Llama-2-7B, as it was the largest DroPE model mentioned in the paper. Also, I can't find the other Sakana models on HuggingFace. Larger models and different architectures may show different patterns.
Recalibration dynamics: We compared endpoints (RoPE vs. fully recalibrated DroPE). Tracking massive values during recalibration would reveal how redistribution occurs.
What is actually going on? I'm so confused. I'd like a nicer, clearer mechanistic explanation of what's actually happening, as in the actual computations.
@inproceedings{jin2025massive,
  title     = {Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author    = {Jin, Mingyu and others},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}

@article{gelberg2025drope,
  title   = {Dropping Positional Embeddings for Zero-Shot Long-Context Extension},
  author  = {Gelberg, Tal and others},
  journal = {arXiv preprint arXiv:2512.12167},
  year    = {2025}
}

@techreport{africa2026massive,
  title  = {Massive Activations in DroPE: Evidence for Attention Reorganization},
  author = {Africa, David},
  year   = {2026},
  url    = {https://github.com/DavidDemitriAfrica/drope-activations}
}
Reproducibility
Code
All experiments can be reproduced from the repository at https://github.com/DavidDemitriAfrica/drope-activations.