Large Language Models (LLMs) are defined by their scale. This scale is the source of their power, but it's also their fundamental weakness, creating two massive bottlenecks that gatekeep their use:
The field of model compression has emerged to tackle these problems. It's a rich and complex domain, and it's helpful to categorize the different approaches.
Much of the work in this area is brilliant, but I believe the most robust and generalizable solutions will come from a first-principles understanding of the systems we are trying to compress. This post outlines my perspective on this, starting with the training problem and then focusing on our current work on the inference problem.
Decentralized training, where multiple, non-co-located GPUs (or organizations) collaborate, is a promising path to democratizing LLM training. The primary bottleneck here isn't just compute; it's communication. How do you efficiently share weights and gradients over a noisy, low-bandwidth network?
This is a classic problem from my home field of engineering. It's not just a "CS problem"; it's a source coding and channel coding problem.
A recent paper I deeply admire, first recommended to me by my brilliant mentee Tamaz from the Algoverse program, is "Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism." It is a perfect example of this "first-principles" taste. Instead of just adding heuristics, the authors apply classic mathematical frameworks to design a provably efficient communication protocol for distributed learning. This is the kind of work I believe in: it's elegant, built on a solid theoretical backbone, grounded in strong empirical results, and provides a clear, generalizable path forward.
My current research focuses on the second bottleneck: inference on-device.
When you use an LLM for a long conversation, the primary memory hog isn't the model's weights (which are fixed); it's the Key-Value (KV) Cache. This is the "scratchpad" the model uses to remember the conversation's history. Every new token you add requires the model to store more and more activation data, and this cache can quickly grow to be several gigabytes, overwhelming any consumer device.
The standard approach to solving this is, understandably, a collection of clever engineering tricks. These include things like "drop the oldest tokens," "only keep the most 'important' tokens," or "use a smaller, distilled model to guess the cache." These methods often work, and I respect the ingenuity. However, I also think this type of approach can feel a bit like "black-box poking." It's hard to know why one trick works better than another, and they can feel like a bunch of inconsistent hacks that may not generalize. That said, I do believe they are worth more than purely theoretical stories.
My philosophy is simple: we can't truly compress what we don't first understand.
Instead of just guessing what's important, we are building a thorough, deep understanding of the Transformer's attention mechanism first. Only then can we design a principled, mathematically-grounded compression algorithm based on the structures we find.
At the end of the day, we are trying to move from engineering tricks to applied science.
This approach is the basis for a practical algorithm we developed, which recently achieved a high ranking in a "High-Precision Low-Latency Sparse Attention Algorithm" competition. I feel the method and the thought process are worth sharing, even as we build on it. The current scores and timings are based on the competition's specific platform and LLM. I'm now working to generalize this algorithm, improve its robustness, and apply it to a wider range of models and datasets for a formal publication.
Our approach is grounded in a brief analysis of the attention mechanism's internal dynamics, guided by the principles of interpretability. The competition's highly constrained runtime and memory environment ruled out many complex, multi-stage compression techniques, forcing us to devise a solution that is both theoretically sound and computationally frugal. Our key innovations are twofold: 1) using interpretability insights to perform head-specific feature selection, and 2) introducing a signal from the Value vectors, inspired by information theory, to compensate for the inherent limitations of Key-only approximations.
The core of the Transformer architecture is its self-attention mechanism, which enables the model to capture dependencies between any two positions in a sequence. For given Query (Q), Key (K), and Value (V) matrices, the computation is formally expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors. During autoregressive decoding, to avoid re-computation, the $K$ and $V$ vectors of all previous tokens are cached, forming the KV Cache. For a sequence of length $L$, the memory footprint of this cache scales as $O(L \cdot N \cdot H \cdot d_h)$, where $N$ is the number of layers, $H$ is the number of heads, and $d_h$ is the hidden dimension per head. When processing long contexts, this cache becomes the primary bottleneck for both memory and latency.
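To make the scale concrete, here is a back-of-the-envelope calculation; the configuration below is a hypothetical 7B-class FP16 setup, not the competition model:

```python
# Rough KV Cache footprint: the factor 2 covers both K and V per layer, head, and token
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, d_head=128, bytes_per_elem=2):
    return 2 * seq_len * n_layers * n_heads * d_head * bytes_per_elem

print(kv_cache_bytes(8192) / 1e9)  # ~4.3 GB for an 8K-token context in FP16
```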
To mitigate this, Sparse Attention aims to compute attention over a subset of the KV pairs. We can define a selection function $S$ that maps the query to a sparse index set $\mathcal{I} = S(Q)$, retaining only the most important KV pairs. The optimization objective is to minimize the distortion between the sparse attention output and the full attention output:

$$\min_{\mathcal{I}} \;\bigl\lVert \mathrm{Attention}(Q, K_{\mathcal{I}}, V_{\mathcal{I}}) - \mathrm{Attention}(Q, K, V) \bigr\rVert \quad \text{s.t.} \quad |\mathcal{I}| \le B,$$

where $\mathcal{I}$ is the index set of selected KV Cache blocks and $B$ is the block budget implied by the retention rate. The competition's goal, maximizing model accuracy for a given block retention rate (sparsity), is a practical application of this optimization problem.
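As a sanity check on this objective, here is a minimal numpy sketch (single query vector, illustrative helper names of my own) that measures the output distortion introduced by restricting attention to a retained subset of cached positions:

```python
import numpy as np

def attention(q, K, V):
    # Single-query attention: q is [d], K and V are [n, d]
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def output_distortion(q, K, V, selected_idx):
    # L2 distance between full attention and attention over the retained subset
    full = attention(q, K, V)
    sparse = attention(q, K[selected_idx], V[selected_idx])
    return np.linalg.norm(full - sparse)
```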
Existing KV Cache sparsification strategies primarily fall into two categories: static policies, which keep a fixed pattern of positions (such as the initial "attention sink" and the most recent window) regardless of the query, and dynamic policies, which select KV pairs based on the current query.
A recent and highly influential dynamic policy is Quest, which serves as the primary reference for our work and for many teams in the competition. Quest proposes an elegant and efficient method for query-aware sparsity.
Criticality Estimation: During decoding, to estimate the importance of a block for a given query vector $q$, Quest calculates an upper bound on the dot-product score any Key in this block could achieve. This is done by taking the channel-wise maximum of the products with the min and max vectors:

$$\mathrm{score}(B) = \sum_{i=1}^{d} \max\left(q_i \cdot k^{\min}_{B,i},\; q_i \cdot k^{\max}_{B,i}\right)$$
Our work begins by acknowledging the power of Quest's upper-bound estimation but seeks to compensate for this "approximation gap" by introducing a complementary signal based on information theory.
Our final algorithm was the culmination of a principled, evolutionary design process. Constrained by the competition's strict resource limits on edge devices, we started with a Quest-inspired baseline and iteratively refined it. This journey was guided by two core principles: 1) insights from interpretability and visual probes to create a more efficient representation of the Key vectors, and 2) concepts from information theory to re-introduce the importance signal from the Value vectors, which is lost in Key-only approximations.
This process can be viewed through the lens of rate-distortion theory. Our goal is to design a lossy compression scheme for the KV Cache.
- `prepare`: This function's job is not merely to delete data, but to create a highly efficient "codeword" (`KV_repre`) for each cache block. A good codeword summarizes the block's essential features using minimal bits.
- `search`: This function uses the query vector as a "key" to rapidly retrieve the most relevant codewords, minimizing the "distortion" (difference in attention output) for a given "rate" (sparsity level).

Our evolution from V1 to V3 reflects a deepening understanding of how to design this optimal encoding scheme for practical scenarios.
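To illustrate how these two functions fit together at decode time, here is a minimal sketch; the block layout, `search_fn` hook, and budget handling are my simplifications, not the competition API:

```python
import numpy as np

# `block_repres` holds one codeword per cached block, produced by `prepare`;
# `search_fn` is any of the search_v* scorers described below.
def select_blocks(query, block_repres, search_fn, retention_rate=0.1):
    scores = np.array([search_fn(query, repre) for repre in block_repres])
    # Keep only the top fraction of blocks allowed by the sparsity budget
    n_keep = max(1, int(round(len(block_repres) * retention_rate)))
    return np.argsort(scores)[-n_keep:]
```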
Step 1: Adapting the Classic (The "Trial" Phase)
Our journey began by implementing a faithful version of the core idea in Quest. We represent each block of Key vectors with a single, full-dimensional Axis-Aligned Bounding Box (AABB).
prepare (V1): For each block and head, compute the min/max vectors across all dimensions of the Keys.
```python
import numpy as np

# V1 `prepare` logic for a single block and head
def prepare_v1(key_vectors_in_block):
    # key_vectors_in_block has shape [block_size, head_dim]
    min_k = key_vectors_in_block.min(axis=0)
    max_k = key_vectors_in_block.max(axis=0)
    # K_repre is the concatenation of the two boundary vectors
    k_repre = np.concatenate([min_k, max_k])
    return k_repre
```

search (V1): Use the Quest upper-bound score to rank blocks.
```python
# V1 `search` logic for a single block and head
def search_v1(query_vector, k_repre):
    min_k, max_k = np.split(k_repre, 2)
    # Element-wise product with both bounds
    prod1 = query_vector * min_k
    prod2 = query_vector * max_k
    # Take the max at each channel and sum to get the upper-bound score
    upper_bound_score = np.maximum(prod1, prod2).sum()
    return upper_bound_score
```

Our investigation, which involved visualizing attention patterns and feature activations, revealed why this baseline fell short: for any given head, the truly salient information is concentrated in a sparse subset of the dimensions:
Figure 1: Channel Importance Heatmap
By creating a bounding box over all dimensions, the "signal" from these few critical dimensions was drowned out by the "noise" from the many irrelevant ones, making the upper-bound score a poor predictor of true importance.
Step 2: Probing the KV Cache (The "Understand" Phase)
This failure led to our first key innovation: if each head focuses on a sparse set of features, our bounding box should too. We must perform principled, head-specific feature selection before creating the bounding box.
Interpretability Insight: Our interpretability studies, visualizing feature importance across the head_dim for various attention heads, revealed a striking pattern. The dimensions contributing most significantly to attention scores were consistently concentrated in specific spectral bands, notably within the ranges of approximately [25%, 50%] and [75%, 100%] of the feature vector. This observation (conceptualized in the heatmap in Figure 1) suggested that we could achieve a near 50% compression of the K-representation by focusing only on these information-dense regions.
Head Specialization Trick (TDM Analogy): A crucial secondary insight is that different heads specialize. To prevent all heads from collapsing onto the same feature subspace, we introduce a small, head-dependent shift to the sampling windows. This acts as a form of "feature division multiplexing," ensuring that across the model, a diverse range of features is monitored. This is analogous to cutting a high-resolution image into several lower-resolution pieces; while each piece is less detailed, the collection of pieces can reconstruct a more complete view, as shown in Figure 2.
Figure 2: Illustration of the TDM Analogy for Feature Selection
This insight leads to our V2 algorithm:
prepare (V2): We implement this insight by programmatically selecting a sub-vector of key dimensions for each head before computing the bounding box.
```python
import numpy as np

# V2 `prepare` logic for the K-Representation
def prepare_v2(key_vectors_in_block, head_index, head_dim):
    # 1. Calculate a small, head-dependent shift
    shift_percent = (head_index % 4) * 0.02
    # 2. Define the two sampling intervals (the second is clipped to head_dim)
    start1 = int((0.25 + shift_percent) * head_dim)
    end1 = start1 + int(0.25 * head_dim)
    start2 = int((0.75 + shift_percent) * head_dim)
    end2 = min(start2 + int(0.25 * head_dim), head_dim)
    # 3. Assemble the indices from these two intervals
    indices = np.concatenate([np.arange(start1, end1), np.arange(start2, end2)])
    # 4. Gather the critical sub-dimensions from the original K vectors
    k_selected = key_vectors_in_block[:, indices]
    # 5. Compute min and max ONLY on these critical dimensions
    min_k_sub = k_selected.min(axis=0)
    max_k_sub = k_selected.max(axis=0)
    # 6. The representation is the concatenation of these compact bounding boxes
    k_repre = np.concatenate([min_k_sub, max_k_sub])
    return k_repre
```

search (V2): The scoring logic is identical to V1, but now operates on the query's selected sub-vector and the compact bounding box. This yields a much tighter and more meaningful upper-bound score. In other words, this is a form of feature selection. By encoding only the high-signal dimensions identified via our heatmap analysis, we dramatically improve the signal-to-noise ratio (SNR) of our "codeword," leading to a much more efficient representation.
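For completeness, here is a minimal sketch of the V2 search, assuming the window computation from prepare_v2 is factored into a helper (`head_indices` is my name for it, not the competition code):

```python
import numpy as np

def head_indices(head_index, head_dim):
    # Mirrors the two sampling windows used in prepare_v2
    shift_percent = (head_index % 4) * 0.02
    start1 = int((0.25 + shift_percent) * head_dim)
    start2 = int((0.75 + shift_percent) * head_dim)
    width = int(0.25 * head_dim)
    return np.concatenate([np.arange(start1, start1 + width),
                           np.arange(start2, min(start2 + width, head_dim))])

# V2 `search`: V1 scoring applied to the query's selected sub-vector
def search_v2(query_vector, k_repre, head_index, head_dim):
    query_sub = query_vector[head_indices(head_index, head_dim)]
    min_k_sub, max_k_sub = np.split(k_repre, 2)
    return np.maximum(query_sub * min_k_sub, query_sub * max_k_sub).sum()
```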
Performance improved greatly. We had successfully addressed the "noise" issue in the K-space. However, we were still only looking at one side of the attention equation: the system captured query-aware relevance (QK alignment) but remained blind to query-agnostic importance, which is encoded in the V vectors.
Step 3: Principled Compression (The "Compress" Phase)
On reflection, a block might be highly relevant (high K-Score) yet contain redundant or low-value information. To address this, we introduced a second signal from the Value vectors, representing a token's intrinsic, query-agnostic importance.
Our initial thought was that the "energy" or "information density" of a token is captured by the norm of its V vector. We first tried using the peak energy, max(norm(V)), within a block as a signal. However, this proved too "spiky" and unstable. A much more robust and stable measure of a block's overall information content was its total energy: sum(norm(V)).
Our final algorithm fuses the query-aware K-Score (relevance) with this query-agnostic V-Score for importance.
K-Score (Relevance): This is calculated as in V2, using the refined, interpretability-informed K-representation. It is an efficient approximation of the dot product, checking if the query's key dimensions fall within the block's "bounding box."
V-Score (Prior Importance): We represent the block's total information content by summing the L2 norms of all its V vectors. This measures how much "content" the block contributes overall.
We then use Softmax to convert these absolute energy scores into a distribution of relative importance weights across all blocks.
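Concretely, the V-side of `prepare` might look like the following sketch (names are illustrative; the softmax is applied per head once all block energies are known):

```python
import numpy as np

def v_block_energy(value_vectors_in_block):
    # Total "energy" of a block: sum of the L2 norms of its Value vectors
    return np.linalg.norm(value_vectors_in_block, axis=-1).sum()

def v_weights(value_blocks):
    # Softmax over blocks turns absolute energies into relative importance weights
    energies = np.array([v_block_energy(v) for v in value_blocks])
    exp = np.exp(energies - energies.max())
    return exp / exp.sum()
```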
```python
import numpy as np

# V103 `search` logic for the final score
def search_v103(query_sub, k_repre, v_weight, alpha=0.3):
    # 1. K-Score (Content Relevance)
    min_k_sub, max_k_sub = np.split(k_repre, 2)
    score_k = np.maximum(query_sub * min_k_sub, query_sub * max_k_sub).sum()
    # 2. V-Score (Prior Importance) is pre-calculated as v_weight
    # 3. Final Score (Fusion)
    final_score = score_k * (1 + alpha * v_weight)
    return final_score
```

For clarity, the mathematical formulation is:

$$\mathrm{Score}(B) = \mathrm{Score}_K(B) \cdot \left(1 + \alpha \cdot \mathrm{softmax}_B\!\left(\sum_{i \in B} \lVert v_i \rVert_2\right)\right)$$

where $\mathrm{Score}_K(B)$ is the V2 upper-bound score computed on the selected sub-dimensions, and the softmax is taken over all blocks.
Positional Safety Net: Finally, our implementation implicitly boosts the scores of the very first block (the "Attention Sink") and the most recent blocks, incorporating the wisdom of static policies to guarantee foundational context integrity. This complete algorithm—combining K's fine-grained encoding with V's energy-awareness—forms the core of our winning solution.
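Putting the pieces together, the block-selection step with this safety net could be sketched as follows (the forced-keep rule and budget handling are simplified for illustration):

```python
import numpy as np

def select_blocks_v3(scores, n_keep, n_recent=2):
    # `scores` holds the fused K/V score of each block, in positional order
    n_blocks = len(scores)
    forced = {0, *range(max(0, n_blocks - n_recent), n_blocks)}  # sink + recent blocks
    ranked = [i for i in np.argsort(scores)[::-1] if i not in forced]
    budget = max(0, n_keep - len(forced))
    return sorted(forced | set(ranked[:budget]))
```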
Our work on the V3 algorithm, with its novel K-V hybrid scoring, represents a distinct and principled approach to KV Cache sparsification. By analyzing our method in the context of other top-performing solutions from the competition, we can identify key learnings, outline promising research directions, and envision the broader technical impact of this work.
A comparison with the other two top-ranking solutions (referred to as "Team A" and "Team B") reveals quite different, yet potentially complementary approaches to this compression problem.
| Dimension | Team A | Team B | Our Method (V3) |
|---|---|---|---|
| Core Scoring Logic | Standard Quest Logic: max(Q*K_min, Q*K_max) | Standard Quest Logic: max(Q*K_min, Q*K_max) | Hybrid Logic: K_Score * (1 + alpha * V_Score_softmax) |
| GQA Handling | Expanded KV representation to match the number of Q heads. | Expanded and reshaped KV representation. | Reshaped Q heads to match the number of KV heads. |
| Heuristic Tuning | Adaptive sparsity ratio; mandatory keep for first/last blocks. | Heavy manual tuning: fixed multipliers ("magic numbers," e.g., *1.5, *0.8) for specific block positions. | Principled Heuristic: V-Score influence is globally controlled by a single, interpretable alpha parameter; mandatory keep for first/last blocks. |
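To make the GQA-handling row above concrete, here is a minimal sketch of the two strategies; the shapes and the mean-reduction used to fold Q heads are my assumptions for illustration, not the teams' exact code:

```python
import numpy as np

# Q has shape [n_q_heads, d]; the block's KV representation has shape [n_kv_heads, d]
def expand_kv(kv_repre, n_q_heads):
    # Teams A/B: repeat each KV-head representation to line up with its group of Q heads
    group = n_q_heads // kv_repre.shape[0]
    return np.repeat(kv_repre, group, axis=0)               # -> [n_q_heads, d]

def reshape_q(q, n_kv_heads):
    # Our method: fold each group of Q heads onto its shared KV head (here by averaging)
    group = q.shape[0] // n_kv_heads
    return q.reshape(n_kv_heads, group, -1).mean(axis=1)    # -> [n_kv_heads, d]
```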
Our Unique Value: Our primary contribution is the pioneering of a K-V hybrid paradigm. We were the only team to formally introduce the V-tensor as an independent, query-agnostic information source to refine the relevance-based score. Furthermore, our V-Score modulation, governed by a global alpha and a Softmax distribution, is a more elegant and generalizable framework than the hard-coded "magic numbers" used in other solutions. It provides a more interpretable and robust method for incorporating prior importance.
Self-Critique and Key Learnings on Our Limitations:
Conclusion: Our algorithm is a highly effective information-encoding and feature-engineering solution. The top-performing solutions demonstrate the power of advanced quantization techniques and meticulous engineering tuning. These two directions are not mutually exclusive. In fact, their combination points toward a more powerful, unified approach.
The insights gained suggest several avenues for future work:
- The KV_repre codeword could be subjected to aggressive quantization (e.g., INT8 or INT4) to achieve a multiplicative effect on compression, merging the benefits of feature selection and precision reduction (a minimal sketch follows at the end of this section).
- The alpha parameter and the hard-coded feature selection windows can be evolved, especially with better interpretability probes beyond simple visualization. Future work could explore learning the optimal feature windows for the K-representation per layer, or even dynamically based on the input. Similarly, the alpha parameter, which balances relevance and importance, could be made dynamic or even predicted by a small, auxiliary neural network, allowing the model to adapt its caching strategy on the fly.

Beyond the competition, this line of research has significant potential for technological and commercial application: empowering edge AI, boosting cloud AI throughput, providing a foundation for new model architectures, and more. From my perspective, this research validates two essential ideas: the KV Cache is a highly compressible and manageable data structure, and classical information and coding theory can provide a lens for optimizing the LLM's components. This could inspire cache-aware Transformer architectures that take further concepts from communication and information theory (e.g., channels, interference) into consideration. That would make efficient inference a feature of the model design rather than a post-hoc optimization, and could foster a richer collaboration between the LLM and EE communities.
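As a concrete illustration of the quantization direction above (a minimal sketch under my own assumptions, not part of the competition solution), symmetric per-vector INT8 quantization of the codeword could look like this:

```python
import numpy as np

def quantize_repre_int8(k_repre):
    # Symmetric per-vector quantization: one FP scale plus an INT8 payload
    scale = np.abs(k_repre).max() / 127.0 + 1e-12
    q = np.clip(np.round(k_repre / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_repre(q, scale):
    return q.astype(np.float32) * scale
```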
I am constantly asked about the practical applicability of mechanistic interpretability. "What's the point of finding circuits," people ask, "if it doesn't help us build better models?"
This is the point, or at least one of them.
Mech interp is not just an "AI biology" exercise for its own sake. It is the diagnostic toolkit that allows us to be better, more principled engineers. It's the "MRI machine" that lets the "AI doctor" see which organ performs the targeted function before attempting surgery.
I'm actively looking for collaborators and very open to suggestions. If this approach to compression excites you, please get in touch.
P.S. A Metaphysical Coda
A philosophical analogy occurred to me while I was developing this algorithm and inspired me in an unexpected way; I ended up drafting a short science fiction novel exploring its core concepts and reflecting on my recent experience of industry work. If you're interested, you can take a look here: "Science Fiction: The Compressed Universe"