Large Language Models (LLMs) are defined by their scale. This scale is the source of their power, but it's also their fundamental weakness, creating two massive bottlenecks that gatekeep their use:
The field of model compression has emerged to tackle these problems. It's a rich and complex domain, and it's helpful to categorize the different approaches.
Much of the work in this area is brilliant, but I believe the most robust and generalizable solutions will come from a first-principles understanding of the systems we are trying to compress. This post outlines my perspective on this, starting with the training problem and then focusing on our current work on the inference problem.
Decentralized training, where multiple, non-co-located GPUs (or organizations) collaborate, is a promising path to democratizing LLM training. The primary bottleneck here isn't just compute; it's communication. How do you efficiently share weights and gradients over a noisy, low-bandwidth network?
This is a classic problem from my home field of engineering. It's not just a "CS problem"; it's a source coding and channel coding problem.
A recent paper I deeply admire, first recommended to me by my brilliant mentee Tamaz from the Algoverse program, is "Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism." It is a perfect example of this "first-principles" taste. Instead of just adding heuristics, the authors apply classic mathematical frameworks to design a provably efficient communication protocol for distributed learning. This is the kind of work I believe in: it's elegant, built on a solid theoretical backbone, grounded in strong empirical results, and provides a clear, generalizable path forward.
My current research focuses on the second bottleneck: inference on-device.
When you use an LLM for a long conversation, the primary memory hog isn't the model's weights (which are fixed); it's the Key-Value (KV) Cache. This is the "scratchpad" the model uses to remember the conversation's history. Every new token you add requires the model to store more and more activation data, and this cache can quickly grow to be several gigabytes, overwhelming any consumer device.
The standard approach to solving this is, understandably, a collection of clever engineering tricks. These include things like "drop the oldest tokens," "only keep the most 'important' tokens," or "use a smaller, distilled model to guess the cache." These methods often work, and I respect the ingenuity. However, I also think this type of approach can feel a bit like "black-box poking." It's hard to know why one trick works better than another, and they can feel like a bunch of inconsistent hacks that may not generalize. That said, I do believe they are worth more than purely theoretical stories.
My philosophy is simple: we can't truly compress what we don't first understand.
Instead of just guessing what's important, we are building a thorough, deep understanding of the Transformer's attention mechanism first. Only then can we design a principled, mathematically-grounded compression algorithm based on the structures we find.
At the end of the day, we are trying to move from engineering tricks to applied science.
This approach is the basis for a practical algorithm we developed, which recently achieved a high ranking in a "High-Precision Low-Latency Sparse Attention Algorithm" competition. I feel the method and the thought process are worth sharing, even as we build on it. The current scores and timings are based on the competition's specific platform and LLM. I'm now working to generalize this algorithm, improve its robustness, and apply it to a wider range of models and datasets for a formal publication.
Our approach is grounded in a brief analysis of the attention mechanism's internal dynamics, guided by the principles of interpretability. The competition's highly constrained runtime and memory environment ruled out many complex, multi-stage compression techniques, forcing us to devise a solution that is both theoretically sound and computationally frugal. Our key innovations are twofold: 1) using interpretability insights to perform head-specific feature selection, and 2) introducing a signal from the Value vectors, inspired by information theory, to compensate for the inherent limitations of Key-only approximations.
The core of the Transformer architecture is its self-attention mechanism, which enables the model to capture dependencies between any two positions in a sequence. For given Query (Q), Key (K), and Value (V) matrices, the computation is formally expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors. During autoregressive decoding, to avoid re-computation, the $K$ and $V$ vectors of all previous tokens are cached, forming the KV Cache. For a sequence of length $L$, the memory footprint of this cache scales as $O(L \cdot N \cdot H \cdot d_h)$, where $N$ is the number of layers, $H$ is the number of heads, and $d_h$ is the hidden dimension per head. When processing long contexts, this cache becomes the primary bottleneck for both memory and latency.
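To make the scale concrete, here is a back-of-the-envelope calculation; the configuration below is a hypothetical 7B-class FP16 setup, not the competition model:

```python
# Rough KV Cache footprint: the factor 2 covers both K and V per layer, head, and token
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, d_head=128, bytes_per_elem=2):
    return 2 * seq_len * n_layers * n_heads * d_head * bytes_per_elem

print(kv_cache_bytes(8192) / 1e9)  # ~4.3 GB for an 8K-token context in FP16
```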
To mitigate this, Sparse Attention aims to compute attention over a subset of the KV pairs. We can define a selection function $S$ that maps the query to a sparse index set $\mathcal{I} = S(Q)$, retaining only the most important KV pairs. The optimization objective is to minimize the distortion between the sparse attention output and the full attention output:

$$\min_{\mathcal{I}} \;\bigl\lVert \mathrm{Attention}(Q, K_{\mathcal{I}}, V_{\mathcal{I}}) - \mathrm{Attention}(Q, K, V) \bigr\rVert \quad \text{s.t.} \quad |\mathcal{I}| \le B,$$

where $\mathcal{I}$ is the index set of selected KV Cache blocks and $B$ is the block budget implied by the retention rate. The competition's goal, maximizing model accuracy for a given block retention rate (sparsity), is a practical application of this optimization problem.
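As a sanity check on this objective, here is a minimal numpy sketch (single query vector, illustrative helper names of my own) that measures the output distortion introduced by restricting attention to a retained subset of cached positions:

```python
import numpy as np

def attention(q, K, V):
    # Single-query attention: q is [d], K and V are [n, d]
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def output_distortion(q, K, V, selected_idx):
    # L2 distance between full attention and attention over the retained subset
    full = attention(q, K, V)
    sparse = attention(q, K[selected_idx], V[selected_idx])
    return np.linalg.norm(full - sparse)
```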
Existing KV Cache sparsification strategies primarily fall into two categories: static policies, which keep a fixed pattern of positions (such as the initial "attention sink" and the most recent window) regardless of the query, and dynamic policies, which select KV pairs based on the current query.
A recent and highly influential dynamic policy is Quest, which serves as the primary reference for our work and for many teams in the competition. Quest proposes an elegant and efficient method for query-aware sparsity.
Criticality Estimation: During decoding, to estimate the importance of a block for a given query vector $q$, Quest calculates an upper bound on the dot-product score any Key in this block could achieve. This is done by taking the channel-wise maximum of the products with the min and max vectors:

$$\mathrm{score}(B) = \sum_{i=1}^{d} \max\left(q_i \cdot k^{\min}_{B,i},\; q_i \cdot k^{\max}_{B,i}\right)$$
Our work begins by acknowledging the power of Quest's upper-bound estimation but seeks to compensate for this "approximation gap" by introducing a complementary signal based on information theory.
Our final algorithm was the culmination of a principled, evolutionary design process. Constrained by the competition's strict resource limits on edge devices, we started with a Quest-inspired baseline and iteratively refined it. This journey was guided by two core principles: 1) insights from interpretability and visual probes to create a more efficient representation of the Key vectors, and 2) concepts from information theory to re-introduce the importance signal from the Value vectors, which is lost in Key-only approximations.
This process can be viewed through the lens of rate-distortion theory. Our goal is to design a lossy compression scheme for the KV Cache.
- `prepare`: This function's job is not merely to delete data, but to create a highly efficient "codeword" (`KV_repre`) for each cache block. A good codeword summarizes the block's essential features using minimal bits.
- `search`: This function uses the query vector as a "key" to rapidly retrieve the most relevant codewords, minimizing the "distortion" (difference in attention output) for a given "rate" (sparsity level).

Our evolution from V1 to V3 reflects a deepening understanding of how to design this optimal encoding scheme for practical scenarios.
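To illustrate how these two functions fit together at decode time, here is a minimal sketch; the block layout, `search_fn` hook, and budget handling are my simplifications, not the competition API:

```python
import numpy as np

# `block_repres` holds one codeword per cached block, produced by `prepare`;
# `search_fn` is any of the search_v* scorers described below.
def select_blocks(query, block_repres, search_fn, retention_rate=0.1):
    scores = np.array([search_fn(query, repre) for repre in block_repres])
    # Keep only the top fraction of blocks allowed by the sparsity budget
    n_keep = max(1, int(round(len(block_repres) * retention_rate)))
    return np.argsort(scores)[-n_keep:]
```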
Step 1: Adapting the Classic (The "Trial" Phase)
Our journey began by implementing a faithful version of the core idea in Quest. We represent each block of Key vectors with a single, full-dimensional Axis-Aligned Bounding Box (AABB).
prepare (V1): For each block and head, compute the min/max vectors across all dimensions of the Keys.
```python
import numpy as np

# V1 `prepare` logic for a single block and head
def prepare_v1(key_vectors_in_block):
    # key_vectors_in_block has shape [block_size, head_dim]
    min_k = key_vectors_in_block.min(axis=0)
    max_k = key_vectors_in_block.max(axis=0)
    # K_repre is the concatenation of the two boundary vectors
    k_repre = np.concatenate([min_k, max_k])
    return k_repre
```

search (V1): Use the Quest upper-bound score to rank blocks.
```python
# V1 `search` logic for a single block and head
def search_v1(query_vector, k_repre):
    min_k, max_k = np.split(k_repre, 2)
    # Element-wise product with both bounds
    prod1 = query_vector * min_k
    prod2 = query_vector * max_k
    # Take the max at each channel and sum to get the upper-bound score
    upper_bound_score = np.maximum(prod1, prod2).sum()
    return upper_bound_score
```

Our investigation, which involved visualizing attention patterns and feature activations, revealed why this baseline fell short: for any given head, the truly salient information is concentrated in a sparse subset of the dimensions:
Figure 1: Channel Importance Heatmap
By creating a bounding box over all dimensions, the "signal" from these few critical dimensions was drowned out by the "noise" from the many irrelevant ones, making the upper-bound score a poor predictor of true importance.
Step 2: Probing the KV Cache (The "Understand" Phase)
This failure led to our first key innovation: if each head focuses on a sparse set of features, our bounding box should too. We must perform principled, head-specific feature selection before creating the bounding box.
Interpretability Insight: Our interpretability studies, visualizing feature importance across the head_dim for various attention heads, revealed a striking pattern. The dimensions contributing most significantly to attention scores were consistently concentrated in specific spectral bands, notably within the ranges of approximately [25%, 50%] and [75%, 100%] of the feature vector. This observation (conceptualized in the heatmap in Figure 1) suggested that we could achieve a near 50% compression of the K-representation by focusing only on these information-dense regions.
Head Specialization Trick (TDM Analogy): A crucial secondary insight is that different heads specialize. To prevent all heads from collapsing onto the same feature subspace, we introduce a small, head-dependent shift to the sampling windows. This acts as a form of "feature division multiplexing," ensuring that across the model, a diverse range of features is monitored. This is analogous to cutting a high-resolution image into several lower-resolution pieces; while each piece is less detailed, the collection of pieces can reconstruct a more complete view, as shown in Figure 2.
Figure 2: Illustration of the TDM Analogy for Feature Selection
This insight leads to our V2 algorithm:
prepare (V2): We implement this insight by programmatically selecting a sub-vector of key dimensions for each head before computing the bounding box.
```python
import numpy as np

# V2 `prepare` logic for the K-Representation
def prepare_v2(key_vectors_in_block, head_index, head_dim):
    # 1. Calculate a small, head-dependent shift
    shift_percent = (head_index % 4) * 0.02
    # 2. Define the two sampling intervals (the second is clipped to head_dim)
    start1 = int((0.25 + shift_percent) * head_dim)
    end1 = start1 + int(0.25 * head_dim)
    start2 = int((0.75 + shift_percent) * head_dim)
    end2 = min(start2 + int(0.25 * head_dim), head_dim)
    # 3. Assemble the indices from these two intervals
    indices = np.concatenate([np.arange(start1, end1), np.arange(start2, end2)])
    # 4. Gather the critical sub-dimensions from the original K vectors
    k_selected = key_vectors_in_block[:, indices]
    # 5. Compute min and max ONLY on these critical dimensions
    min_k_sub = k_selected.min(axis=0)
    max_k_sub = k_selected.max(axis=0)
    # 6. The representation is the concatenation of these compact bounding boxes
    k_repre = np.concatenate([min_k_sub, max_k_sub])
    return k_repre
```

search (V2): The scoring logic is identical to V1, but now operates on the query's selected sub-vector and the compact bounding box. This yields a much tighter and more meaningful upper-bound score. In other words, this is a form of feature selection. By encoding only the high-signal dimensions identified via our heatmap analysis, we dramatically improve the signal-to-noise ratio (SNR) of our "codeword," leading to a much more efficient representation.
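For completeness, here is a minimal sketch of the V2 search, assuming the window computation from prepare_v2 is factored into a helper (`head_indices` is my name for it, not the competition code):

```python
import numpy as np

def head_indices(head_index, head_dim):
    # Mirrors the two sampling windows used in prepare_v2
    shift_percent = (head_index % 4) * 0.02
    start1 = int((0.25 + shift_percent) * head_dim)
    start2 = int((0.75 + shift_percent) * head_dim)
    width = int(0.25 * head_dim)
    return np.concatenate([np.arange(start1, start1 + width),
                           np.arange(start2, min(start2 + width, head_dim))])

# V2 `search`: V1 scoring applied to the query's selected sub-vector
def search_v2(query_vector, k_repre, head_index, head_dim):
    query_sub = query_vector[head_indices(head_index, head_dim)]
    min_k_sub, max_k_sub = np.split(k_repre, 2)
    return np.maximum(query_sub * min_k_sub, query_sub * max_k_sub).sum()
```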
Performance improved greatly. We had successfully addressed the "noise" issue in the K-space. However, we were still only looking at one side of the attention equation: the system captured query-aware relevance (QK alignment) but remained blind to query-agnostic importance, which is encoded in the V vectors.
Step 3: Principled Compression (The "Compress" Phase)
On reflection, a block might be highly relevant (high K-Score) yet contain redundant or low-value information. To address this, we introduced a second signal from the Value vectors, representing a token's intrinsic, query-agnostic importance.
Our initial thought was that the "energy" or "information density" of a token is captured by the norm of its V vector. We first tried using the peak energy, max(norm(V)), within a block as a signal. However, this proved too "spiky" and unstable. A much more robust and stable measure of a block's overall information content was its total energy: sum(norm(V)).
Our final algorithm fuses the query-aware K-Score (relevance) with this query-agnostic V-Score for importance.
K-Score (Relevance): This is calculated as in V2, using the refined, interpretability-informed K-representation. It is an efficient approximation of the dot product, checking if the query's key dimensions fall within the block's "bounding box."
V-Score (Prior Importance): We represent the block's total information content by summing the L2 norms of all its V vectors. This measures how much "content" the block contributes overall.
We then use Softmax to convert these absolute energy scores into a distribution of relative importance weights across all blocks.
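Concretely, the V-side of `prepare` might look like the following sketch (names are illustrative; the softmax is applied per head once all block energies are known):

```python
import numpy as np

def v_block_energy(value_vectors_in_block):
    # Total "energy" of a block: sum of the L2 norms of its Value vectors
    return np.linalg.norm(value_vectors_in_block, axis=-1).sum()

def v_weights(value_blocks):
    # Softmax over blocks turns absolute energies into relative importance weights
    energies = np.array([v_block_energy(v) for v in value_blocks])
    exp = np.exp(energies - energies.max())
    return exp / exp.sum()
```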
```python
import numpy as np

# V103 `search` logic for the final score
def search_v103(query_sub, k_repre, v_weight, alpha=0.3):
    # 1. K-Score (Content Relevance)
    min_k_sub, max_k_sub = np.split(k_repre, 2)
    score_k = np.maximum(query_sub * min_k_sub, query_sub * max_k_sub).sum()
    # 2. V-Score (Prior Importance) is pre-calculated as v_weight
    # 3. Final Score (Fusion)
    final_score = score_k * (1 + alpha * v_weight)
    return final_score
```

For clarity, the mathematical formulation is:

$$\mathrm{Score}(B) = \mathrm{Score}_K(B) \cdot \left(1 + \alpha \cdot \mathrm{softmax}_B\!\left(\sum_{i \in B} \lVert v_i \rVert_2\right)\right)$$

where $\mathrm{Score}_K(B)$ is the V2 upper-bound score computed on the selected sub-dimensions, and the softmax is taken over all blocks.
Positional Safety Net: Finally, our implementation implicitly boosts the scores of the very first block (the "Attention Sink") and the most recent blocks, incorporating the wisdom of static policies to guarantee foundational context integrity. This complete algorithm—combining K's fine-grained encoding with V's energy-awareness—forms the core of our winning solution.
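Putting the pieces together, the block-selection step with this safety net could be sketched as follows (the forced-keep rule and budget handling are simplified for illustration):

```python
import numpy as np

def select_blocks_v3(scores, n_keep, n_recent=2):
    # `scores` holds the fused K/V score of each block, in positional order
    n_blocks = len(scores)
    forced = {0, *range(max(0, n_blocks - n_recent), n_blocks)}  # sink + recent blocks
    ranked = [i for i in np.argsort(scores)[::-1] if i not in forced]
    budget = max(0, n_keep - len(forced))
    return sorted(forced | set(ranked[:budget]))
```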
Our work on the V3 algorithm, with its novel K-V hybrid scoring, represents a distinct and principled approach to KV Cache sparsification. By analyzing our method in the context of other top-performing solutions from the competition, we can identify key learnings, outline promising research directions, and envision the broader technical impact of this work.
A comparison with the other two top-ranking solutions (referred to as "Team A" and "Team B") reveals quite different, yet potentially complementary approaches to this compression problem.
| Dimension | Team A | Team B | Our Method (V3) |
|---|---|---|---|
| Core Scoring Logic | Standard Quest Logic: max(Q*K_min, Q*K_max) | Standard Quest Logic: max(Q*K_min, Q*K_max) | Hybrid Logic: K_Score * (1 + alpha * V_Score_softmax) |
| GQA Handling | Expanded KV representation to match the number of Q heads. | Expanded and reshaped KV representation. | Reshaped Q heads to match the number of KV heads. |
| Heuristic Tuning | Adaptive sparsity ratio; mandatory keep for first/last blocks. | Heavy manual tuning: fixed multipliers ("magic numbers," e.g., *1.5, *0.8) for specific block positions. | Principled Heuristic: V-Score influence is globally controlled by a single, interpretable alpha parameter; mandatory keep for first/last blocks. |
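To make the GQA-handling row above concrete, here is a minimal sketch of the two strategies; the shapes and the mean-reduction used to fold Q heads are my assumptions for illustration, not the teams' exact code:

```python
import numpy as np

# Q has shape [n_q_heads, d]; the block's KV representation has shape [n_kv_heads, d]
def expand_kv(kv_repre, n_q_heads):
    # Teams A/B: repeat each KV-head representation to line up with its group of Q heads
    group = n_q_heads // kv_repre.shape[0]
    return np.repeat(kv_repre, group, axis=0)               # -> [n_q_heads, d]

def reshape_q(q, n_kv_heads):
    # Our method: fold each group of Q heads onto its shared KV head (here by averaging)
    group = q.shape[0] // n_kv_heads
    return q.reshape(n_kv_heads, group, -1).mean(axis=1)    # -> [n_kv_heads, d]
```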
Our Unique Value: Our primary contribution is the pioneering of a K-V hybrid paradigm. We were the only team to formally introduce the V-tensor as an independent, query-agnostic information source to refine the relevance-based score. Furthermore, our V-Score modulation, governed by a global alpha and a Softmax distribution, is a more elegant and generalizable framework than the hard-coded "magic numbers" used in other solutions. It provides a more interpretable and robust method for incorporating prior importance.
Self-Critique and Key Learnings on Our Limitations:
Conclusion: Our algorithm is a highly effective information-encoding and feature-engineering solution. The top-performing solutions demonstrate the power of advanced quantization techniques and meticulous engineering tuning. These two directions are not mutually exclusive. In fact, their combination points toward a more powerful, unified approach.
The insights gained suggest several avenues for future work:
- The KV_repre codeword could be subjected to aggressive quantization (e.g., INT8 or INT4) to achieve a multiplicative effect on compression, merging the benefits of feature selection and precision reduction (a minimal sketch follows at the end of this section).
- The alpha parameter and the hard-coded feature selection windows can be evolved, especially with better interpretability probes beyond simple visualization. Future work could explore learning the optimal feature windows for the K-representation per layer, or even dynamically based on the input. Similarly, the alpha parameter, which balances relevance and importance, could be made dynamic or even predicted by a small, auxiliary neural network, allowing the model to adapt its caching strategy on the fly.

Beyond the competition, this line of research has significant potential for technological and commercial application: empowering edge AI, boosting cloud AI throughput, providing a foundation for new model architectures, and more. From my perspective, this research validates two essential ideas: the KV Cache is a highly compressible and manageable data structure, and classical information and coding theory can provide a lens for optimizing the LLM's components. This could inspire cache-aware Transformer architectures that take further concepts from communication and information theory (e.g., channels, interference) into consideration. That would make efficient inference a feature of the model design rather than a post-hoc optimization, and could foster a richer collaboration between the LLM and EE communities.
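As a concrete illustration of the quantization direction above (a minimal sketch under my own assumptions, not part of the competition solution), symmetric per-vector INT8 quantization of the codeword could look like this:

```python
import numpy as np

def quantize_repre_int8(k_repre):
    # Symmetric per-vector quantization: one FP scale plus an INT8 payload
    scale = np.abs(k_repre).max() / 127.0 + 1e-12
    q = np.clip(np.round(k_repre / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_repre(q, scale):
    return q.astype(np.float32) * scale
```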
I am constantly asked about the practical applicability of mechanistic interpretability. "What's the point of finding circuits," people ask, "if it doesn't help us build better models?"
This is the point, or at least one of them.
Mech interp is not just an "AI biology" exercise for its own sake. It is the diagnostic toolkit that allows us to be better, more principled engineers. It's the "MRI machine" that lets the "AI doctor" see which organ performs the targeted function before attempting surgery.
I'm actively looking for collaborators and very open to suggestions. If this approach to compression excites you, please get in touch.
P.S. A Metaphysical Coda
A philosophical analogy occurred to me while I was developing this algorithm and inspired me in an unexpected way; I ended up drafting a short science fiction novel exploring its core concepts and reflecting on my recent experience of industry work. If you're interested, you can take a look here: "Science Fiction: The Compressed Universe"