Standard language model evaluation has a structural problem that I do not think gets enough attention. The problem is not that our metrics are imprecise. It is that they are measuring something different from what we actually care about.
This post introduces a framework I have been developing called Coverage, describes the empirical methodology and results from an evaluation study on abstractive summarisation, and outlines a failure taxonomy that I think is more useful for diagnosing model behaviour than current approaches. I am preparing a series of formal papers. This post is an informal introduction to the core ideas for readers who want to engage with them before the papers are finalised.
The aggregate metric problem
When we evaluate a language model on a task, we typically compare its output to a reference using a scalar metric. For summarisation, ROUGE-L measures string overlap between generated and reference summaries. For question answering, accuracy measures whether the final answer matched. For translation, BLEU measures n-gram overlap.
These metrics have a shared structural property: they aggregate across all semantic dimensions simultaneously. Every type of information the model was supposed to preserve or produce gets collapsed into a single number. This is convenient. It is also, I want to argue, the source of systematic evaluation failures.
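To make the collapse concrete, here is a minimal sketch of a ROUGE-L-style score. It assumes lower-cased whitespace tokenisation and a sentence-level F1 over the longest common subsequence. The function names are illustrative, and production implementations add stemming and summary-level handling that this sketch leaves out.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Every kind of mismatch (a dropped number, a changed date, a paraphrased clause) lowers the same scalar, so the score alone cannot say which kind of content was lost.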
Consider two models evaluated on the same summarisation task. Their ROUGE-L scores differ by less than one percentage point. Standard practice calls them equivalent and moves on.
But what if they are not equivalent? What if they preserve fundamentally different types of information, and the aggregate metric is averaging over those differences in a way that makes them invisible?
This is not a hypothetical. It is what I found.
What Coverage measures
The Coverage framework approaches evaluation differently. Rather than asking how similar the output is to a reference overall, it asks: for each semantic dimension you care about, how much of the source-side informational structure of that dimension survives transmission through the model?
The dimensions I tracked in the study were: named entities (Person, Location), numerical quantities, monetary figures, dates, and logical negation. These are not arbitrary — they represent types of information that matter for factual accuracy in news summarisation and that have clear, operationalisable definitions.
For each dimension $d$, the Coverage Ratio (or simply Coverage) estimates the mutual information between dimensional presence in the source document and dimensional preservation in the generated summary, normalized by the source-side entropy of that dimension:
$$\mathrm{CR}_d = \frac{I(X_d; Y_d)}{H(X_d)}$$
where
$X_d$ is a binary variable indicating whether dimension $d$ is present in the source,
$Y_d$ is a binary variable indicating whether it is preserved in the output,
$I(X_d; Y_d)$ is the mutual information between the source (input) and the target (output), which measures how strongly output-side preservation depends on source-side dimensional presence, and
$H(X_d)$ is the entropy of the source-side dimension.
The Coverage Ratio takes values in the range $[0, 1]$. A value near 1 indicates that preservation closely tracks the underlying dimensional structure of the source distribution, whereas values near 0 indicate weak or statistically independent preservation behavior. Intuitively, the metric asks: given the amount of informational variability present in the source dimension, how much of that variability remains recoverable after transmission through the channel?
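A minimal sketch of this computation, assuming a plain plug-in estimate from binary co-occurrence counts with no smoothing. The function names and the convention of returning 0.0 when the source entropy is zero are illustrative assumptions, not necessarily the exact code used in the study.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def coverage_ratio(source_present, output_preserved):
    """Mutual information between two binary indicator lists,
    divided by the entropy of the source-side indicator."""
    n = len(source_present)
    if n == 0 or n != len(output_preserved):
        raise ValueError("need two non-empty lists of equal length")

    joint = Counter(zip(source_present, output_preserved))
    px = Counter(source_present)
    py = Counter(output_preserved)

    h_x = entropy(c / n for c in px.values())
    if h_x == 0:
        # Assumed convention: the ratio is undefined when the source
        # dimension never varies, so report 0.0.
        return 0.0

    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi / h_x
```

With one entry per document, `coverage_ratio([1, 0, 1, 0], [1, 0, 1, 0])` describes an output indicator that perfectly tracks the source, while an output indicator that is statistically independent of the source gives a ratio of zero.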
This is worth unpacking carefully, because it is easy to misread what this measures. The Coverage Ratio is not a recall measure. It does not tell you what fraction of monetary entities were preserved.
Consider an example. Suppose a person entity occurs in 95% of the source documents in a corpus. Uncertainty about whether this dimension is present in any given document is low, so preserving it is a relatively easy task for a communication channel. Now suppose a person entity occurs in only 50% of source documents. Uncertainty about its presence in a particular document is then at its maximum, so preserving the dimension is a much harder task for the channel.
The entropy normalization matters. A dimension that appears in 95% of source documents has low entropy, i.e. there is not much uncertainty about whether it will be present. A model that always includes content of that type will score well on raw recall but the Coverage Ratio will be modest, because the task was not informationally demanding. A dimension that appears in 50% of documents has maximum entropy, i.e. genuine uncertainty about presence. High Coverage Ratio here indicates the model is faithfully tracking that uncertainty, which is harder and more meaningful.
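The two prevalence figures above can be checked directly with the binary entropy function. A small sketch (the helper name is illustrative):

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


if __name__ == "__main__":
    for prevalence in (0.95, 0.50):
        print(f"p = {prevalence:.2f}: H = {binary_entropy(prevalence):.3f} bits")
```

At 95% prevalence the source dimension carries roughly 0.29 bits of uncertainty per document, against 1 bit at 50%. This is the denominator that the Coverage Ratio divides by.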
The empirical results
I evaluated two publicly available models, BART-large-cnn and DistilBART-cnn-6-6, on 1,000 articles from the CNN/DailyMail test set. DistilBART is a knowledge-distilled version of BART, specifically designed to approximate BART's performance at lower computational cost, trained with the same CNN/DailyMail fine-tuning objective.
The aggregate ROUGE-L scores:
BART-large-cnn: 0.240
DistilBART-cnn-6-6: 0.232
A difference of 0.008, i.e. 0.8 percentage points. Standard benchmarking calls these systems equivalent.
The Coverage Ratio profiles tell a different story. Selected results for the plain prompt condition (no additional instructions):
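| Dimension | BART-large-cnn | DistilBART-cnn-6-6 |
|-----------|----------------|--------------------|
| Person    | 0.238          | 0.266              |
| Location  | 0.227          | 0.251              |
| Money     | 0.210          | 0.293              |
| Date      | 0.185          | 0.230              |
| Numerical | 0.273          | 0.306              |
| Negation  | 0.107          | 0.120              |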
DistilBART consistently achieves higher Coverage Ratio across all dimensions despite lower ROUGE-L. The gap on the Money dimension is particularly notable — DistilBART transmits substantially more of the source-side informational structure of monetary content than BART does.
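The per-dimension gaps can be read off with a few lines of Python. The dictionaries below simply transcribe the plain-prompt values reported above; the variable names are illustrative.

```python
# Plain-prompt Coverage Ratios as reported above.
bart = {"Person": 0.238, "Location": 0.227, "Money": 0.210,
        "Date": 0.185, "Numerical": 0.273, "Negation": 0.107}
distilbart = {"Person": 0.266, "Location": 0.251, "Money": 0.293,
              "Date": 0.230, "Numerical": 0.306, "Negation": 0.120}

# Gap per dimension (DistilBART minus BART), rounded to table precision.
gaps = {dim: round(distilbart[dim] - bart[dim], 3) for dim in bart}
largest = max(gaps, key=gaps.get)

if __name__ == "__main__":
    for dim, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{dim:<10} {gap:+.3f}")
    print("largest gap:", largest)
```

The gap is positive on every dimension and largest for Money, at 0.083.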
The Spearman rank correlation between ROUGE-L score and Coverage Ratio across all test configurations is negative for four of the six dimensions, with the Location, Money, Date, and Numerical correlations reaching statistical significance. Higher ROUGE-L is associated with lower dimensional Coverage (see figure 1). The models that score better on the standard metric are preserving less of the specific semantic content.
Figure 1: ROUGE-L scores versus Coverage Ratio for BART-large-cnn and DistilBART-cnn-6-6 models
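For readers who want to reproduce this kind of check on their own runs, here is a minimal sketch of the Spearman rank correlation using only the standard library. The input values in the demo are toy numbers chosen for illustration, not data from the study, and the sketch omits p-values; a library routine such as `scipy.stats.spearmanr` returns both the coefficient and a p-value.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy inputs only: one aggregate score and one dimensional score
    # per hypothetical test configuration.
    aggregate = [1, 2, 3, 4, 5]
    dimensional = [9, 7, 8, 4, 2]
    print(spearman(aggregate, dimensional))
```

A negative coefficient means that configurations ranking higher on the first score tend to rank lower on the second.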
This is not a sampling artifact. It reflects something real about how BART-large achieves its higher ROUGE-L score. BART produces more abstractive, paraphrased summaries that are stylistically closer to human-written references. This drives ROUGE-L up. But the paraphrasing process frequently omits or transforms specific factual content — numbers, dates, monetary figures. DistilBART's more extractive behaviour produces outputs that are stylistically less similar to references but factually more faithful. The aggregate metric systematically rewards the wrong model for the wrong reasons in any application where factual preservation matters.
Prompting reallocates rather than expands
The study also evaluated both models under three additional prompt conditions: a general coverage prompt instructing the model to preserve key information, a numbers-focused prompt, and a negation-focused prompt.
The results reveal something important about how prompting interacts with model capacity.
For BART-large, targeted prompting produces measurable Coverage improvements on specific dimensions. The numbers prompt significantly improves Numerical Coverage (p = 0.005). Person and Date dimensions improve under any prompt modification, including the negation prompt — suggesting these dimensions are correlated enough in news text that any attentional shift benefits them.
But the improvements come at a cost. When one dimension's Coverage increases, others tend to decrease. The total Coverage across dimensions does not expand meaningfully under any prompt condition. The model is operating under a fixed informational capacity, and prompting shifts how that capacity is allocated rather than increasing it.
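One way to make this bookkeeping explicit is to compare a prompted Coverage profile against the baseline and separate gains from losses. The sketch below is illustrative only: the function name is mine, and the numbers in the demo are hypothetical placeholders, not results from the study.

```python
def reallocation(baseline, prompted):
    """Per-dimension change in Coverage plus total gained, total lost,
    and the net change across all dimensions."""
    deltas = {dim: prompted[dim] - baseline[dim] for dim in baseline}
    gained = sum(d for d in deltas.values() if d > 0)
    lost = -sum(d for d in deltas.values() if d < 0)
    return deltas, gained, lost, gained - lost


if __name__ == "__main__":
    # Hypothetical placeholder profiles, for illustration only.
    baseline = {"Numerical": 0.30, "Date": 0.20, "Negation": 0.10}
    prompted = {"Numerical": 0.35, "Date": 0.17, "Negation": 0.09}
    deltas, gained, lost, net = reallocation(baseline, prompted)
    for dim, d in deltas.items():
        print(f"{dim:<10} {d:+.3f}")
    print(f"gained {gained:.3f}, lost {lost:.3f}, net {net:+.3f}")
```

A profile that has been reallocated rather than expanded shows up as a net change close to zero alongside non-trivial gained and lost totals.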
This has a direct practical implication. When you prompt a model to "preserve all numerical quantities," you are not making it a better model. You are making a tradeoff decision — trading Coverage on other dimensions for improved numerical Coverage. If you do not measure at the dimensional level, you will not see what you gave up.
The failure taxonomy
Coverage measurement tells you what failed. Diagnosing why it failed — and what would actually fix it — requires a different kind of analysis.
I have been developing a failure taxonomy derived from the formal components of the communication process. The goal is to produce a classification where each failure type has a distinct engineering remediation, not just a distinct name.
The most practically important distinctions:
Where in the pipeline did the failure occur? A failure can happen before the model receives the input, within the model's processing, or in the evaluation methodology itself. These require completely different responses. The inability of language models to count letters in a word is almost universally discussed as a model capability limitation. It is not. The tokeniser, which operates before the transformer, merges letter sequences into subword tokens. The model cannot count what it was never given. No improvement to the transformer will fix a tokeniser-level failure.
Why does the failure resist prompting? When prompting does not fix a failure, there are two fundamentally different underlying reasons. The first is architectural: the model lacks the computational substrate required for this type of operation, and this is not fixable through training or instruction. The second is a training gap: the architecture has the relevant capacity in principle but was not trained on sufficient relevant examples. These look identical from the outside — neither responds to prompting — but the engineering responses are completely different in cost and kind.
Is the apparent failure actually a measurement failure? A non-trivial fraction of apparent model failures are failures of the evaluation methodology. A model that outputs "Eight" when the evaluation expects "8" fails under string matching despite having correctly computed the answer. Before attributing a failure to the model, the measurement should be ruled out.
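The measurement case is the cheapest of the three to rule out. A minimal sketch, assuming answers are small non-negative integers that may be written as words or digits; the word list and function names are illustrative, and a real evaluation harness would need a broader normaliser.

```python
# Illustrative mapping only; extend as needed for a real evaluation.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10", "eleven": "11", "twelve": "12",
}


def normalise(answer):
    """Lower-case, trim whitespace and trailing full stops,
    and map number words to digits."""
    cleaned = answer.strip().rstrip(".").strip().lower()
    return NUMBER_WORDS.get(cleaned, cleaned)


def strict_match(prediction, expected):
    """Exact string comparison, as a naive evaluator would do."""
    return prediction == expected


def normalised_match(prediction, expected):
    """Comparison after normalising both sides."""
    return normalise(prediction) == normalise(expected)


if __name__ == "__main__":
    print(strict_match("Eight", "8"), normalised_match("Eight", "8"))
```

Here the strict comparison reports a failure while the normalised comparison reports a match, which is the signature of a measurement failure rather than a model failure.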
The negation dimension illustrates several of these points simultaneously. No prompt condition in the study produced a statistically significant improvement in negation Coverage for either model. The targeted negation prompt for BART produced a margin of +3 (33 wins vs 30 losses, p = 0.937). This is not a small-sample noise issue — the same tests produced highly significant improvements for entities and dates (p < 0.001). Negation is informationally orthogonal to all other semantic dimensions in news text. It belongs to a different computational substrate than the dimensions that respond to prompting. No surface instruction can redirect the model toward a substrate it cannot access through the linguistic channel.
What this is and is not
I want to be precise about the framework's scope and limitations.
Coverage measurement as implemented here is operationally simple — dimensional extractors (NER probes and regex patterns) applied to source and output texts, mutual information estimated from binary co-occurrence counts. The theoretical foundations are more involved, but the measurement methodology is accessible and reproducible with standard tools.
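As an illustration of what the regex side of such extractors can look like, here is a minimal sketch for three of the dimensions. The patterns are simplified assumptions of mine, not the ones used in the study, and the reading of "preserved" as "present in the source and also detected in the summary" is likewise an assumption. The Person and Location dimensions need an NER model and are left out.

```python
import re

# Simplified, hypothetical patterns for illustration only.
PATTERNS = {
    "money": re.compile(r"[$£€]\s?\d[\d,]*(?:\.\d+)?"),
    "numerical": re.compile(r"\b\d+(?:[.,]\d+)*\b"),
    "negation": re.compile(r"\b(?:not|no|never|none|nor)\b|n't\b", re.IGNORECASE),
}


def detect(text):
    """Return {dimension: True/False} for whether each pattern fires."""
    return {dim: bool(pat.search(text)) for dim, pat in PATTERNS.items()}


def indicators(source, summary, dimension):
    """Binary (source_present, output_preserved) pair for one document.
    Assumption: 'preserved' means present in the source AND in the summary."""
    present = detect(source)[dimension]
    preserved = present and detect(summary)[dimension]
    return int(present), int(preserved)


if __name__ == "__main__":
    source = "The company did not pay the $5 million fine in 2019."
    summary = "The company avoided the fine."
    for dim in PATTERNS:
        print(dim, indicators(source, summary, dim))
```

Collecting these pairs over a corpus gives the binary co-occurrence counts from which the mutual information is estimated.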
The results are specific to the task (abstractive summarisation), the models (BART-large-cnn and DistilBART-cnn-6-6), and the dimensions tracked. I am not claiming the Coverage Ratio is a universal evaluation metric. I am claiming it reveals structure that aggregate metrics conceal, and that this structure matters for how you select models, design prompts, and diagnose failures.
The failure taxonomy is more theoretical than empirical at this stage. The distinctions are grounded in the formal framework and supported by the empirical results, but a complete empirical validation across multiple tasks and architectures remains to be done.
There are also open theoretical questions I am not yet in a position to answer. How Coverage extends to tasks where the input and output are not in the same representational space — where you are measuring not whether entities were preserved but whether a computation was correctly performed — requires a more careful treatment of what preservation means for transformation tasks. I am working on this.
Why this might matter beyond evaluation
The framing I have used here is about evaluation methodology. But the underlying issue is more general.
Language models are channels. They receive information and produce information. The question of what information survives transmission — and what is systematically lost, distorted, or suppressed — is not just an evaluation question. It is a question about what these systems are actually doing when they process text.
The pattern I found — that higher stylistic similarity to human references correlates negatively with dimensional fidelity on specific factual content — is a specific instance of a more general phenomenon. Optimising for aggregate similarity to reference outputs is not the same as optimising for faithful information transmission. When we evaluate models primarily on aggregate metrics, we are selecting for the former while hoping for the latter.
I think this has implications beyond summarisation, but I want to be careful not to overgeneralise from a single study. The empirical work is the foundation. The broader theoretical claims need more support before I would want to lean on them heavily.
Papers and further discussion
The engineering paper with full methodology and results is available on request. I am preparing additional papers in the series covering the theoretical foundations and extensions to the failure taxonomy, time permitting, since I am doing this work in my free time.
I am genuinely interested in pushback on the methodology, the claims, or the framing. Particularly: whether the Coverage Ratio as defined here is measuring what I claim it measures, whether the failure taxonomy distinctions hold up under scrutiny, and whether there are existing frameworks in the evaluation literature that address these issues in ways I have missed.
Appendix
A short comparison of the BART and DistilBART models.
BART-large-cnn:
pre-trained by Facebook AI Research using a denoising objective on large text corpora and then fine-tuned specifically on the CNN/DailyMail summarisation dataset
a sequence-to-sequence transformer with a bidirectional encoder and an autoregressive decoder
has 12 encoder layers and 12 decoder layers; roughly double the neural network depth of DistilBART-cnn-6-6
406M parameters
DistilBART-cnn-6-6:
a "distilled" version of BART-large-cnn, created by taking the trained BART-large-cnn model and applying knowledge distillation. In the distillation process, a smaller student model is trained to reproduce the behaviour of the larger teacher model.
the distillation of DistilBART was performed using CNN/DailyMail data
has 6 encoder layers and 6 decoder layers; roughly half the neural network depth of BART-large-cnn
230M parameters
DistilBART has half the layers and a little over half the parameters of BART-large. It is meant to approximate BART-large's behavior while being substantially leaner and therefore cheaper and faster to run.
My experiments show that the two models behave differently, and that standard evaluation metrics such as ROUGE-L do not adequately capture this difference.