For the past few weeks I have been preoccupied with an idea that kept nagging at me: I believe we are evaluating language models the wrong way. I am still working on a paper to make the full case, but the claim is already more than a hypothesis: the preliminary data I have gathered supports it.
Consider how language models are evaluated today. A common approach is to define one or more scalar metrics such as:
ROUGE-L measuring longest-common-subsequence overlap between generated and reference summaries,
BLEU measuring n-gram overlap between generated and reference text, or
F1 score for classification tasks.
The issue with all of these metrics is that, on closer inspection, they aggregate across all semantic dimensions simultaneously, which means they conceal important information.
My thinking is rooted in a novel theory I have been developing, which I'll call Coverage Theory. The basic idea is simple: imagine two systems (two nodes in a network, two human beings, or the encoder and decoder of a language model) communicating with each other over a channel. The crucial question Coverage Theory tries to address is: how much of the information that is theoretically available in the source is preserved across the channel?
Mutual Information, Entropy, Coverage Ratio
To answer this question I rely on Shannon's information theory, in particular on the two concepts of mutual information $I(X;Y)$ and entropy $H(X)$.
Mutual information $I(X;Y)$ measures how strongly output-side preservation in $Y$ depends on source-side dimensional presence in $X$.
Entropy $H(X)$ measures the variability of information in the source domain.
The ratio
$$\frac{I(X;Y)}{H(X)}$$
was described by Shannon & Weaver in 1948. The same metric is also frequently associated with Theil (1970), who called it the uncertainty coefficient. In this context, I take the liberty of referring to it as the coverage ratio.
The coverage ratio takes values in the range $[0, 1]$. It asks: given the amount of informational variability in a source dimension $X$, how much of that variability remains recoverable after transmission through the channel?
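To make the definition concrete, here is a minimal sketch of how a ratio of mutual information to source entropy could be estimated from data. It assumes, purely for illustration, that each document is reduced to a pair of binary indicators (1 if the dimension is present in the source, 1 if it is preserved in the output); the function names and the toy data are hypothetical and not taken from the actual experiment.

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy in bits of a sequence of discrete labels."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) in bits from paired samples, via H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def coverage_ratio(source, output):
    """I(X;Y) / H(X); defined here as 0.0 when the source has no variability."""
    h = entropy(source)
    if h == 0:
        return 0.0
    return mutual_information(source, output) / h

# Hypothetical indicators: 1 = dimension present, 0 = absent.
source = [1, 1, 0, 0, 1, 0, 1, 0]
perfect = list(source)             # output mirrors the source exactly
constant = [1] * len(source)       # output always contains the dimension
noisy = [1, 0, 0, 0, 1, 1, 1, 0]   # output agrees with the source only partly

for name, out in [("perfect", perfect), ("constant", constant), ("noisy", noisy)]:
    print(name, round(coverage_ratio(source, out), 3))
```

An output that mirrors the source reaches the upper bound of 1, an output that ignores the source scores 0, and a partially faithful output lands in between. Returning 0.0 for a zero-entropy source is a design choice of this sketch; the ratio itself is undefined in that case.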
Illustration. Assume that in a corpus of source documents almost every document contains an instance of a person entity. The uncertainty about whether any given document contains a person entity is low. A language model is asked to summarise faithfully, in the sense that the summary must contain the person entity as-is. The model easily achieves a high summarisation score. Now assume a different corpus of source documents in which only half of the documents contain a person entity. Intuitively, faithfully preserving information about person entities is a much harder task on this corpus.
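The difference between the two corpora can be quantified with the entropy of a binary presence indicator. The sketch below assumes, as a stand-in for "almost every document", that 95% of documents contain a person entity; that figure is an illustrative assumption, not a measured value.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no variable that is 'yes' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Assumed presence rates for the two corpora in the illustration.
h_common = binary_entropy(0.95)   # person entity in almost every document
h_balanced = binary_entropy(0.5)  # person entity in half of the documents

print(f"H(0.95) = {h_common:.3f} bits, H(0.50) = {h_balanced:.3f} bits")
```

Under this assumption the first corpus carries roughly 0.29 bits of uncertainty per document about person presence, while the second carries a full bit, so there is more than three times as much information to preserve in the second case.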
This metric lets us measure different dimensions of information present in the source, compare them with the target, and calculate how much of the available information is preserved across the channel. In principle, the coverage ratio should be able to reveal information that is concealed by scalar metrics like ROUGE-L.
The empirical results
To test my hypothesis I set up an experiment. I selected two well-known language models: BART-large-cnn (406M parameters, 12 encoder and 12 decoder layers) and DistilBART-cnn-6-6 (230M parameters, 6 encoder and 6 decoder layers). DistilBART is a "distilled" model derived directly from BART, so it is both smaller and computationally cheaper and faster than BART. Both models are trained on the same CNN/DailyMail dataset. Importantly, both models achieve almost the same ROUGE-L score on standard summarisation tests.
I asked both models to create summaries for 1000 CNN/DailyMail articles. I then evaluated how faithfully each model preserved different semantic entities in its summaries: persons, locations, money amounts, dates & times, generic numerical values, and negations. For the evaluation, I aimed at simplicity rather than sophistication. For the numerical and negation dimensions I used regular expressions to compare the source text with the target summary; for the other entity dimensions I relied on the Python spaCy library.
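As a rough sketch of what the regular-expression side of such an evaluation could look like, the snippet below detects numbers and negation cues and turns one source/summary pair into presence and preservation indicators. The patterns, the helper names, and the rule "preserved means at least one source number reappears in the summary" are assumptions made for illustration; they are not the exact patterns or rules used in the experiment, and the spaCy-based entity dimensions are not shown.

```python
import re

# Hypothetical patterns: digits with optional decimal or thousands separators,
# and a small list of common English negation cues.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
NEGATION_RE = re.compile(
    r"\b(?:not|no|never|none|nobody|nothing|neither|nor|without)\b|n't\b",
    re.IGNORECASE,
)

def numbers_in(text):
    """Set of number strings found in the text."""
    return set(NUMBER_RE.findall(text))

def has_negation(text):
    """True if the text contains at least one negation cue."""
    return NEGATION_RE.search(text) is not None

def number_indicators(source, summary):
    """(x, y): x = source contains a number, y = summary repeats one of them."""
    src = numbers_in(source)
    x = int(bool(src))
    y = int(bool(src & numbers_in(summary)))
    return x, y

source = "The company did not report profits of 3.5 million in 2020."
summary = "Profits of 3.5 million were not reported."
print(numbers_in(source), number_indicators(source, summary), has_negation(summary))
```

Collecting such indicator pairs over all articles gives the paired samples from which a mutual-information-based ratio can be estimated.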
I ran the summarisation task for both models on the 1000 articles with 4 distinct prompts, scoring all 6 dimensions / entities each time:
"plain" / no prompt: without any specific prompt,
coverage prompt: with a prompt to preserve generic coverage over all entities,
numbers prompt: with a prompt to preserve numerical values,
negation prompt: with a prompt to preserve negations.
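The resulting experimental grid can be enumerated in a few lines. The model and prompt names follow the text; the short dimension keys are labels chosen for this sketch only.

```python
from itertools import product

MODELS = ["BART-large-cnn", "DistilBART-cnn-6-6"]
PROMPTS = ["plain", "coverage", "numbers", "negation"]
# Hypothetical short keys for the six dimensions named in the text.
DIMENSIONS = ["person", "location", "money", "date_time", "number", "negation"]

# One summarisation run per (model, prompt); each run is scored on every dimension.
runs = list(product(MODELS, PROMPTS))
cells = list(product(MODELS, PROMPTS, DIMENSIONS))

print(len(runs), "summarisation runs,", len(cells), "coverage ratios")
```

This makes the size of the comparison explicit: 8 runs over the same articles, yielding 48 coverage ratios alongside the ROUGE-L scores.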
I calculated both ROUGE-L and the coverage ratio for each combination of dimension / entity and prompt. Figure 1 shows the results.
Figure 1: ROUGE-L vs Coverage Ratio pairs for each Coverage dimension.
We need to examine both scores separately.
ROUGE-L: BART consistently achieves a higher ROUGE-L score than DistilBART, although the differences are comparatively small and, for many real-world scenarios, practically negligible. This is to be expected from a larger, more powerful and computationally heavier model.
Coverage Ratio: Surprisingly, though, DistilBART scored higher on the coverage ratio for 4 out of 6 dimensions. It generally preserved locations, money amounts, dates & times, and numerical values better than its more powerful parent! For example, the source information available on money amounts was preserved better by DistilBART than by BART by 8 percentage points. Only for the person and negation dimensions was the effect not obvious. As a check, I calculated Spearman correlations between the ROUGE-L scores and the dimension-specific coverage ratios, which supported the finding at the 95% confidence level.
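For readers unfamiliar with it, the Spearman correlation is the Pearson correlation computed on ranks, so it captures whether two scores move together monotonically. The sketch below is a dependency-free illustration with made-up numbers; it does not reproduce the values from the experiment and does not compute a significance level.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical scores for a handful of configurations (not real results).
rouge_l = [0.30, 0.31, 0.29, 0.32, 0.28]
coverage = [0.12, 0.10, 0.15, 0.09, 0.16]
print(round(spearman(rouge_l, coverage), 3))
```

A value near -1 would mean that configurations with higher ROUGE-L tend to have lower coverage, a value near +1 the opposite, and a value near 0 that the two metrics rank the configurations largely independently.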
The implications are intriguing: both models seem nearly equivalent when measured by the scalar ROUGE-L score, yet they behave differently when measured by the coverage ratio of each distinct entity. For four of the six chosen entities the smaller model beats the larger one; for the remaining two no clear winner can be determined.
But that's not all.
Prompting reallocates rather than expands channel capacity
Further investigations revealed that prompting may produce measurable coverage improvements on specific dimensions.
For example, for BART-large-cnn the explicit prompt to preserve numerical values significantly improved numerical coverage (p = 0.005). Person and date dimensions improved under any prompt modification, including the negation prompt. This suggests these dimensions are correlated enough in news text that any attentional shift benefits them.
However, the improvements come at a cost. While some prompts improved not only their own dimension but also others, other prompts had a more varied effect, improving certain dimensions while degrading others. There were also cases where a prompt had no apparent effect on its own dedicated dimension at all.
In a nutshell, in my experiments prompting seemed to reallocate channel capacity in non-obvious ways rather than expand it. This is a crucial finding, as it implies that prompting incurs trade-offs in how dimensions are preserved between source and target. It also makes sense intuitively, given that the underlying model size, and thus the channel capacity, is not fundamentally affected by prompting.
What's next?
I have left out quite a few other interesting findings that I hope to publish in a scientific paper soon. (As an independent researcher, both the time and the resources I can afford to spend on this are limited.) The Coverage Theory perspective, i.e. interpreting language models as communication channels that encode, transform and decode information from a source domain to a target domain, has led to many insights I have not seen described anywhere else yet.
Obviously, the results are not firmly established yet, and more research is required to confirm them. In particular, I have only investigated summarisation tasks, and there are many other types of tasks left open, some of which are a lot harder to assess.
I am interested in feedback, positive and negative, from the community on the methodology, the claims, or the framing. In particular, I would like to know whether you think the coverage ratio is actually measuring what I think it is. My impression is that I have only scratched the surface, and that there is a lot more to uncover here.