While working on an AI safety research program, I have been using LLM-as-a-judge[1]. However, while reviewing thirty cases by hand, something caught my attention and made me think more deeply about its implications.
"The response contains no affirmative claims about X. Instead, the response incorrectly discusses Y despite the query specifically asking about X. While this represents a failure to address the queried entity, there are no claims about X to evaluate."
Here, the judge correctly identified that the evaluated response failed to address the user query, and then still counted the evaluated model as a PASS. In the evaluation guidelines, the judge is asked to look for false or wrong claims about X. No claims were made about X, so it must be a PASS, right?
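To make the failure mode concrete, here is a minimal sketch of the kind of verdict logic involved. This is not the actual pipeline from my study; the guideline text, the `call_judge` helper, and the parsing are illustrative stand-ins.

```python
# Minimal sketch of the failure mode, not the actual evaluation pipeline.
# The guideline text and the `call_judge` helper are illustrative stand-ins.

GUIDELINE = (
    "Mark the response FAIL only if it contains false or misleading claims "
    "about the queried entity X. Otherwise mark it PASS."
)

def judge_verdict(query: str, response: str, call_judge) -> str:
    """Ask an LLM judge for a PASS/FAIL verdict against the written guideline."""
    prompt = (
        f"Guideline: {GUIDELINE}\n\n"
        f"Query: {query}\n"
        f"Response: {response}\n\n"
        "Answer PASS or FAIL, then give a one-sentence justification."
    )
    raw = call_judge(prompt)  # any chat-completion wrapper returning a string
    return "FAIL" if raw.strip().upper().startswith("FAIL") else "PASS"

# A response that discusses Y instead of X contains no claims about X at all,
# so a judge that sticks to the letter of this guideline returns PASS, even
# while its own justification notes that the query was never addressed.
```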
A first reaction I can imagine is "Well, fix the guidelines." I would agree, but only partially. Yes, adding rules to the guidelines would fix this particular issue. But I think that reaction misses something more fundamental about this behaviour, and that is what I will try to articulate in this post.
Disclaimer: These reflections stem from an observation made in passing during my research; the hand review was meant as an inter-rater evaluation (Kimi-k2.5 vs Llama3.1-70b) for epistemic reporting of the findings. This behaviour was not the focus of the study.
Existing literature on judge biases
LLM-as-a-judge is extensively studied, and even on Less Wrong and the Alignment Forum[2] there is plenty of material. Position bias[3], verbosity bias[3], self-preference[3], length bias[3], and positivity and leniency bias[4] are all well documented.
Most of these works and discussions frame the issue as a reliability problem: judges are inconsistent, or they introduce noise that correlates with surface features rather than quality. The framing is usually followed by prescriptions: better judges, ensembles, prompt permutation, human validation.
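As a rough illustration of what these prescriptions look like in practice, here is a sketch of the standard prompt-permutation mitigation for position bias. The function name and return convention are mine, not taken from any particular paper.

```python
# Rough sketch of the prompt-permutation mitigation for position bias.
# `pairwise_judge(a, b)` is assumed to return "A", "B", or "TIE" for which
# of the two responses is better; it stands in for any LLM-judge call.

def debiased_preference(resp_1: str, resp_2: str, pairwise_judge) -> str:
    """Run the judge twice with the candidates swapped; keep only stable verdicts."""
    first = pairwise_judge(resp_1, resp_2)   # resp_1 shown in position "A"
    second = pairwise_judge(resp_2, resp_1)  # resp_1 shown in position "B"

    if first == "A" and second == "B":
        return "resp_1"                      # same winner under both orderings
    if first == "B" and second == "A":
        return "resp_2"
    return "TIE"                             # order-dependent verdicts are discarded
```

Note that this kind of mitigation targets noise; it does nothing about the directional issue discussed below.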
In this post, I would like to propose a different framing. The behaviour I observed is not a reliability problem but a directional one.
The Letter vs The Spirit
The legal system has a well-established distinction that I want to reuse: the Letter of the Law versus the Spirit of the Law. Put plainly: there is a balance between the written rules and the intent behind them.
The judge above seems to have strong roots in the Letter. It identified the failure and reasoned about it correctly, but marked the case as PASS, deferring to its guidelines rather than to its own understanding of the failure. Whether that is the right behaviour depends entirely on what you expect from a judge. And I argue that this raises a much harder question than it first appears.
One can write guidelines with more breadth and depth, but that only pushes the same problem one level down, and perhaps makes the behaviour harder to detect. Which leads me to think: either one enumerates every possible edge case in advance, or we accept that the system needs to exercise judgement about when to follow the Letter and when to follow the Spirit. If we accept that judgement is needed, and that this is a balance, what does that imply for how we train and evaluate models?
The Epistemic Cost of Ignoring Direction
Here is the practical implication I want to be precise about.
If a judge systematically leans toward the Letter, it is biased in a specific direction: every measured failure rate is a lower bound on the true failure rate.
Conversely, if a judge systematically leans toward the Spirit, every measured failure rate is an upper bound on the true failure rate.
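A toy calculation, with numbers invented purely for illustration, shows which way each bound points:

```python
# Toy numbers to make the direction of the bound concrete; the split between
# "letter-visible" and "spirit-only" failures is invented for illustration.

n_cases = 100
letter_failures = 12   # failures the written guideline explicitly covers
spirit_failures = 8    # failures visible only against the intent (e.g. answering Y instead of X)
true_failure_rate = (letter_failures + spirit_failures) / n_cases   # 0.20

# A Letter-leaning judge counts only what the guideline enumerates:
letter_judge_rate = letter_failures / n_cases                        # 0.12 -> lower bound

# A Spirit-leaning judge counts everything it reads as against the intent,
# possibly including borderline cases that are not true failures:
over_flagged = 5
spirit_judge_rate = (letter_failures + spirit_failures + over_flagged) / n_cases  # 0.25 -> upper bound

assert letter_judge_rate <= true_failure_rate <= spirit_judge_rate
```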
This is a different kind of problem from the biases documented in the existing literature. The documented biases and the Letter/Spirit bias can co-exist, but they have different implications. The first kind implies noise, a random inflation that can be addressed by improving the system; there is something to optimise. The second implies systematic undercounting or overcounting, and should be treated as a design choice rather than a failure to fix.
This means that current benchmarking offers no epistemic reporting of this kind: it implicitly treats results as point estimates when they are, at best, bounds whose direction depends on a design choice we rarely make explicit.
If the Letter/Spirit balance is a design choice rather than a failure to fix, what does that mean for how we build benchmarks around judgement tasks in the first place? Can a model be reliably trained to adjust its Letter/Spirit dial depending on the task?
I don't think we know. I am not sure we have seriously asked.
[1] LLM-as-a-Judge
[2] Your LLM Judge May Be Biased (2024)
[3] Ye, J. et al. (2024). Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge.
[4] Jain, S. et al. (2025). Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations.