Getting LLMs to be deterministic when scoring the quality of qualitative texts is hard.
If you ask ChatGPT to evaluate the same poem multiple times, you’ll get inconsistent responses. I’ve been thinking about whether there are ways to make LLM grading more consistent.
When we had a temperature knob (in the GPT-3 Playground, for example), it was easier to control variance, but at the cost of worse outputs.
We can take a hint from specific domains. A bunch of emerging startups have noticed that you can make LLM grading more consistent in narrow domains (e.g., how feasible a medical experiment is, how compelling an essay is) by manually defining specific criteria and then having the LLM score each one. Even if individual criterion scores are variable, the average of many scores varies less.
This suggests an approach for building a more consistent grader for any target object:
1. Have the LLM devise a dozen or two criteria to evaluate the target. Hold this set constant across everything you grade.
2. Have the LLM provide a 1–10 score for each preset criterion (ideally in separate calls).
3. Average the scores.
The resulting grade should be more consistent than a one-shot score.
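Here's a minimal sketch of what that could look like, assuming the OpenAI Python client; the model name, criteria, and prompt wording are placeholders, and any chat-completion API would work the same way:

```python
# A minimal sketch of the multi-criterion grader described above.
# Assumes the OpenAI Python client; model name, criteria, and prompt
# wording are illustrative placeholders.
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

# Step 1: a fixed set of criteria, held constant across everything you grade.
CRITERIA = [
    "The first sentence is strong.",
    "The middle has emotional thrust.",
    "The conclusion lands.",
    "The rhythm balances surprise and consistency.",
    "The content feels specific enough to come from lived experience.",
]


def score_criterion(poem: str, criterion: str) -> int:
    """Step 2: ask for a single 1-10 score on one preset criterion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Rate this poem on the following criterion: {criterion}\n"
                "Reply with a single integer from 1 to 10.\n\n"
                f"{poem}"
            ),
        }],
    )
    # Pull the first integer out of the reply; fall back to the midpoint.
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 5


def grade(poem: str) -> float:
    """Step 3: average the per-criterion scores into one grade."""
    return mean(score_criterion(poem, c) for c in CRITERIA)
```

Each criterion gets its own call, per step 2; if cost is a concern, you could instead batch all the criteria into one prompt and parse out a list of scores.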
A cool corollary is that the quality of the chosen criteria doesn’t matter much for consistency. If you’re trying to get LLMs to grade poems more consistently, you don’t need a perfect, comprehensive, non-overlapping set of criteria. You can use relevant but somewhat arbitrary ones (e.g., the first sentence is strong, the middle has emotional thrust, the conclusion lands, the rhythm balances surprise and consistency, the content feels specific enough to come from lived experience).
The quality of the criteria affects the accuracy of the ratings, but it has little effect on their precision or consistency. Averaging across many criteria will almost always be more consistent than scoring in one shot, because the run-to-run noise in the individual criterion scores partly cancels out.
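To make that concrete, here's a toy simulation (no LLM calls; it just assumes each criterion score is an independent noisy draw around a true value) showing how much less the averaged grade spreads than a one-shot score:

```python
# Toy illustration of why averaging reduces variance. Assumes each
# criterion score is an independent noisy draw around a true value,
# which is optimistic but shows the effect.
import random
from statistics import mean, pstdev

random.seed(0)
TRUE_SCORE = 7
N_CRITERIA = 20
N_RUNS = 1000

one_shot = [TRUE_SCORE + random.gauss(0, 1.5) for _ in range(N_RUNS)]
averaged = [
    mean(TRUE_SCORE + random.gauss(0, 1.5) for _ in range(N_CRITERIA))
    for _ in range(N_RUNS)
]

print(f"one-shot spread: {pstdev(one_shot):.2f}")   # ~1.5
print(f"averaged spread: {pstdev(averaged):.2f}")   # ~1.5 / sqrt(20) ≈ 0.34
```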