Getting LLMs to be deterministic when scoring the quality of qualitative texts is hard.
If you ask ChatGPT to evaluate the same poem multiple times, you’ll get inconsistent responses. I’ve been thinking about whether there are ways to make LLM grading more consistent.
When we had a temperature knob (in the GPT-3 Playground, for example), it was easier to control variance, but at the cost of worse outputs.
We can take a hint from specific domains. A bunch of emerging startups have noticed that you can make LLM grading more consistent in narrow domains (e.g., how feasible a medical experiment is, how compelling an essay is) by manually defining specific criteria and then having the LLM score each one. Even if individual criterion scores are variable, the average of many scores varies less.
This suggests an approach for building a more consistent grader for any target object:
1. Have the LLM devise a dozen or two criteria to evaluate the target. Hold this set constant across everything you grade.
2. Have the LLM provide a 1–10 score for each preset criterion (ideally in separate calls).
3. Average the scores.
The resulting grade should be more consistent than a one-shot score.
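Here's a minimal sketch of what that could look like, assuming the OpenAI Python client; the model name, criteria, and prompt wording are placeholders, and any chat-completion API would work the same way:

```python
# A minimal sketch of the multi-criterion grader described above.
# Assumes the OpenAI Python client; model name, criteria, and prompt
# wording are illustrative placeholders.
import re
from statistics import mean

from openai import OpenAI

client = OpenAI()

# Step 1: a fixed set of criteria, held constant across everything you grade.
CRITERIA = [
    "The first sentence is strong.",
    "The middle has emotional thrust.",
    "The conclusion lands.",
    "The rhythm balances surprise and consistency.",
    "The content feels specific enough to come from lived experience.",
]


def score_criterion(poem: str, criterion: str) -> int:
    """Step 2: ask for a single 1-10 score on one preset criterion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                f"Rate this poem on the following criterion: {criterion}\n"
                "Reply with a single integer from 1 to 10.\n\n"
                f"{poem}"
            ),
        }],
    )
    # Pull the first integer out of the reply; fall back to the midpoint.
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 5


def grade(poem: str) -> float:
    """Step 3: average the per-criterion scores into one grade."""
    return mean(score_criterion(poem, c) for c in CRITERIA)
```

Each criterion gets its own call, per step 2; if cost is a concern, you could instead batch all the criteria into one prompt and parse out a list of scores.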
A cool corollary is that the quality of the chosen criteria doesn’t matter much for consistency. If you’re trying to get LLMs to grade poems more consistently, you don’t need a perfect, comprehensive, non-overlapping set of criteria. You can use relevant but somewhat arbitrary ones (e.g., the first sentence is strong, the middle has emotional thrust, the conclusion lands, the rhythm balances surprise and consistency, the content feels specific enough to come from lived experience).
The quality of the criteria affects the accuracy of the ratings, but it has little effect on their precision or consistency. Averaging across many criteria will almost always be more consistent than scoring in one shot, because the run-to-run noise in the individual criterion scores partly cancels out.
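To make that concrete, here's a toy simulation (no LLM calls; it just assumes each criterion score is an independent noisy draw around a true value) showing how much less the averaged grade spreads than a one-shot score:

```python
# Toy illustration of why averaging reduces variance. Assumes each
# criterion score is an independent noisy draw around a true value,
# which is optimistic but shows the effect.
import random
from statistics import mean, pstdev

random.seed(0)
TRUE_SCORE = 7
N_CRITERIA = 20
N_RUNS = 1000

one_shot = [TRUE_SCORE + random.gauss(0, 1.5) for _ in range(N_RUNS)]
averaged = [
    mean(TRUE_SCORE + random.gauss(0, 1.5) for _ in range(N_CRITERIA))
    for _ in range(N_RUNS)
]

print(f"one-shot spread: {pstdev(one_shot):.2f}")   # ~1.5
print(f"averaged spread: {pstdev(averaged):.2f}")   # ~1.5 / sqrt(20) ≈ 0.34
```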