Your LLM Judge may be biased

Rachel Freedman

For people who do test prep seriously (I used to be a full-time tutor), this has been known for decades. One of the standard things I used to tell every student was: if you have no idea what the answer is, guess B, because B is statistically most likely to be the correct answer. When I was in 10th grade (this was 2002), I didn't have anything to gain by doing well on the state standardized math test, so I tested the theory that B is most likely to be correct. In fact, 38% of the answers on that test were B.

> This is pretty weird. As far as we know, humans don’t tend to prefer choices labeled B, so we’re not sure where this could have come from in the training data. As humans, it initially didn’t even occur to us to look for it!

Remember, LLMs aren't modeling how a human reading text would process the text. LLMs are trying to model the patterns in the texts that are in the training data itself. In this case, that means they are doing something closer to imitating test writers than test takers. And it is well known that humans, including those who write tests, are bad at being random.

[-]Rachel Freedman2y10

This is so interesting. I had no idea that this was a thing! I would have assumed that test-writers wrote all of the answers out, then used a (pseudo-)randomizer to order them. But if that really is a pattern in multiple choice tests, it makes absolute sense that Llama would pick up on it.

[-]Arjun Panickssery2y20

See "Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions" (Pezeshkpour and Hruschka, 2023):

Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order, posing challenges to fair assessment of these models. As these models become more powerful, it becomes imperative to understand and address these limitations. In this paper, we focus on LLMs robustness on the task of multiple-choice questions -- commonly adopted task to study reasoning and fact-retrieving capability of LLMs. Investigating the sensitivity of LLMs towards the order of options in multiple-choice questions, we demonstrate a considerable performance gap of approximately 13% to 75% in LLMs on different benchmarks, when answer options are reordered, even when using demonstrations in a few-shot setting. Through a detailed analysis, we conjecture that this sensitivity arises when LLMs are uncertain about the prediction between the top-2/3 choices, and specific options placements may favor certain prediction between those top choices depending on the question caused by positional bias. We also identify patterns in top-2 choices that amplify or mitigate the model's bias toward option placement. We found that for amplifying bias, the optimal strategy involves positioning the top two choices as the first and last options. Conversely, to mitigate bias, we recommend placing these choices among the adjacent options. To validate our conjecture, we conduct various experiments and adopt two approaches to calibrate LLMs' predictions, leading to up to 8 percentage points improvement across different models and benchmarks.

Also "Benchmarking Cognitive Biases in Large Language Models as Evaluators" (Koo et al., 2023):

Order Bias is an evaluation bias we observe when a model tends to favor the model based on the order of the responses rather than their content quality. Order bias has been extensively studied (Jung et al., 2019; Wang et al., 2023a; Zheng et al., 2023), and it is well-known that state-of-the-art models are still often influenced by the ordering of the responses in their evaluations. To verify the existence of order bias, we prompt both orderings of each pair and count the evaluation as a “first order” or “last order” bias if the evaluator chooses the first ordered (or last ordered) output in both arrangements respectively.
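A minimal sketch of the both-orderings check described in that passage. The `judge` callable is a hypothetical stand-in for whatever evaluator is being tested (it is not an API from the paper): it takes the two responses in display order and returns 0 if it prefers the first-listed one and 1 if it prefers the second.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluator: (first_shown, second_shown) -> 0 or 1 (index of the preferred slot).
Judge = Callable[[str, str], int]

def classify_pair(judge: Judge, a: str, b: str) -> str:
    """Query both orderings of one pair and label the outcome."""
    pick_ab = judge(a, b)  # a is shown first
    pick_ba = judge(b, a)  # b is shown first
    if pick_ab == 0 and pick_ba == 0:
        return "first_order_bias"  # chose whatever was listed first, both times
    if pick_ab == 1 and pick_ba == 1:
        return "last_order_bias"   # chose whatever was listed last, both times
    return "consistent"            # same underlying response preferred in both orderings

def order_bias_rates(judge: Judge, pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of pairs falling into each category."""
    if not pairs:
        raise ValueError("need at least one pair")
    counts = Counter(classify_pair(judge, a, b) for a, b in pairs)
    keys = ("first_order_bias", "last_order_bias", "consistent")
    return {k: counts[k] / len(pairs) for k in keys}
```

A judge that scores content alone should land almost entirely in the "consistent" bucket; a large first or last rate suggests the slot is driving the choice.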

[-]George Ingebretsen2y10

Also "Large Language Models Are Not Robust Multiple Choice Selectors" (Zheng et al., 2023)

This work shows that modern LLMs are vulnerable to option position changes in MCQs due to their inherent "selection bias", namely, they prefer to select specific option IDs as answers (like "Option A"). Through extensive empirical analyses with 20 LLMs on three benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs' token bias, where the model a priori assigns more probabilistic mass to specific option ID tokens (e.g., A/B/C/D) when predicting answers from the option IDs. To mitigate selection bias, we propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution. PriDe first estimates the prior by permutating option contents on a small number of test samples, and then applies the estimated prior to debias the remaining samples. We demonstrate that it achieves interpretable and transferable debiasing with high computational efficiency. We hope this work can draw broader research attention to the bias and robustness of modern LLMs.
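To make the idea in that abstract concrete, here is a simplified sketch in the spirit of PriDe, not the paper's exact procedure. It assumes the observed probability for each option ID is proportional to (prior for that ID) × (preference for the content placed there), and that we have the model's probabilities over option IDs for every cyclic permutation of the option contents. The function names are illustrative.

```python
import math
from typing import List

def normalize(weights: List[float]) -> List[float]:
    total = sum(weights)
    return [w / total for w in weights]

def estimate_prior(prob_rows: List[List[float]]) -> List[float]:
    """Estimate the option-ID prior from one question.

    prob_rows[k][i] is the observed probability of option ID i under the
    k-th cyclic permutation of option contents. All entries must be > 0.
    Averaging log-probabilities over permutations cancels the content term,
    leaving something proportional to the prior for each ID.
    """
    n_ids = len(prob_rows[0])
    mean_logs = [
        sum(math.log(row[i]) for row in prob_rows) / len(prob_rows)
        for i in range(n_ids)
    ]
    return normalize([math.exp(v) for v in mean_logs])

def debias(observed: List[float], prior: List[float]) -> List[float]:
    """Divide out the estimated prior and renormalize."""
    return normalize([p / q for p, q in zip(observed, prior)])
```

The appeal is that the prior can be estimated on a few questions and then reused with `debias` on the rest, so most questions need only a single forward pass.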

[-][anonymous]2y20

So the theory is that when you ask this question, you're activating 2 major pathways:

1. Positive or Negative sentiment
2. Multiple-choice test Q&A

And apparently the training data has a bias towards choice B being the most likely completion.

So the probability of the next token is affected by both pathways (and thousands of others) in superposition.

RLHF by humans won't catch "B bias".

Followup experiment: Can the model see its own bias? What happens if you automatically add a question that re-asks whether the model believes the movie review is positive or negative, and whether the model would like to change its answer?

As in:

Human: Do you think that "{snippet}” is negative or positive sentiment? 
Choices: 
(P) Positive
(N) Negative

Assistant: I believe the best answer is: (
<completion>
Human: Please carefully consider if the above snippet was positive or negative, and then double check that you picked the answer corresponding to positive or negative.

(This is not an engineered prompt; in theory you would need to try a few hundred variants to squeeze the last bit of performance out of the model.)

Does it usually change to the answer it would have given when you changed the prompt to

(P) Positive
(N) Negative
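A rough sketch of how the two-turn experiment above could be wired up. `complete` is a hypothetical callable that sends a prompt string to the model and returns its text completion; re-using the "Assistant: I believe the best answer is: (" prefix for the second turn is my assumption, since the comment does not say how the second answer is elicited.

```python
from typing import Callable, Tuple

FIRST_TURN = (
    'Human: Do you think that "{snippet}" is negative or positive sentiment?\n'
    "Choices:\n"
    "(P) Positive\n"
    "(N) Negative\n\n"
    "Assistant: I believe the best answer is: ("
)

RECONSIDER = (
    "Human: Please carefully consider if the above snippet was positive or "
    "negative, and then double check that you picked the answer corresponding "
    "to positive or negative.\n\n"
    "Assistant: I believe the best answer is: ("  # assumed second-turn prefix
)

def first_letter(completion: str) -> str:
    """Extract the option letter from a completion such as 'P) Positive'."""
    return completion.strip()[:1].upper()

def ask_twice(complete: Callable[[str], str], snippet: str) -> Tuple[str, str]:
    """Return (initial answer, answer after being asked to reconsider)."""
    prompt = FIRST_TURN.format(snippet=snippet)
    first = complete(prompt)
    second = complete(prompt + first + "\n" + RECONSIDER)
    return first_letter(first), first_letter(second)
```

Counting how often the two letters differ, and whether the second one matches the label, would answer the question posed here.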

Where is this going? Since you're using Llama 7B, when the model changes the answer, can you:

1. Sample the changed answer (run the prompt a few hundred times and get the distribution)
2. Determine the "ground truth" from what the model says most often in the changed distribution
3. Fine-tune with a Llama training script to produce the "ground truth" more often

That would allow you to do a crude form of online learning and remove this bias.
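The sampling and majority-vote steps above could look something like this. The `min_share` threshold is my own addition (not part of the proposal), meant to avoid training on pseudo-labels when the sampled answers are close to evenly split.

```python
from collections import Counter
from typing import List, Optional, Tuple

def pseudo_label(samples: List[str], min_share: float = 0.6) -> Tuple[Optional[str], float]:
    """Pick the most frequent sampled answer as a stand-in 'ground truth'.

    Returns (label, share), where share is the fraction of samples that
    agree with the winner. The label is None when share < min_share.
    """
    if not samples:
        raise ValueError("need at least one sample")
    label, count = Counter(samples).most_common(1)[0]
    share = count / len(samples)
    return (label if share >= min_share else None), share
```

For example, `pseudo_label(["P"] * 7 + ["N"] * 3)` returns `("P", 0.7)`, while an even split returns `(None, 0.5)` and would be skipped during fine-tuning.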

[-]Rachel Freedman2y10

I suspect that if you ask the model to reconsider its answer, it would double down even on the incorrect (B-biased) responses. LLMs really like being self-consistent. We haven’t run this experiment, but if you do, let us know the result!

If I understand correctly, your proposed fix is something like supervised finetuning on adversarial examples that trigger the B-bias. We can access the output logits directly (replacing step 1) and the ground-truth answer is provided in the dataset (removing the need for step 2), so this seems relatively doable.
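As an illustration of what reading the logits directly buys you: a softmax restricted to the option tokens gives the full answer distribution in one forward pass, and the negative log-probability of the dataset label is the quantity supervised finetuning would push down. The logit values are assumed to come from the model; this sketch takes them as a plain dict.

```python
import math
from typing import Dict

def option_probs(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over the logits of the option tokens only (e.g. 'A' and 'B')."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def gold_loss(logits: Dict[str, float], gold: str) -> float:
    """Cross-entropy of the ground-truth option under the restricted softmax."""
    return -math.log(option_probs(logits)[gold])
```

Restricting the softmax to the option tokens is a simplification; the model's actual next-token distribution also puts mass on every other token in the vocabulary.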

The main challenges that I see are 1) the computational cost of doing additional optimization (we had to do best-of-N optimization rather than updating the entire model to make our experiments manageable) and 2) it requires finetuning access (which often isn’t available for the latest models). But these challenges aren’t insurmountable, so I wonder why I haven’t seen finetuned “judges” more often.

[-][anonymous]2y*20

> supervised finetuning on adversarial examples that trigger the B-bias.

You're correct, though my thought was, as a general policy, to have something check every answer for any kind of incorrectness or bias. For this situation, assuming you don't have a database of ground truth (it's the trivial case if you do), you need some method to get the most likely ground truth. You could ask multiple larger LLMs and train this one on the "more correct" opinion from its peers.

But that may not be worth doing. I was thinking more of factual questions, legal briefs that reference a case number, claims about a website, and so on. Each of these has a ground truth it is possible to look up, and you need to fine tune the model automatically to produce logits with very high probability of the correct answer.

Also in this general case, the logits are useless, because you aren't looking for the token for the answers A/B/P/N but a series of steps that lead to an answer, and you want both the reasoning in the steps to be valid, and the answer to be correct. The 1 letter answer is a trivial case.

I have found that LLMs change wrong answers pretty often, at least when the LLM is GPT-4. I have not tried this one.

> But these challenges aren’t insurmountable, so I wonder why I haven’t seen finetuned “judges” more often

Also, what are the other costs, besides the fine-tuning itself, in terms of model performance on other tasks? 7B doesn't leave much usable space; everything it learns comes at a cost.

[-]PhilosophicalSoul2y10

Do you think there's something to be said about an LLM feedback vortex? As in, teachers using AIs to check the work of students who also submitted work created by AI. Or judges in law using AIs to filter through counsel's arguments, which were also written by AI?

I feel like your recommendations could be paired nicely with some in-house training videos, and external regulations that limit the degree or percentage of AI involvement. Some kind of threshold or "person limit" like elevators have. How could we measure the "presence" of LLMs across the board in any given scenario?

[-][anonymous]2y20

So the issue you are describing is that LLM-generated information can have errors and hallucinations, and it gets published in various places, and this gets consumed by future models and updates to current models. So now a hallucination has a source, sometimes a credible one, such as a publication in a journal, a wiki, or just some graduate student at a good school whose homework is online.

The fix for this is ultimately to whitelist reliable information and somehow prevent learning or clean out false information.

Reliable information:

Pre December 22 publications by credible authors
Raw data collected by robots or real camera sensors from the physical world.
Raw data published by reliable sources

I am tempted to say "record the hashes to the blockchain" so you can trace when it was generated, and sign the hash with a private key that the camera sensor manufacturer or robot manufacturer attests is real. (But, as with every blockchain application, a central database held by a third party might be better.)

And yes, ideally you ingest none of the trash; ultimately you shouldn't trust anything but real empirical information or models validated against reality.

I am not sure this is a reasonable ask for government regulations. Perhaps NIST could determine that AI systems that didn't use reliable data for their sources can't advertise or present themselves to users as anything but unreliable toys? But why make it illegal? Why isn't Wikipedia illegal? It's unreliable (yet in practice excellent).
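A toy sketch of the hash-and-attest idea, using only the Python standard library. The standard library has no asymmetric signatures, so an HMAC with a shared `device_key` stands in for the manufacturer's private-key signature here; a real scheme would use a public-key signature so that anyone can verify without holding the secret. All names are illustrative.

```python
import hashlib
import hmac
from typing import Dict

def content_hash(data: bytes) -> str:
    """SHA-256 digest of the raw sensor data, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def attest(data: bytes, device_key: bytes) -> Dict[str, str]:
    """Build the record that would be published to a ledger or central database."""
    digest = content_hash(data)
    # HMAC stand-in for a signature over the digest (assumption, see lead-in).
    tag = hmac.new(device_key, digest.encode(), hashlib.sha256).hexdigest()
    return {"sha256": digest, "tag": tag}

def verify(data: bytes, record: Dict[str, str], device_key: bytes) -> bool:
    """Check that the data matches the record and the tag was made with device_key."""
    expected = attest(data, device_key)
    return (
        hmac.compare_digest(expected["sha256"], record["sha256"])
        and hmac.compare_digest(expected["tag"], record["tag"])
    )
```

The record proves only that the key holder saw these exact bytes; tracing when they were generated still depends on where and when the record was published.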
