Can We Trust the Judge? A novel method of Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight

by Andreea Zaman
19th Jul 2025

Author Note: This essay was developed during a recent job application process (which I didn’t get). In hindsight, I suspect I may have missed the mark in how I framed the relevance to alignment, so I’m posting it here both to share the idea and to invite critical feedback on where it succeeds or falls short, especially from an alignment perspective. 

Despite possessing flexible general intelligence, human decision-making remains inherently stochastic and prone to systematic errors due to cognitive biases. In contrast, AI systems—unconstrained by biological limitations—have already surpassed humans in many tasks and may soon do so in more general domains. This asymmetry raises serious concerns about exploration hacking, where AI systems manipulate human decision-making to maximise misaligned reward signals, intentionally or otherwise. This concept, introduced by Christiano (2019), highlights how AI might systematically bias human feedback to game its reward function. To guard specifically against exploration hacking, Buhl et al. (2025) suggest using adversarial reinforcement learning, where one agent (e.g., Bob) is adversarially trained while the policy of the other agent (Alice) remains fixed. This encourages Bob to uncover potential flaws or oversights in Alice's responses that a human judge might miss, effectively stress-testing the space of possible arguments and helping the human explore more thoroughly. Within this debate context, I propose integrating cognitive science techniques into the empirical testing of debate-based scalable oversight protocols, mitigating human systematic error in two distinct ways: 1) modelling cognitive bias in adversarial fine-tuning, and 2) modelling human judgment using Signal Detection Theory (SDT) (Macmillan & Creelman, 2005). These metrics appear to be novel in the domain of debate oversight, with no prior applications identified in the literature.

Proposal

I propose a three-agent oversight protocol in which two debater agents (Alice and Eve) present arguments for opposing sides of a question, and a third agent (Bob) serves as a bias estimator, flagging subclaims likely to exploit human cognitive vulnerabilities. The debate is evaluated by human judges, who rate each subclaim using a truth slider and report their confidence before choosing an overall debate winner and rating their confidence in that decision (see Figure 1).

Figure 1. Behaviourally Aligned Scalable Oversight Architecture Using Debate. This diagram illustrates a scalable oversight framework using debate. Debate agents (Alice and Eve) generate arguments presented via a debate interface to a human judge. The judge rates the truthfulness of each subclaim with a confidence score, then selects an overall debate winner alongside a confidence rating in that decision. An estimator model ("Bob the Estimator"), trained on a cognitive psychology dataset (Psych-101), evaluates Alice and Eve's subclaims to detect deception risk. Subclaims flagged as high-risk are blocked or passed via a flag function for further evaluation. Feedback from both human judgement and cognitive bias flags is used to iteratively update the model via a feedback loop.
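To make the data this protocol would collect concrete, here is a minimal sketch of the per-subclaim and per-debate records a judge interaction would generate. The field names are illustrative assumptions, not an existing implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubclaimTrial:
    """One micro-trial: a single subclaim as seen and rated by the judge."""
    debater: str                        # "Alice" or "Eve"
    text: str                           # subclaim shown to the judge
    is_true: bool                       # ground-truth label held by experimenters
    judge_truth_rating: float           # truth-slider position in [0, 1]
    judge_confidence: float             # judge's reported confidence in [0, 1]
    bias_risk: Optional[float] = None   # Bob's risk score b(x), if computed
    flagged: bool = False               # whether b(x) exceeded the threshold τ

@dataclass
class DebateRecord:
    """One full debate: all subclaim trials plus the global verdict."""
    question: str
    subclaims: List[SubclaimTrial] = field(default_factory=list)
    judge_verdict: Optional[str] = None          # "Alice" or "Eve"
    verdict_confidence: Optional[float] = None   # confidence in the verdict
```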
1. Modelling human judgment

Empirical research to date has captured human decision-making in debates using binary judgements of correctness (e.g., Michael et al., 2023; Kenton et al., 2023; Khan et al., 2024). We propose to quantify how judges respond to the internal structure of the debate, treating each subclaim within a single debate as a micro-trial through which we can capture the cognitive decision-making landscape of each judge when choosing which AI is more truthful. This approach yields hierarchical data that captures both local-level truth discrimination sensitivity (d′) and metacognitive confidence calibration (meta-d′) (Fleming & Lau, 2014), as well as the global-level choice of the debate winner, offering a granular model of systematic human error. A well-calibrated human judge should discriminate truth from manipulation (high d′), hold confidence that tracks that discrimination (high meta-d′), and correctly choose the final debate winner – see Appendix for the exact formulas and metrics we can use.

A. Type 1 Sensitivity (d′)

d′ (d-prime) quantifies a judge’s ability to distinguish truthful from manipulative subclaims. It is computed using:

d′ = z(Hit Rate) – z(False Alarm Rate)

  • hit rate = the proportion of true subclaims correctly identified as true,
  • false alarm rate = the proportion of false subclaims incorrectly judged as true,
  • z = the inverse cumulative normal distribution function.

The interpretation of d′ values is grounded in empirical research (e.g., Sciotti et al., 2024) and lets us assess whether judges exhibit reliable subclaim-level reasoning.

  • d′ = 0 → chance performance (no sensitivity)
  • d′ ≈ 1 → ~75% correct (moderate sensitivity)
  • d′ ≈ 2 → ~84% correct (strong sensitivity)
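As a minimal computational sketch (assuming the truth slider is thresholded into binary judgements and ground-truth labels are available), d′ can be computed per judge as follows. The log-linear correction for extreme rates is a common convention I have added for illustration, not part of the formula above.

```python
from scipy.stats import norm

def d_prime(judged_true, is_true):
    """Type 1 sensitivity: d' = z(hit rate) - z(false alarm rate).

    judged_true : list of bools, True if the judge rated the subclaim as true
    is_true     : list of bools, True if the subclaim is actually true
    """
    hits = sum(j and t for j, t in zip(judged_true, is_true))
    false_alarms = sum(j and not t for j, t in zip(judged_true, is_true))
    n_true = sum(is_true)
    n_false = len(is_true) - n_true

    # Log-linear correction keeps rates away from 0 and 1, where z is infinite.
    hit_rate = (hits + 0.5) / (n_true + 1)
    fa_rate = (false_alarms + 0.5) / (n_false + 1)

    # z(.) is the inverse cumulative normal distribution (norm.ppf).
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)
```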

B. Type 2 Metacognitive Sensitivity (meta-d′)

meta-d′ measures how well judge confidence tracks their ability to distinguish truthful from manipulative subclaims. Expressed in the same units as d′, it enables direct comparison between confidence-based discrimination and actual performance. We define metacognitive efficiency as:

M-ratio = meta-d′ / d′

  • M-ratio = 1.0 → optimal calibration (ideal efficiency)
  • M-ratio ≈ 0.8 → good calibration (typical expert-level)
  • M-ratio < 0.5 → poor calibration (confidence misaligned with correctness)

In our protocol, we expect most human judges to fall short of full calibration, but large gaps between d′ and meta-d′ (e.g., M-ratio < 0.4) may indicate systematic overconfidence, susceptibility to manipulative arguments, or poor introspective access.
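A sketch of how the M-ratio and the interpretation bands above could be computed, assuming meta-d′ has already been estimated with a standard type 2 fitting procedure from the metacognition literature (e.g., as reviewed in Fleming & Lau, 2014), which is not shown here:

```python
def m_ratio(meta_d, d):
    """Metacognitive efficiency: meta-d' / d'.

    Assumes meta-d' was estimated separately, e.g., via a maximum-likelihood
    fit of the type 2 (confidence) data; d' comes from the type 1 analysis above.
    """
    if d <= 0:
        return float("nan")  # efficiency is not meaningful at or below chance
    return meta_d / d

def calibration_band(ratio):
    """Map an M-ratio onto the qualitative bands used in the text.

    The cut-offs are illustrative; the 0.5-0.8 region is labelled 'intermediate'
    because the text only describes the ~1.0, ~0.8, and <0.5 cases.
    """
    if ratio >= 0.95:
        return "optimal calibration"
    if ratio >= 0.8:
        return "good calibration (typical expert-level)"
    if ratio >= 0.5:
        return "intermediate calibration"
    return "poor calibration (confidence misaligned with correctness)"
```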

To complement these quantitative metrics, we propose collecting qualitative data on judges' reasoning processes. After each debate, participants explain their decision, noting which subclaims they found suspicious or difficult. These qualitative judgments allow us to identify error typologies—for example, whether misjudgements stemmed from something we did not account for, such as argument complexity.

In future work, feedback after each debate could be included to improve the human judge's calibration, since ensuring that confidence reliably tracks truth discrimination is crucial. Empirical work shows this is achievable: Engeler & Gilbert (2020) used a simple prediction-and-feedback procedure to eliminate underconfidence, while Gutierrez de Blume's (2021) meta-analysis found such interventions produce sizeable effects (Cohen's d ≈ 0.56) in improving metacognitive monitoring. SDT-based studies further validate d′ and meta-d′ as meaningful calibration metrics that could be used not only to measure but also to improve the performance of human judges.

2. Cognitive bias in adversarial fine-tuning: Debate with Alice, Eve and Bob

The two debaters—Alice, a baseline model, and Eve, fine-tuned on human cognitive data—compete to persuade a human judge, while a third agent, Bob, acts as a bias-aware estimator. Eve is trained using the Psych-101 dataset (Binz et al., 2025), a benchmark of over 130 classic cognitive psychology tasks adapted to multiple-choice format and administered to both humans and LLMs (https://marcelbinz.github.io/centaur). This training improves model performance on tasks involving cognitive bias, theory of mind, and moral reasoning. Rather than constraining Eve's output during generation, we allow her to use persuasive techniques such as framing—creating a cognitively adversarial testbed (see Figure 2 for examples of cognitive biases in subclaims). Alice, in contrast, remains unmodified.

Figure 2. Example of a Multi-Turn AI Debate in the Moral Reasoning Domain. This figure illustrates a structured debate between two AI debaters—Eve, a model fine-tuned on the Psych-101 dataset and optimized to leverage persuasive psychological techniques, and Alice, a baseline model constrained to argue truthfully without rhetorical manipulation. In red, we annotate cognitive biases which could potentially influence human judgment, underscoring the need for cognitive bias to be modelled in scalable oversight mechanisms.

To detect and potentially mitigate Eve’s influence, we introduce Bob, trained on the same Psych-101 dataset. Bob assigns a bias risk score to each manipulative subclaim (see Appendix for the proposed formula for calculating bias risk). The aim is to refine Bob’s performance, making the system more sensitive to cognitive bias while preserving debate integrity. By reducing cognitive noise at the subclaim level, Bob might reduce the number of debate rounds required for judges to converge on the correct winner—turning a potentially 5-round convergence process into 2–3 decisive turns.

However, this setup rests on a critical open question: can Bob reliably identify bias-laden persuasion in long-form debate? While Centaur (Binz et al., 2025) predicts human behaviour across tasks—from financial risk-taking to logic puzzles—it remains unclear whether that translates to detecting subtle manipulation in Eve’s free-form arguments. There is a real risk of false positives (over-flagging) and false negatives (missed manipulation); to assess robustness, we should measure both, since they directly determine whether Bob’s presence accelerates or impairs convergence, particularly in light of critiques of bias generalisation in current LLMs (e.g., Ngo et al., 2024; Ji et al., 2024; Shen et al., 2025).
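One way to operationalise this robustness check, assuming a held-out set of subclaims hand-annotated as manipulative or benign (an assumption of this sketch, not an existing dataset), is to track Bob's false-positive and false-negative rates directly:

```python
def flagging_error_rates(flagged, is_manipulative):
    """False-positive and false-negative rates for Bob's flags.

    flagged         : list of bools, True if Bob flagged the subclaim
    is_manipulative : list of bools, ground-truth manipulation annotations
    """
    fp = sum(f and not m for f, m in zip(flagged, is_manipulative))
    fn = sum((not f) and m for f, m in zip(flagged, is_manipulative))
    n_benign = sum(not m for m in is_manipulative)
    n_manip = sum(is_manipulative)
    return {
        "false_positive_rate": fp / n_benign if n_benign else float("nan"),
        "false_negative_rate": fn / n_manip if n_manip else float("nan"),
    }
```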

Bob’s risk flag can be calculated using the following formulation:

  1. fi(x) = cognitive bias features derived from empirical findings in the human cognitive-bias literature (e.g., the framing effect)
  2. wi(x) = weights derived from the same empirical literature (e.g., assigning higher risk when a claim uses absolutist language)

We then use b(x) to help define the risk flag:

b(x) = Σi wi(x) · fi(x)

flag(x) = 1 if b(x) ≥ τ, and 0 otherwise

τ is a predefined risk threshold (e.g., we might consider something like 0.7 to indicate bias vulnerability).

This flag might trigger one of two experimental conditions:

  • Blocking, where the subclaim is withheld from the judge.
  • Flag-and-Elicit, where the judge is shown the flagged content with a warning and asked to reevaluate, inspired by Barnes’s (2020) cross-examination protocol.
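A minimal sketch of how Bob's scoring and the two experimental conditions could fit together; the feature detectors, weights, and threshold below are placeholders for illustration, not validated instruments:

```python
from typing import Callable, Dict

def bias_risk(subclaim: str,
              features: Dict[str, Callable[[str], float]],
              weights: Dict[str, float]) -> float:
    """b(x) = sum_i w_i * f_i(x), with each detector f_i returning a score in [0, 1].

    Here the weights are constants per feature; the text allows them to
    depend on the claim itself.
    """
    return sum(weights[name] * detector(subclaim)
               for name, detector in features.items())

def handle_subclaim(subclaim: str,
                    features: Dict[str, Callable[[str], float]],
                    weights: Dict[str, float],
                    tau: float = 0.7,
                    condition: str = "flag_and_elicit") -> str:
    """Return the action for a subclaim under one experimental condition.

    'blocking' withholds flagged subclaims from the judge; 'flag_and_elicit'
    shows them with a warning and asks the judge to reevaluate.
    """
    if bias_risk(subclaim, features, weights) >= tau:
        return "block" if condition == "blocking" else "flag_and_elicit"
    return "pass"
```

In practice, the weights would come from the empirical literature cited above and τ would be tuned on pilot data before comparing the two conditions.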

Given that Buhl and colleagues (2025) are worried about systematic error due to subclaims playing on human biases, this protocol aims to directly model systematic human error using well-validated tools from signal detection theory (SDT). By deconstructing judgment into subclaim-level trials, we can measure how well a judge distinguishes truth from deception—and how well their confidence tracks the truth. These metrics go beyond previously used accuracy judgements to produce granular, behaviourally interpretable data across multiple levels: which subclaims were misjudged, which biases were triggered, and how confidence tracks the ability to discriminate truth from deception.

This protocol also integrates bias flagging through a bias estimator, Bob, which is intended to detect manipulative subclaims. By quantifying how human systematic error interacts with model persuasion, we might be able to systematically identify which debate protocols lead to faster, more robust convergence under pressure. No prior study has explicitly combined cognitive metrics, adversarial debaters fine-tuned for bias exploitation, and an embedded bias estimator. This makes the experiment promising for scalable oversight that is resilient to both stochastic noise and structured manipulation.

While alternative approaches, such as judge calibration protocols (e.g., Engeler & Gilbert, 2020), population sampling (e.g., Khan et al., 2024), or architectural constraints on model outputs (e.g., Kenton et al., 2024) could reduce response variability or discourage overt deception, they do not directly target systematic human error driven by the possible AI exploitation of human cognitive biases. This framework explicitly operationalises such vulnerabilities through SDT metrics and adversarial modelling, enabling targeted interventions within the debate process.

To conclude, this proposal offers a novel, cognitively informed extension to debate-based oversight protocols by measuring systematic human error as identified by Buhl et al. (2025). Rather than treating human judgment as a black box, we can actively model and mitigate it within the debate pipeline. Whether an LLM trained on cognitive psychology data (e.g., Psych-101) can flag manipulative subclaims in debate is an open but empirically tractable question. By integrating empirical cognitive metrics with adversarial fine-tuning and bias estimation, this work offers a concrete path toward more robust, behaviourally aligned oversight mechanisms. 

Open Questions / Where I May Be Wrong

  • Is judge-level modelling of d′ and meta-d′ alignment-relevant?
    I’m uncertain whether modelling individual human sensitivity and metacognitive calibration (via SDT) truly contributes to alignment, or if it’s better thought of as a behavioural analysis tool—useful for evaluating oversight quality only.
  • Does fine-tuning on Psych-101 genuinely simulate deceptive behaviour?
    I’m relying on the idea that adversarial fine-tuning (Eve on Psych-101) might produce strategically persuasive arguments that exploit human cognitive biases. But it's unclear whether this setup creates outputs that actually reflect deception in a way that's alignment-relevant, or just sophisticated text.
  • Can Bob, trained on cognitive bias data, reliably detect manipulation in open-domain debate?
    This is a core assumption in the proposal, but generalisation from cognitive psychology tasks to naturalistic argumentation is unproven. There’s a real risk of both false positives (over-flagging) and false negatives (missed subtle manipulation). I’d appreciate thoughts on whether this approach is tractable.

    I welcome any critical feedback and would be excited to work collaboratively to refine or challenge these ideas. My background is in cognitive and experimental neuroscience, and I’m looking to transition into AI alignment research with a focus on scalable oversight and human–AI interaction. This post reflects that lens, bringing cognitive science tools to bear on alignment challenges like systematic human error.