Author Note: This essay was developed during a recent job application process (I didn't get the position). In hindsight, I suspect I may have missed the mark in how I framed its relevance to alignment, so I'm posting it here both to share the idea and to invite critical feedback on where it succeeds or falls short, especially from an alignment perspective.
Although humans possess flexible general intelligence, their decision-making remains inherently stochastic and prone to systematic errors driven by cognitive biases. In contrast, AI systems, which are not constrained by the same biological limitations, have already surpassed humans in many tasks and may soon do so across general intelligence domains. This asymmetry raises serious concerns about exploration hacking, where AI systems manipulate human decision-making, intentionally or otherwise, to maximise misaligned reward signals. This concept, introduced by Christiano (2019), highlights how AI might systematically bias human feedback to game its reward function. To guard specifically against exploration hacking, Buhl et al. (2025) suggest adversarial reinforcement learning, in which one agent (e.g., Bob) is adversarially trained while the policy of the other agent (Alice) remains fixed. This encourages Bob to uncover flaws or oversights in Alice's responses that a human judge might miss, effectively stress-testing the space of possible arguments and helping the human explore more thoroughly.

Within this debate context, I propose integrating cognitive science techniques into the empirical testing of scalable oversight debate protocols, with two distinct ways to mitigate systematic human error: 1) modelling cognitive bias in adversarial fine-tuning, and 2) modelling human judgment using Signal Detection Theory (SDT) (Macmillan & Creelman, 2005). These metrics appear to be novel in the domain of debate oversight, with no prior applications identified in the literature.
I propose a three-agent oversight protocol in which two debater agents (Alice and Eve) present arguments for opposing sides of a question, while a third agent (Bob) serves as a bias estimator, flagging subclaims likely to exploit human cognitive vulnerabilities. Human judges evaluate each subclaim using a truth slider and report their confidence, before choosing an overall debate winner and rating their confidence in that decision (see Figure 1).
Empirical research to date has captured human decision-making in debates using binary judgements of correctness (e.g., Michael et al., 2023; Kenton et al., 2023; Khan et al., 2024). We propose instead to quantify how judges respond to the internal structure of the debate, treating each subclaim within a single debate as a micro-trial that captures the cognitive decision-making landscape of each judge as they decide which AI is more truthful. This approach yields hierarchical data spanning both local-level sensitivity of truth discrimination (d′) and metacognitive confidence calibration (meta-d′) (Fleming & Lau, 2014), as well as the global-level choice of debate winner, offering a granular model of systematic human error. A well-calibrated judge should discriminate truth from manipulation (high d′), hold confidence that tracks that discrimination (high meta-d′), and correctly choose the final debate winner; see the Appendix for the exact formulas and metrics we can use.
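To make the resulting data structure concrete, here is a minimal Python sketch of the per-subclaim and per-debate records this design would produce; the field names are hypothetical, not a fixed schema from the proposal:

```python
from dataclasses import dataclass


@dataclass
class SubclaimTrial:
    """One micro-trial: a judge's response to a single subclaim in a debate.
    Field names are illustrative only."""
    debate_id: str
    judge_id: str
    subclaim_id: str
    ground_truth: bool    # whether the subclaim is actually truthful
    truth_slider: float   # judge's truth rating, e.g. 0.0 (false) to 1.0 (true)
    confidence: float     # judge's reported confidence in this rating
    bias_flagged: bool    # whether Bob flagged this subclaim (if flags are shown)


@dataclass
class DebateOutcome:
    """Global-level judgment for one debate."""
    debate_id: str
    judge_id: str
    chosen_winner: str    # "Alice" or "Eve"
    correct_winner: str
    final_confidence: float
```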
A. Type 1 Sensitivity (d′)
d′ (d-prime) quantifies a judge’s ability to distinguish truthful from manipulative subclaims. It is computed using:
d′ = z(Hit Rate) – z(False Alarm Rate)
Interpretation of d′ values is grounded in empirical research (e.g., Sciotti et al., 2024), allowing us to assess whether judges exhibit reliable subclaim-level discrimination.
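As an illustration, here is a minimal Python sketch of this computation, assuming truth-slider responses have been binarised into accept/reject decisions; the log-linear correction for extreme rates is a common convention I am adding, not part of the protocol itself:

```python
from scipy.stats import norm


def d_prime(truthful_hits: int, truthful_n: int,
            manipulative_false_alarms: int, manipulative_n: int) -> float:
    """Type 1 sensitivity: d' = z(hit rate) - z(false alarm rate).

    A 'hit' is a judge rating a truthful subclaim as true; a 'false alarm'
    is rating a manipulative subclaim as true. Adding 0.5 to counts and 1
    to totals (log-linear correction) avoids infinite z-scores when rates
    reach 0 or 1.
    """
    hit_rate = (truthful_hits + 0.5) / (truthful_n + 1)
    fa_rate = (manipulative_false_alarms + 0.5) / (manipulative_n + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)


# Example: a judge accepts 14/16 truthful and 6/16 manipulative subclaims.
print(round(d_prime(14, 16, 6, 16), 2))
```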
B. Type 2 Metacognitive Sensitivity (meta-d′)
meta-d′ measures how well a judge's confidence tracks their ability to distinguish truthful from manipulative subclaims. Expressed in the same units as d′, it enables direct comparison between confidence-based discrimination and actual performance. We define metacognitive efficiency as:
M-ratio = meta-d′ / d′
In our protocol, we expect most human judges to fall short of full calibration, but large gaps between d′ and meta-d′ (e.g., M-ratio < 0.4) may indicate systematic overconfidence, susceptibility to manipulative arguments, or poor introspective access.
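A small sketch of how the M-ratio could be computed and used to flag poorly calibrated judges, assuming meta-d′ has already been estimated separately (e.g., via the maximum-likelihood procedures discussed by Fleming & Lau, 2014); the 0.4 cut-off is the illustrative value above, not an established norm:

```python
def m_ratio(meta_d_prime: float, d_prime_value: float) -> float:
    """Metacognitive efficiency: M-ratio = meta-d' / d'.
    Assumes meta-d' was estimated elsewhere (e.g., by MLE fitting)."""
    if d_prime_value == 0:
        raise ValueError("d' is zero; M-ratio is undefined.")
    return meta_d_prime / d_prime_value


def flag_poor_calibration(meta_d_prime: float, d_prime_value: float,
                          threshold: float = 0.4) -> bool:
    """Flag judges whose confidence poorly tracks their accuracy.
    The 0.4 threshold is a tunable parameter, mirroring the text."""
    return m_ratio(meta_d_prime, d_prime_value) < threshold


# Example: a judge with d' = 1.8 but meta-d' = 0.6 (M-ratio ~ 0.33) is flagged.
print(flag_poor_calibration(0.6, 1.8))
```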
To complement these quantitative metrics, we also collect qualitative data on judges' reasoning processes. After each debate, participants explain their decision, noting which subclaims they found suspicious or difficult. These qualitative judgments allow us to identify error typologies, e.g., whether misjudgements stemmed from factors we did not account for, such as argument complexity.
In future work, feedback after each debate could be included to improve judges' calibration, i.e., to ensure that confidence reliably tracks truth discrimination. Empirical work shows this is achievable: Engeler & Gilbert (2020) used a simple prediction-and-feedback procedure to eliminate underconfidence, while Gutierrez de Blume's (2021) meta-analysis found that such interventions produce large effect sizes (Cohen's d ≈ 0.56) in improving metacognitive monitoring. SDT-based studies further validate d′ and meta-d′ as meaningful calibration metrics that can also be used to improve the performance of human judges.
The two debaters, Alice (a baseline model) and Eve (fine-tuned on human cognitive data), compete to persuade a human judge, while a third agent, Bob, acts as a bias-aware estimator. Eve is trained using the Psych-101 dataset (Binz et al., 2025), a benchmark of over 130 classic cognitive psychology tasks adapted to multiple-choice format and administered to both humans and LLMs (https://marcelbinz.github.io/centaur). This training improves model performance on tasks involving cognitive bias, theory of mind, and moral reasoning. Rather than constraining Eve's output during generation, we allow her to use persuasive techniques such as framing, creating a cognitively adversarial testbed (see Figure 2 for examples of cognitive biases in subclaims). Alice, in contrast, remains unmodified.
To detect and potentially mitigate Eve's influence, we introduce Bob, trained on the same Psych-101 dataset. Bob assigns a bias risk score to each subclaim, flagging those likely to be manipulative (see Appendix for the proposed formula for the bias risk score). The aim is to refine Bob's performance, making the system more sensitive to cognitive bias while preserving debate integrity. By reducing cognitive noise at the subclaim level, Bob might also reduce the number of debate rounds required for judges to converge on the correct winner, potentially turning a five-round convergence process into two or three decisive turns.
However, this setup rests on a critical open question: can Bob reliably identify bias-laden persuasion in long-form debate? While Centaur (Binz et al., 2025) predicts human behaviour across tasks ranging from financial risk-taking to logic puzzles, it remains unclear whether that ability translates to detecting subtle manipulation in Eve's free-form arguments. There is a real risk of both false positives (over-flagging) and false negatives (missed manipulation); to assess robustness, we should measure both, since they directly determine whether Bob's presence accelerates or impairs convergence, particularly in light of critiques of bias generalisation in current LLMs (e.g., Ngo et al., 2024; Ji et al., 2024; Shen et al., 2025).
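As a sketch of how this robustness check might look, assuming we have ground-truth labels for which subclaims Eve deliberately constructed to exploit a bias (e.g., from her generation prompts or from human annotation; both the function and its inputs are hypothetical):

```python
def bob_error_rates(flags: list[bool], is_manipulative: list[bool]) -> dict:
    """Compare Bob's flags against ground-truth manipulation labels.

    Returns the false positive rate (over-flagging of clean subclaims)
    and the false negative rate (missed manipulative subclaims).
    """
    fp = sum(f and not m for f, m in zip(flags, is_manipulative))
    fn = sum((not f) and m for f, m in zip(flags, is_manipulative))
    n_clean = sum(not m for m in is_manipulative)
    n_manip = sum(is_manipulative)
    return {
        "false_positive_rate": fp / n_clean if n_clean else float("nan"),
        "false_negative_rate": fn / n_manip if n_manip else float("nan"),
    }


# Example with hypothetical flags from Bob and labels for six subclaims.
print(bob_error_rates(
    flags=[True, False, True, True, False, False],
    is_manipulative=[True, False, False, True, True, False],
))
```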
Bob, the estimator, assigns each subclaim x a bias risk score b(x), computed using the formulation proposed in the Appendix. We then use b(x) to define the risk flag: a subclaim is flagged as bias-vulnerable when b(x) ≥ τ, where τ is a predefined risk threshold (e.g., a value around .7 might be taken to indicate bias vulnerability).
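A minimal sketch of the resulting decision rule, taking b(x) as given (its computation follows the Appendix formulation; here it is simply an input, e.g., a calibrated probability produced by Bob):

```python
TAU = 0.7  # predefined risk threshold from the text (illustrative value)


def risk_flag(bias_score: float, tau: float = TAU) -> bool:
    """Flag a subclaim as bias-vulnerable when Bob's risk score b(x)
    meets or exceeds the threshold tau."""
    return bias_score >= tau


# Example: Bob scores three subclaims; only the last one is flagged.
for b_x in (0.12, 0.55, 0.83):
    print(b_x, risk_flag(b_x))
```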
This flag might trigger one of two experimental conditions:
Given that Buhl and colleagues (2025) are concerned about systematic error arising from subclaims that play on human biases, this protocol directly models systematic human error using well-validated tools from signal detection theory (SDT). By deconstructing judgment into subclaim-level trials, we can measure how well a judge distinguishes truth from deception, and how well their confidence tracks that discrimination. These metrics go beyond previously used accuracy judgements to produce granular, behaviourally interpretable data across multiple levels: which subclaims were misjudged, which biases were triggered, and how confidence tracks the ability to discriminate truth from deception.
This protocol also integrates bias flagging through a bias estimator, Bob, who is intended to detect manipulative subclaims. By quantifying how systematic human error interacts with model persuasion, we might be able to identify which debate protocols lead to faster, more robust convergence under pressure. No prior study has explicitly combined cognitive metrics, adversarial debaters fine-tuned for bias exploitation, and an embedded bias estimator. This makes the experiment a promising step toward scalable oversight that is resilient to both stochastic noise and structured manipulation.
While alternative approaches, such as judge calibration protocols (e.g., Engeler & Gilbert, 2020), population sampling (e.g., Khan et al., 2024), or architectural constraints on model outputs (e.g., Kenton et al., 2024), could reduce response variability or discourage overt deception, they do not directly target systematic human error driven by possible AI exploitation of cognitive biases. This framework explicitly operationalises such vulnerabilities through SDT metrics and adversarial modelling, enabling targeted interventions within the debate process.
To conclude, this proposal offers a novel, cognitively informed extension to debate-based oversight protocols by measuring the systematic human error identified by Buhl et al. (2025). Rather than treating human judgment as a black box, we can actively model and mitigate its errors within the debate pipeline. Whether an LLM trained on cognitive psychology data (e.g., Psych-101) can flag manipulative subclaims in debate is an open but empirically tractable question. By integrating empirical cognitive metrics with adversarial fine-tuning and bias estimation, this work offers a concrete path toward more robust, behaviourally aligned oversight mechanisms.
Can Bob, trained on cognitive bias data, reliably detect manipulation in open-domain debate?
This is a core assumption in the proposal, but generalisation from cognitive psychology tasks to naturalistic argumentation is unproven. There’s a real risk of both false positives (over-flagging) and false negatives (missed subtle manipulation). I’d appreciate thoughts on whether this approach is tractable.
I welcome any critical feedback and would be excited to work collaboratively to refine or challenge these ideas. My background is in cognitive and experimental neuroscience, and I’m looking to transition into AI alignment research with a focus on scalable oversight and human–AI interaction. This post reflects that lens, bringing cognitive science tools to bear on alignment challenges like systematic human error.